Building a Telugu Corpus with Telugu Wikipedia data, second attempt

The earlier attempt at processing the Telugu Wikipedia dump gave me not only a corpus but also a ready reckoner: the Wikipedia articles themselves, at hand.

I could access the Wikipedia articles as one text file per article, which let me run various string-processing commands on the per-article files and surfaced many new insights.

For example, many disambiguation pages, otherwise difficult to locate, could be found easily.
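To make this concrete, here is a minimal sketch of such a scan, assuming one UTF-8 text file per article. The directory name and the Telugu disambiguation marker are my assumptions for illustration, not details from the original workflow; substitute whatever marker your dump actually uses.

```python
import os

ARTICLES_DIR = "tewiki_articles"  # hypothetical: one UTF-8 file per article
MARKER = "అయోమయ నివృత్తి"          # assumed Telugu disambiguation marker

# Scan every per-article file and report those containing the marker.
for name in os.listdir(ARTICLES_DIR):
    path = os.path.join(ARTICLES_DIR, name)
    with open(path, encoding="utf-8") as f:
        if MARKER in f.read():
            print(name)  # candidate disambiguation page
```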

Today, I tried a method described at KDnuggets. The procedure is simple: a set of two scripts is given. The first script takes the XML dump and converts it into a single text file of space-separated words from the Telugu Wikipedia dump. There is no taxonomy and no distinction between articles; the entire text of Wikipedia is dumped into one text file.
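The actual scripts are at the linked post; as a rough approximation of what that first step does, here is a minimal sketch using gensim's WikiCorpus, which strips the wiki markup and yields each article as a list of tokens. The file names are placeholders, and this is my sketch of the idea, not the exact KDnuggets script.

```python
from gensim.corpora import WikiCorpus

# Placeholder paths: the real dump would be something like
# tewiki-latest-pages-articles.xml.bz2 from dumps.wikimedia.org.
dump_path = "tewiki-latest-pages-articles.xml.bz2"
out_path = "tewiki_text.txt"

# Passing dictionary={} skips building a vocabulary, since we only
# want the plain text, not a bag-of-words corpus.
wiki = WikiCorpus(dump_path, dictionary={})

# Write one article per line, words separated by spaces.
with open(out_path, "w", encoding="utf-8") as out:
    for i, tokens in enumerate(wiki.get_texts(), start=1):
        out.write(" ".join(tokens) + "\n")
        if i % 10000 == 0:
            print(f"{i} articles processed")
```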

The procedure can be checked here.


