Building a Corpus from Telugu Wikipedia: Day 2

OK, so day 2 came a long time after day 1.

I used this post from After the Deadline to build the corpus.

First, convert the Wikipedia dump XML into individual article files.
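Here is a rough sketch of what that splitting step could look like in Python. This is not the After the Deadline method, just one way to do it with the standard library. It assumes the dump follows the usual MediaWiki export layout (each `<page>` holds a `<title>` and a `<text>` inside `<revision>`); the function name `split_dump` and the numbered output file names are my own choices for illustration.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_dump(dump_path, out_dir):
    """Write the title and wikitext of each <page> to its own file.

    Returns the number of article files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse reads the dump incrementally instead of loading it at once.
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        # Skip pages with a missing title or empty text.
        if title and text:
            count += 1
            # Number the files rather than naming them after titles,
            # which may contain characters not allowed in file names.
            path = os.path.join(out_dir, f"{count:07d}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(title + "\n\n" + text)
        elem.clear()  # drop the page's contents once it has been written
    return count
```

You would call it as something like `split_dump("tewiki-pages-articles.xml", "articles")`, where both names are placeholders. Note that the output is still raw wiki markup, and the sketch does not filter out non-article pages such as templates or talk pages, so some cleanup is needed before the text is usable as a corpus.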

Then use the various available tools to work with the corpus text files.