Telugu Wikipedia stands as the single largest source of information in Telugu, with 66,691 articles at the time of writing. My aim is to create a plain-text word corpus from Telugu Wikipedia.
The first step is to get the data out of Telugu Wikipedia.
There are multiple ways to do that. The simplest is to use a tool like WebHTTrack and scrape the entire website as HTML. The advantage is that it makes a copy of Wikipedia as we see it, without the internal wiki markup; other methods often skip text that appears in tables and templates, whereas HTML-rendered pages keep such text. But processing HTML would be too cumbersome. You also cannot be selective about the articles: everything gets downloaded whether you want it or not, and orphan and dead-end pages, which sit outside the normal link graph, are likely to be missed by the crawl altogether.
One can use a pywikibot script to copy each page's content. Here you can be selective about which pages, categories, or articles you scrape, but it takes too much time, since each page is fetched through the API individually. Also, several of the text database tables behind Wikipedia articles are not accessible to pywikibot.
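A rough sketch of the pywikibot approach (the page title and counts here are just placeholders, and it assumes pywikibot is installed and configured with a user-config.py):

```python
import pywikibot

site = pywikibot.Site('te', 'wikipedia')  # Telugu Wikipedia

# Fetch the wikitext of a single page by title.
page = pywikibot.Page(site, 'భారతదేశం')
print(page.text)

# Walk the main (article) namespace -- selective, but slow,
# because every page is a separate API round trip.
for p in site.allpages(namespace=0, total=5):
    print(p.title())
```

This is exactly why the approach does not scale: a full wiki means tens of thousands of API round trips.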
AWB (AutoWikiBrowser) can be set up to load and save specific pages as well, but its compatibility with UTF-8 and complex scripts is unexplored, and since I do not use AWB, much about it is unknown to me.
In any case, Wikipedia and the engineering team at the Wikimedia Foundation seem to have heard us. All the pages, text, articles, and whatever else one may need are extracted from time to time and published as what are called dumps.
The dump listing would look something like the one below (I will be using the tewiki dumps):
1. Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream.
   - tewiki-20170320-pages-articles-multistream.xml.bz2 (92.6 MB)
   - tewiki-20170320-pages-articles-multistream-index.txt.bz2 (1.1 MB)
   - tewiki-20170320-pages-meta-history.xml.bz2
2. History of Flow pages in XML format.
   - tewiki-20170320-flowhistory.xml.bz2 (12 KB)
3. Content of Flow pages in XML format.
   - tewiki-20170320-flow.xml.bz2 (10 KB)
4. Log of actions performed on pages and users.
   - tewiki-20170320-pages-logging.xml.gz (7.1 MB)
5. All pages, current versions only.
   - tewiki-20170320-pages-meta-current.xml.bz2 (105.8 MB)
6. Articles, templates, media/file descriptions, and primary meta-pages.
   - tewiki-20170320-pages-articles.xml.bz2 (88.6 MB)
7. First-pass page XML data dumps; these files contain no page text, only revision metadata.
   - tewiki-20170320-stub-meta-history.xml.gz (127.5 MB)
   - tewiki-20170320-stub-meta-current.xml.gz (16.0 MB)
   - tewiki-20170320-stub-articles.xml.gz (11.1 MB)
8. Extracted page abstracts for Yahoo.
   - tewiki-20170320-abstract.xml (150.8 MB)
9. List of all page titles.
   - tewiki-20170320-all-titles.gz (1.4 MB)
10. List of page titles in the main namespace.
    - tewiki-20170320-all-titles-in-ns0.gz (619 KB)
11. Namespaces, namespace aliases, magic words.
    - tewiki-20170320-siteinfo-namespaces.json (18 KB)
12. Wiki page-to-page link records.
    - tewiki-20170320-pagelinks.sql.gz (24.6 MB)
13. List of pages' geographical coordinates.
    - tewiki-20170320-geo_tags.sql.gz (106 KB)
14. Name/value pairs for pages.
    - tewiki-20170320-page_props.sql.gz (1.3 MB)
15. List of annotations (tags) for revisions and log entries.
    - tewiki-20170320-change_tag.sql.gz (212 KB)
16. Wiki category membership link records.
    - tewiki-20170320-categorylinks.sql.gz (6.2 MB)
17. Wiki external URL link records.
    - tewiki-20170320-externallinks.sql.gz (8.3 MB)
18. Interwiki link tracking records.
    - tewiki-20170320-iwlinks.sql.gz (859 KB)
19. Nonexistent pages that have been protected.
    - tewiki-20170320-protected_titles.sql.gz (1 KB)
20. Wiki template inclusion link records.
    - tewiki-20170320-templatelinks.sql.gz (5.4 MB)
21. Redirect list.
    - tewiki-20170320-redirect.sql.gz (340 KB)
22. A few statistics such as the page count.
    - tewiki-20170320-site_stats.sql.gz (801 bytes)
23. User group assignments.
    - tewiki-20170320-user_groups.sql.gz (1 KB)
24. SiteMatrix information from meta.wikimedia.org, provided as a table.
    - tewiki-20170320-sites.sql.gz (19 KB)
25. Wiki media/file usage records.
    - tewiki-20170320-imagelinks.sql.gz (2.0 MB)
26. Category information.
    - tewiki-20170320-category.sql.gz (359 KB)
27. Base per-page data (id, title, old restrictions, etc.).
    - tewiki-20170320-page.sql.gz (7.0 MB)
28. Newer per-page restrictions table.
    - tewiki-20170320-page_restrictions.sql.gz (3 KB)
29. Tracks which pages use which Wikidata items or properties and what aspect (e.g. item label) is used.
    - tewiki-20170320-wbc_entity_usage.sql.gz (929 KB)
30. Metadata on current versions of uploaded media/files.
    - tewiki-20170320-image.sql.gz (1.6 MB)
31. Wiki interlanguage link records.
    - tewiki-20170320-langlinks.sql.gz (9.2 MB)
I will be using item 5 above, tewiki-20170320-pages-meta-current.xml.bz2, which contains the current versions of all pages. Pages on Wikipedia can be article pages, article talk pages, user pages, user talk pages, Wikipedia policy pages and their talk pages, templates, categories, MediaWiki script pages, and all the corresponding talk pages.
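These page types correspond to namespaces, and item 11 above (siteinfo-namespaces.json) lists them all. As a side note, here is a small sketch that prints them, assuming the file follows the usual MediaWiki API siteinfo response layout ({"query": {"namespaces": {...}}}):

```python
import json

with open('tewiki-20170320-siteinfo-namespaces.json', encoding='utf-8') as f:
    info = json.load(f)

# '*' holds the localized namespace name; namespace 0 is the main
# (article) namespace, the one we want for a text corpus.
for ns_id, ns in sorted(info['query']['namespaces'].items(),
                        key=lambda kv: int(kv[0])):
    print(ns_id, ns.get('*') or '(main)')
```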
I am using Ubuntu 16.04 LTS.
I used wget to download the compressed dump with the following command:
wget https://dumps.wikimedia.org/tewiki/20170320/tewiki-20170320-pages-meta-current.xml.bz2
The URL indicates that the dump is of Telugu Wikipedia (tewiki), taken on 20 March 2017.
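Before any processing, a quick sanity check: the dump can be read as a stream, without decompressing it to disk or loading the whole XML into memory. Here is a minimal sketch, assuming the dump uses the export-0.10 XML schema (check the xmlns attribute on the <mediawiki> root element of your dump and adjust if it differs):

```python
import bz2
import xml.etree.ElementTree as ET

DUMP = 'tewiki-20170320-pages-meta-current.xml.bz2'
NS = '{http://www.mediawiki.org/xml/export-0.10/}'

count = 0
with bz2.open(DUMP, 'rb') as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + 'page':
            # Keep only main-namespace pages, i.e. actual articles.
            if elem.findtext(NS + 'ns') == '0':
                count += 1
            elem.clear()  # discard the subtree to keep memory flat

print(count, 'article pages')
```

This confirms the download is intact and gives an article count to compare against the 66,691 figure mentioned at the top.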