UPSC Mains Exam paper analyses

Here I will share my analyses of the UPSC Mains 2017 exam papers.

Keep watching…

Filed under Random

Building Corpus from Telugu Wikipedia Day 1

Telugu Wikipedia stands as the single largest source of information in Telugu, with about 66,691 articles as of this writing. My aim is to create a plain-text word corpus from Telugu Wikipedia.

The first step is to get the data from Telugu Wikipedia.

There are multiple ways to do that. The simplest is to use something like WebHTTrack and scrape the entire website in HTML format. The advantage is that it makes a copy of Wikipedia as we see it, without the internal wiki markup. Other methods have often skipped text that appears in tables and templates, whereas rendered HTML pages keep that text. But processing HTML is cumbersome. Also, you cannot be selective about the articles: all of them may be downloaded whether you want them or not, and dead-end pages and orphan pages are likely to be missed altogether.
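To give a feel for what processing the HTML involves, here is a minimal sketch using only Python's standard `html.parser` module. It is an illustration, not the method used in this series: the class name `TextExtractor`, the function `html_to_text` and the sample markup are all made up for the example. It shows that text inside a table survives, while script content is dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from rendered HTML, skipping script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page fragment: a paragraph, a table cell and a script.
sample = (
    "<html><body><p>తెలుగు వికీపీడియా</p>"
    "<table><tr><td>పట్టిక</td></tr></table>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A real Wikipedia page would also carry navigation menus, edit links and footers, all of which would need to be filtered out, and that is where this approach becomes cumbersome.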

One can also use a pywikibot script to copy each page’s content. Here you can be selective about which pages, categories or articles you scrape, but it takes too much time. Also, the database tables holding the text of several Wikipedia articles are inaccessible to pywikibot.

AWB (AutoWikiBrowser) can also be set up to load and download specific pages, but its compatibility with UTF-8 and complex scripts is unexplored, and since I do not use AWB, I know little about it.

In any case, the engineering team at the Wikimedia Foundation seems to have heard us. All the pages, text, articles and whatever else one may need are extracted from time to time and stored in what are called dumps.

The dump listing looks something like the one given below (I will be using the tewiki dumps):

  1. Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream

    • tewiki-20170320-pages-articles-multistream.xml.bz2 92.6 MB
    • tewiki-20170320-pages-articles-multistream-index.txt.bz2 1.1 MB
    • tewiki-20170320-pages-meta-history.xml.bz2
  2. History content of Flow pages in XML format.

    • tewiki-20170320-flowhistory.xml.bz2 12 KB
  3. Content of Flow pages in XML format.

    • tewiki-20170320-flow.xml.bz2 10 KB
  4. Log events to all pages and users. This contains the log of actions performed on pages and users.

    • tewiki-20170320-pages-logging.xml.gz 7.1 MB
  5. All pages, current versions only.

    • tewiki-20170320-pages-meta-current.xml.bz2 105.8 MB
  6. Articles, templates, media/file descriptions, and primary meta-pages.

    • tewiki-20170320-pages-articles.xml.bz2 88.6 MB
  7. First-pass for page XML data dumps. These files contain no page text, only revision metadata.

    • tewiki-20170320-stub-meta-history.xml.gz 127.5 MB
    • tewiki-20170320-stub-meta-current.xml.gz 16.0 MB
    • tewiki-20170320-stub-articles.xml.gz 11.1 MB
  8. Extracted page abstracts for Yahoo.

    • tewiki-20170320-abstract.xml 150.8 MB
  9. List of all page titles

    • tewiki-20170320-all-titles.gz 1.4 MB
  10. List of page titles in main namespace

    • tewiki-20170320-all-titles-in-ns0.gz 619 KB
  11. Namespaces, namespace aliases, magic words.

    • tewiki-20170320-siteinfo-namespaces.json 18 KB
  12. Wiki page-to-page link records.

    • tewiki-20170320-pagelinks.sql.gz 24.6 MB
  13. List of pages’ geographical coordinates

    • tewiki-20170320-geo_tags.sql.gz 106 KB
  14. Name/value pairs for pages.

    • tewiki-20170320-page_props.sql.gz 1.3 MB
  15.   List of annotations (tags) for revisions and log entries

    • tewiki-20170320-change_tag.sql.gz 212 KB
  16.   Wiki category membership link records.

    • tewiki-20170320-categorylinks.sql.gz 6.2 MB
  17.   Wiki external URL link records.

    • tewiki-20170320-externallinks.sql.gz 8.3 MB
  18.   Interwiki link tracking records

    • tewiki-20170320-iwlinks.sql.gz 859 KB
  19.   Nonexistent pages that have been protected.

    • tewiki-20170320-protected_titles.sql.gz 1 KB
  20.   Wiki template inclusion link records.

    • tewiki-20170320-templatelinks.sql.gz 5.4 MB
  21.   Redirect list

    • tewiki-20170320-redirect.sql.gz 340 KB
  22. A few statistics such as the page count.

    • tewiki-20170320-site_stats.sql.gz 801 bytes
  23. User group assignments.

    • tewiki-20170320-user_groups.sql.gz 1 KB
  24. This contains the SiteMatrix information from meta.wikimedia.org provided as a table.

    • tewiki-20170320-sites.sql.gz 19 KB
  25. Wiki media/files usage records.

    • tewiki-20170320-imagelinks.sql.gz 2.0 MB
  26. Category information.

    • tewiki-20170320-category.sql.gz 359 KB
  27. Base per-page data (id, title, old restrictions, etc).

    • tewiki-20170320-page.sql.gz 7.0 MB
  28. Newer per-page restrictions table.

    • tewiki-20170320-page_restrictions.sql.gz 3 KB
  29. Tracks which pages use which Wikidata items or properties and what aspect (e.g. item label) is used.

    • tewiki-20170320-wbc_entity_usage.sql.gz 929 KB
  30. Metadata on current versions of uploaded media/files.

    • tewiki-20170320-image.sql.gz 1.6 MB
  31. Wiki interlanguage link records.

    • tewiki-20170320-langlinks.sql.gz 9.2 MB

I will be using item 5 above, which is the current version of all pages. Pages in Wikipedia can be article pages, article talk pages, user pages, user talk pages, Wikipedia policy pages and their talk pages, templates, categories, MediaWiki script pages and all the corresponding talk pages.
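These page kinds correspond to MediaWiki namespace numbers: even numbers are subject pages and the following odd number is the matching talk namespace. The lookup below is a small sketch of that rule; the names `SUBJECT_NAMESPACES` and `describe_namespace` are made up for this example, and only the standard namespaces are listed, so anything wiki-specific is reported as "other".

```python
# Standard MediaWiki subject namespaces; each one has a talk namespace at number + 1.
SUBJECT_NAMESPACES = {
    0: "article",
    2: "user",
    4: "project (Wikipedia policy)",
    6: "file",
    8: "mediawiki",
    10: "template",
    12: "help",
    14: "category",
}


def describe_namespace(ns):
    """Return (kind, is_talk) for a namespace number."""
    if ns < 0:
        return ("virtual", False)   # -1 is Special, -2 is Media
    subject = ns - (ns % 2)         # talk namespace n maps back to subject n - 1
    kind = SUBJECT_NAMESPACES.get(subject, "other")
    return (kind, ns % 2 == 1)


for number in (0, 1, 3, 10, 15):
    print(number, describe_namespace(number))
```

For a word corpus, the article namespace (0) is the obvious one to keep; whether talk pages belong in the corpus is a judgment call.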

I am using Ubuntu 16.04 LTS.

I used wget to download the compressed dump, using the command:

wget https://dumps.wikimedia.org/tewiki/20170320/tewiki-20170320-pages-meta-current.xml.bz2

The above URL means that the dump is of Telugu Wikipedia (tewiki), taken on 20 March 2017.
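Once the file is downloaded, it can be read without unpacking it to disk first. The following is a rough sketch using Python's standard `bz2` and `xml.etree.ElementTree` modules. The helper names `local_name`, `iter_pages` and `article_texts` are made up for this example, and it assumes the usual MediaWiki export layout, where each `<page>` holds a `<title>`, an `<ns>` and one `<revision>` with a `<text>` element.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (ns, title, text) for each <page> in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Assumption: a missing <ns> is treated as the main namespace.
        fields = {"ns": "0", "title": "", "text": ""}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in fields and child.text:
                fields[name] = child.text
        yield int(fields["ns"]), fields["title"], fields["text"]
        elem.clear()  # drop the page's children to keep memory use down


def article_texts(path):
    """Yield (title, wikitext) for main-namespace pages in a .xml.bz2 dump."""
    with bz2.open(path, "rb") as stream:
        for ns, title, text in iter_pages(stream):
            if ns == 0:
                yield title, text
```

Calling `article_texts("tewiki-20170320-pages-meta-current.xml.bz2")` would then walk through the articles one at a time. The text is still raw wiki markup at this point, so it needs further cleaning before it can serve as a plain-text corpus.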

Book festival: stall showcasing works of ‘Kavi Samrat’ a highlight – The Hindu


Kavi Samrat’s books at the Vijayawada book exhibition

CS concepts in real time : Dangling Pointer

We come across many CS concepts in real life.

I will be posting some of these, with the prefix “CS concepts in real time” in the title.

I am not here to reinvent the wheel or to teach you the concept itself.

So, please visit this page for an overview of the topic before you move ahead: https://en.wikipedia.org/wiki/Dangling_pointer

I happen to be an admin of a few groups on Facebook. Most of these groups have a membership process where any Facebook user can request to join the group.

When such a request comes in, imagine the user account to be an object. Everything else concerned with this object points to it, including the requests sent, pictures, invitations and much more.

So, when a request is sent by such a user, and the user deletes or disables the account before the admin approves it, there is still an active request, a pointer to the user object, but the user no longer exists. The request has now become a dangling pointer.
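The same situation can be modelled in a few lines of Python. This is only an analogy: Python does not have raw pointers, so a stored user ID plays the role of the pointer here, and every name in the sketch (`User`, `JoinRequest`, `resolve`, the sample values) is made up for illustration.

```python
class User:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name


class JoinRequest:
    """A pending request that refers to its user by ID only."""

    def __init__(self, user_id, group):
        self.user_id = user_id
        self.group = group


def resolve(request, users):
    """Return the requesting User, or None if the reference now dangles."""
    return users.get(request.user_id)


users = {42: User(42, "hypothetical_user")}
request = JoinRequest(42, "hypothetical_group")

print(resolve(request, users) is not None)  # the user still exists

del users[42]                               # the account is deleted before approval
print(resolve(request, users))              # the request now points at nothing
```

In C, following such a pointer is undefined behaviour. Here the lookup simply comes back empty, which is roughly what the admin sees: a request with nobody behind it.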

 

“WARNING: Token not found on wikipedia:” warning on pywikipediabot

WARNING: Token not found on wikipedia:te. You will not be able to edit any page.
Received incomplete XML data. Sleeping for 15 seconds…

This warning appears when the pywikipedia installation is outdated.

So, try updating it. It is also advisable to use the git repository instead of svn.

Check https://www.mediawiki.org/wiki/Manual:Pywikipediabot/Gerrit

Interview street Career Gear Programming challenge

My solution to Interview street Career Gear Programming challenge 2, in PHP:

Question: Given an input string and a specified length, center the string within a larger string of the specified length, padding both sides with stars.

Sample Input : apple, 21

Sample Output :  ********apple********

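<?php
// Read the string and the target length. Passing null as the second argument
// means "accept any input"; the third argument is the default for empty input.
$sample = ReadStdin('Please input a string: ', null, 'apple');
$length_sp = (int) ReadStdin('Please input specified length: ', null, '21');
echo center_string($sample, $length_sp);

// Center $string in a field of $specified_length characters, padded with stars.
// Returns false when the string is longer than the requested length.
function center_string($string, $specified_length) {
    $str_len = strlen($string);

    if ($str_len > $specified_length) {
        return false;
    }
    else if ($str_len == $specified_length) {
        return $string;
    }
    else {
        // For "apple" and 21: 16 stars in total, 8 in front and 8 behind.
        $starlen = $specified_length - $str_len;
        // When $starlen is odd, the extra star goes in front.
        $star_string_fr = str_repeat('*', (int) ceil($starlen / 2));
        $star_string_bc = str_repeat('*', (int) floor($starlen / 2));
        return $star_string_fr . $string . $star_string_bc;
    }
}

// Prompt until the input is acceptable: anything when $valid_inputs is null,
// a member of $valid_inputs when it is an array, or an existing file when it is 'is_file'.
function ReadStdin($prompt, $valid_inputs, $default = '') {
    while (!isset($input) || (is_array($valid_inputs) && !in_array($input, $valid_inputs)) || ($valid_inputs === 'is_file' && !is_file($input))) {
        echo $prompt;
        $input = strtolower(trim(fgets(STDIN)));
        if (empty($input) && !empty($default)) {
            $input = $default;
        }
    }
    return $input;
}
?>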
