Category Archives: Random

Fractals and me


I can keep writing on and on about fractals. I can keep talking on and on about fractals.

Such is my interest in fractals. Knowing about fractals and understanding them gave me a very reassuring thought: most of the things we see around us are structurally the same. The large is made of many tiny parts, and each tiny part closely resembles the large.

It's like saying that Rahman is made of millions and trillions of tiny cells, where each cell is structurally the same as, or unique to, Rahman!

I may be wrong in putting it so, but I might be right as well.

I clearly feel that when we observe an object from the atomic level up to the visible, comprehensible level, somewhere that object represents a fractal.

Once something fits into a fractal, it perhaps fits into a mathematical equation. Be it at the DNA level, the molecular level, the organ level or even the organism level, fractals are evidently visible.

Physically seeing fractals is one thing; my belief is that fractals are even there in the way we think. Yes, in intangible things like thoughts and emotions too: every tangible and intangible thing around us has a manifestation of fractals.

Every small emotion we have is part of a bigger emotion, and every bigger emotion we hold is built up from small emotions that are identical but perhaps of smaller intensity.

This is my way of trying to tackle problems that seem to be very big. At some level any huge problem is not huge but fractally built of tiny problems. If I can break that huge problem down into many smaller problems, and in turn break each smaller problem into tiny problems, it is easy to tackle. And the solution you come up with for such a tiny problem can be amplified to fix the smaller problem and, in turn, the huge problem.


Nucleosome model DNA quaternary structure, image license : CC-BY-SA, Author : Glwright1


Filed under about myself, Random

Building Corpus from Telugu Wikipedia Day 2

OK, so day 2 came a very long time after day 1.

I used this post from After the Deadline to build the corpus.

First, convert the Wikipedia dump XML into individual article files.

Then use the various available tools to work with the corpus text files.
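The first step above can be sketched in Python. This is only a minimal illustration using the standard library, not the method from the After the Deadline post; the function names and the numbered file-naming scheme are my own assumptions, and the files it writes still contain wiki markup rather than clean plain text.

```python
import bz2
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def split_dump(xml_stream, out_dir):
    """Write the title and text of every <page> in the dump to its own file.

    Returns the list of page titles that were written, in dump order.
    """
    os.makedirs(out_dir, exist_ok=True)
    titles = []
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        # Hypothetical naming scheme: a running number keeps file names
        # safe even when titles contain '/' or other awkward characters.
        path = os.path.join(out_dir, "%06d.txt" % len(titles))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(title + "\n" + text)
        titles.append(title)
        elem.clear()  # free the memory held by the finished page
    return titles


def split_bz2_dump(dump_path, out_dir):
    """Convenience wrapper for a .xml.bz2 dump file on disk."""
    with bz2.open(dump_path, "rb") as stream:
        return split_dump(stream, out_dir)
```

Calling `split_bz2_dump` with the path of a downloaded dump and an output folder would produce one numbered text file per page, with the page title on the first line.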


Filed under Random

Building Corpus from Telugu Wikipedia Day 1

Telugu Wikipedia stands as the single largest source of information in Telugu, with 66,691 articles as of today. My attempt would be to create a plain text word corpus from Telugu Wikipedia.

The first step is to get the data from Telugu Wikipedia.

There are multiple ways to do that. The simplest is to use something like WebHTTrack and scrape the entire website in HTML format. The advantage is that it makes a copy of Wikipedia as we see it, sans the internal wiki code. Other methods have many a time skipped text that appears in tables and templates, while HTML-rendered pages do not skip such text. But processing HTML would be too cumbersome. Also, you cannot be selective with the articles: all of them might be downloaded without your control. As for dead-end pages and orphan pages, you should forget about them.

One can use a pywikibot script and copy each page’s content. Here you can be selective about which pages, categories or articles you scrape, but it takes too much time. Also, several Wikipedia articles’ text database tables are inaccessible to pywikibot.

AWB (AutoWikiBrowser) can be set up to load and download specific pages as well, but its compatibility with UTF-8 and complex scripts is unexplored, and since I do not use AWB, much about it is unknown to me.

Either way, Wikipedia and the engineering team at the Wikimedia Foundation seem to have heard us. All the pages, text, articles and whatever else one may need are extracted from time to time and stored in what are called dumps.

The dump listing looks something like the one given below (I will be using the tewiki dumps):

  1. Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream

    • tewiki-20170320-pages-articles-multistream.xml.bz2 92.6 MB
    • tewiki-20170320-pages-articles-multistream-index.txt.bz2 1.1 MB
    • tewiki-20170320-pages-meta-history.xml.bz2
  2. history content of flow pages in xml format. These files contain flow page content in xml format.

    • tewiki-20170320-flowhistory.xml.bz2 12 KB
  3. content of flow pages in xml format. These files contain flow page content in xml format.

    • tewiki-20170320-flow.xml.bz2 10 KB
  4. Log events to all pages and users. This contains the log of actions performed on pages and users.

    • tewiki-20170320-pages-logging.xml.gz 7.1 MB
  5. All pages, current versions only.

    • tewiki-20170320-pages-meta-current.xml.bz2 105.8 MB
  6. Articles, templates, media/file descriptions, and primary meta-pages.

    • tewiki-20170320-pages-articles.xml.bz2 88.6 MB
  7. First-pass for page XML data dumps. These files contain no page text, only revision metadata.

    • tewiki-20170320-stub-meta-history.xml.gz 127.5 MB
    • tewiki-20170320-stub-meta-current.xml.gz 16.0 MB
    • tewiki-20170320-stub-articles.xml.gz 11.1 MB
  8. Extracted page abstracts for Yahoo

    • tewiki-20170320-abstract.xml 150.8 MB
  9. List of all page titles

    • tewiki-20170320-all-titles.gz 1.4 MB
  10. List of page titles in main namespace

    • tewiki-20170320-all-titles-in-ns0.gz 619 KB
  11. Namespaces, namespace aliases, magic words.

    • tewiki-20170320-siteinfo-namespaces.json 18 KB
  12. Wiki page-to-page link records.

    • tewiki-20170320-pagelinks.sql.gz 24.6 MB
  13. List of pages’ geographical coordinates

    • tewiki-20170320-geo_tags.sql.gz 106 KB
  14. Name/value pairs for pages.

    • tewiki-20170320-page_props.sql.gz 1.3 MB
  15. List of annotations (tags) for revisions and log entries

    • tewiki-20170320-change_tag.sql.gz 212 KB
  16. Wiki category membership link records.

    • tewiki-20170320-categorylinks.sql.gz 6.2 MB
  17. Wiki external URL link records.

    • tewiki-20170320-externallinks.sql.gz 8.3 MB
  18. Interwiki link tracking records

    • tewiki-20170320-iwlinks.sql.gz 859 KB
  19. Nonexistent pages that have been protected.

    • tewiki-20170320-protected_titles.sql.gz 1 KB
  20. Wiki template inclusion link records.

    • tewiki-20170320-templatelinks.sql.gz 5.4 MB
  21. Redirect list

    • tewiki-20170320-redirect.sql.gz 340 KB
  22. A few statistics such as the page count.

    • tewiki-20170320-site_stats.sql.gz 801 bytes
  23. User group assignments.

    • tewiki-20170320-user_groups.sql.gz 1 KB
  24. This contains the SiteMatrix information, provided as a table.

    • tewiki-20170320-sites.sql.gz 19 KB
  25. Wiki media/files usage records.

    • tewiki-20170320-imagelinks.sql.gz 2.0 MB
  26. Category information.

    • tewiki-20170320-category.sql.gz 359 KB
  27. Base per-page data (id, title, old restrictions, etc).

    • tewiki-20170320-page.sql.gz 7.0 MB
  28. Newer per-page restrictions table.

    • tewiki-20170320-page_restrictions.sql.gz 3 KB
  29. Tracks which pages use which Wikidata items or properties and what aspect (e.g. item label) is used.

    • tewiki-20170320-wbc_entity_usage.sql.gz 929 KB
  30. Metadata on current versions of uploaded media/files.

    • tewiki-20170320-image.sql.gz 1.6 MB
  31. Wiki interlanguage link records.

    • tewiki-20170320-langlinks.sql.gz 9.2 MB

I will be using item 5 above, which is the current version of all pages. Pages in Wikipedia could be article pages, article talk pages, user pages, user talk pages, Wikipedia policy pages and their talk pages, templates, categories, MediaWiki script pages and all the corresponding talk pages.

I am using Ubuntu 16.04 LTS OS.

I used wget to download the compressed dump, using the command:


The above URL means that the dump is of Telugu Wikipedia, taken on 20 March 2017.
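How a dump URL of this kind breaks down can be sketched in Python. The sketch assumes the usual layout of dumps.wikimedia.org (host, then wiki name, then dump date, then file name); the function names are made up for this illustration, and it is not the wget command I ran.

```python
DUMPS_HOST = "https://dumps.wikimedia.org"  # assumed host for the public dumps


def dump_url(wiki, date, suffix):
    """Build a dump URL from the wiki name, the dump date and the file suffix."""
    return "%s/%s/%s/%s-%s-%s" % (DUMPS_HOST, wiki, date, wiki, date, suffix)


def describe(file_name):
    """Split a dump file name into the wiki name, the date parts and the rest."""
    wiki, date, rest = file_name.split("-", 2)
    return {
        "wiki": wiki,        # e.g. tewiki = Telugu Wikipedia
        "year": date[:4],
        "month": date[4:6],
        "day": date[6:],
        "rest": rest,        # which kind of dump the file holds
    }


url = dump_url("tewiki", "20170320", "pages-meta-current.xml.bz2")
print(url)
print(describe("tewiki-20170320-pages-meta-current.xml.bz2"))
```

The same file name pattern holds for every entry in the listing above, so `describe` works on any of them.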


Filed under Random

Book festival: stall showcasing works of ‘Kavi Samrat’ a highlight – The Hindu

Book festival: stall showcasing works of ‘Kavi Samrat’ a highlight – The Hindu.


Books of Kavi Samrat at the Vijayawada book exhibition


Filed under Random

CS concepts in real time : Dangling Pointer

We come across many CS concepts in our real life all the time.

I will be posting some of these, with the string "CS concepts in real time" at the start of the title.

I am not here to reinvent the wheel, nor to teach you the concept itself.

So, please visit this page and get an overall view of the topic before you move ahead:


I happen to be an admin of a few groups on Facebook. Most of these groups have a membership process where any Facebook user can come and request to join the group.

When that happens, imagine the user account to be an object. Everything else concerned with this object points to it, including the requests sent, pictures, invitations and many more.

So, when a request is sent by one such user and, before the admin approves it, the user deletes or disables the account, there is an active request, a pointer to the user object, but the user no longer exists. The request has now become a dangling pointer.
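The situation can be mimicked in a few lines of Python. Python has no raw pointers, so this is only an analogy: the request stores a user id instead of an address, and every name here (the user store, the request list, `is_dangling`) is hypothetical, not anything Facebook exposes.

```python
# Hypothetical user store: the "objects" that other records point to.
users = {"u1": {"name": "some user"}}

# A join request that points at user u1 by id.
join_requests = [{"group": "g1", "user_id": "u1"}]


def is_dangling(request, users):
    """A request dangles when the user it points to no longer exists."""
    return request["user_id"] not in users


before = is_dangling(join_requests[0], users)
del users["u1"]  # the user deletes the account before the admin approves
after = is_dangling(join_requests[0], users)
print(before, after)  # prints: False True
```

The request record itself is untouched by the deletion, which is exactly the problem: it still looks valid until someone follows it.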



Filed under Random

“WARNING: Token not found on wikipedia:” warning on pywikipediabot

WARNING: Token not found on wikipedia:te. You will not be able to edit any page.
Received incomplete XML data. Sleeping for 15 seconds…


Such a warning appears because the pywikipedia copy is outdated and stale.

So, try updating; it is also advisable to use the git repo instead of svn.



Filed under Random

Interview street Career Gear Programming challenge

Rahimanuddin's Blog

My solution to the Interview Street Career Gear Programming Challenge 2, in PHP:

Question: Given an input string and a specified length, place the string at the center of a bigger string of the specified length, padding both sides with stars.

Sample Input : apple, 21

Sample Output :  ********apple********

<?php
// Read the string and the target length; the sample values act as defaults.
$sample = ReadStdin('Please input a string: ', null, 'apple');
$length_sp = ReadStdin('Please input specified length: ', null, 21);
echo center_string($sample, $length_sp);

// Center $string inside a string of $specified_length, padding with stars.
function center_string($string, $specified_length) {
    $specified_length = (int) $specified_length;
    $str_len = strlen($string);

    if ($str_len > $specified_length) {
        // The string does not fit into the requested length.
        return false;
    } else if ($str_len == $specified_length) {
        return $string;
    } else {
        $starlen = $specified_length - $str_len;
        // An odd number of stars puts the extra star in front.
        $star_string_fr = str_repeat('*', (int) ceil($starlen / 2));
        $star_string_bc = str_repeat('*', (int) floor($starlen / 2));
        return $star_string_fr . $string . $star_string_bc;
    }
}

// Prompt until the input is acceptable; fall back to $default on empty input.
function ReadStdin($prompt, $valid_inputs, $default = '') {
    while (!isset($input) || (is_array($valid_inputs) && !in_array($input, $valid_inputs)) || ($valid_inputs == 'is_file' && !is_file($input))) {
        echo $prompt;
        $input = strtolower(trim(fgets(STDIN)));
        if (empty($input) && !empty($default)) {
            $input = $default;
        }
    }
    return $input;
}


Filed under Random