User talk:DCDuring/1000EnglishEntries

Hello, you have come here looking for the meaning of the word User talk:DCDuring/1000EnglishEntries. In DICTIOUS you will not only get to know all the dictionary meanings for the word User talk:DCDuring/1000EnglishEntries, but we will also tell you about its etymology, its characteristics and you will know how to say User talk:DCDuring/1000EnglishEntries in singular and plural. Everything you need to know about the word User talk:DCDuring/1000EnglishEntries you have here. The definition of the word User talk:DCDuring/1000EnglishEntries will help you to be more precise and correct when speaking or writing your texts. Knowing the definition ofUser talk:DCDuring/1000EnglishEntries, as well as those of other words, enriches your vocabulary and provides you with more and better linguistic resources.

The code

time \
    bzip2 -d < enwiktionary-20120626-pages-articles.xml.bz2 \
  | perl -w -Mstrict -e \
      'use constant SAMPLE_SIZE => 1000;
       
       $/ = "</page>\n";
       
       my $count;
       my @sample;
       
       while(<>)
       { last unless m/<\/page>/;
         next unless m/<ns>0<\/ns>/;
         next if m/<text xml:space="preserve" \/>/;
         die unless m/<text xml:space="preserve">(+)<\/text>/;
         next unless $1 =~ m/^== *English *== *$/mg;
         die unless m/<title>(+)<\/title>/;
         my $title = $1;
         ++$count;
         if($count <= SAMPLE_SIZE)
           { push @sample, $title; }
         else
         { my $tmp = int rand $count;
           if($tmp < SAMPLE_SIZE)
             { $sample = $title; }
         }
       }
       
       foreach my $column (0 .. 3)
       { print $column == 0 ? "{{top4}}\n" : "{{mid4}}\n";
         my $min_index = int($column * SAMPLE_SIZE / 4);
         my $max_index = int(($column + 1) * SAMPLE_SIZE / 4) - 1;
         foreach my $i ($min_index .. $max_index)
           { print "* #English|$sample]]\n"; }
       }
       print "{{bottom}}\n";
      ' \
  > sample-of-1000-English-entry-titles-20120626.txt

RuakhTALK 19:07, 4 July 2012 (UTC)Reply

Wouldn't this tend to get more pages on the list from toward the end of the the XML file? The 10000th page had a one in ten (1000/10000) chance of getting into @sample (though it might be later ousted); the 1st page, however, would almost certainly be ousted by the time you got to the end of the XML file.
...unless the XML file itself is in pseudorandom order (is it? I have no idea), in which case this works, though then you can just use last if $count > SAMPLE_SIZE; push @sample, $title without the else.​—msh210 (talk) 01:35, 5 July 2012 (UTC)Reply
The XML dump is not in pseudorandom order, no; it seems to be in order by page-ID, which is, to a close approximation, order by when the page was initially created. But I think this algorithm is sound; if there are 1000+n English entries, then each one should have a 1000/(1000+n) chance of being in the sample. Here's a sketch of a proof by induction:
  • If there are 1000+0 = 1000 English entries, then each occurs once in the sample; that is, each has probability 100% = 1000/(1000+0) of being in the sample.
  • Assume that the assertion is true for a specific value of n. Then, what happens if there are 1000+n+1 English entries? Well, first we use the algorithm to create a 1000-member sample Sn out of the first 1000+n of them, such that (by assumption) each of those has probability 1000/(1000+n) of being in Sn. Then, according to the algorithm, we create our sample Sn+1 by giving the last English entry a 1000/(1000+n+1) chance of displacing a randomly-selected member of Sn, or put another way, we give each member of Sn a 1/(1000+n+1) chance of being displaced, that is, a (1000+n)/(1000+n+1) chance of not being displaced. To recap: each of the first 1000+n entries has a 1000/(1000+n) chance of making it into Sn, and then, if it makes it there, a (1000+n)/(1000+n+1) chance of making it into Sn+1. (And of course it has no chance of making it into Sn+1 unless it makes it into Sn.) So its total chance of being included in Sn+1 is 1000/(1000+n) × (1000+n)/(1000+n+1) = 1000/(1000+n+1), Q.E.D.
(Full disclosure: This algorithm is not original with me. I was already familiar with a well-known algorithm for selecting a random line from a file, and all I did was extend it to selecting a fixed-size random subset. I'm certain I'm not the first person to do that; in fact, Google suggests that even this "extended" version might be in Donald Knuth's The Art of Computer Programming, in which case it too is very well known.)
RuakhTALK 02:14, 5 July 2012 (UTC)Reply
The proof looks correct (and elegant)  :-) . Sorry for wasting your time.​—msh210 (talk) 04:31, 5 July 2012 (UTC)Reply
No worries. :-)   —RuakhTALK 11:59, 5 July 2012 (UTC)Reply