Hello, you have come here looking for the meaning of the word
User talk:DCDuring/1000EnglishEntries. In DICTIOUS you will not only get to know all the dictionary meanings for the word
User talk:DCDuring/1000EnglishEntries, but we will also tell you about its etymology, its characteristics and you will know how to say
User talk:DCDuring/1000EnglishEntries in singular and plural. Everything you need to know about the word
User talk:DCDuring/1000EnglishEntries you have here. The definition of the word
User talk:DCDuring/1000EnglishEntries will help you to be more precise and correct when speaking or writing your texts. Knowing the definition of
User talk:DCDuring/1000EnglishEntries, as well as those of other words, enriches your vocabulary and provides you with more and better linguistic resources.
time \
bzip2 -d < enwiktionary-20120626-pages-articles.xml.bz2 \
| perl -w -Mstrict -e \
'use constant SAMPLE_SIZE => 1000;
$/ = "</page>\n";
my $count;
my @sample;
while(<>)
{ last unless m/<\/page>/;
next unless m/<ns>0<\/ns>/;
next if m/<text xml:space="preserve" \/>/;
die unless m/<text xml:space="preserve">(+)<\/text>/;
next unless $1 =~ m/^== *English *== *$/mg;
die unless m/<title>(+)<\/title>/;
my $title = $1;
++$count;
if($count <= SAMPLE_SIZE)
{ push @sample, $title; }
else
{ my $tmp = int rand $count;
if($tmp < SAMPLE_SIZE)
{ $sample = $title; }
}
}
foreach my $column (0 .. 3)
{ print $column == 0 ? "{{top4}}\n" : "{{mid4}}\n";
my $min_index = int($column * SAMPLE_SIZE / 4);
my $max_index = int(($column + 1) * SAMPLE_SIZE / 4) - 1;
foreach my $i ($min_index .. $max_index)
{ print "* #English|$sample]]\n"; }
}
print "{{bottom}}\n";
' \
> sample-of-1000-English-entry-titles-20120626.txt
—RuakhTALK 19:07, 4 July 2012 (UTC)Reply
- Wouldn't this tend to get more pages on the list from toward the end of the the XML file? The 10000th page had a one in ten (1000/10000) chance of getting into
@sample
(though it might be later ousted); the 1st page, however, would almost certainly be ousted by the time you got to the end of the XML file.
- ...unless the XML file itself is in pseudorandom order (is it? I have no idea), in which case this works, though then you can just use
last if $count > SAMPLE_SIZE; push @sample, $title
without the else
.—msh210℠ (talk) 01:35, 5 July 2012 (UTC)Reply
- The XML dump is not in pseudorandom order, no; it seems to be in order by page-ID, which is, to a close approximation, order by when the page was initially created. But I think this algorithm is sound; if there are 1000+n English entries, then each one should have a 1000/(1000+n) chance of being in the sample. Here's a sketch of a proof by induction:
- If there are 1000+0 = 1000 English entries, then each occurs once in the sample; that is, each has probability 100% = 1000/(1000+0) of being in the sample.
- Assume that the assertion is true for a specific value of n. Then, what happens if there are 1000+n+1 English entries? Well, first we use the algorithm to create a 1000-member sample Sn out of the first 1000+n of them, such that (by assumption) each of those has probability 1000/(1000+n) of being in Sn. Then, according to the algorithm, we create our sample Sn+1 by giving the last English entry a 1000/(1000+n+1) chance of displacing a randomly-selected member of Sn, or put another way, we give each member of Sn a 1/(1000+n+1) chance of being displaced, that is, a (1000+n)/(1000+n+1) chance of not being displaced. To recap: each of the first 1000+n entries has a 1000/(1000+n) chance of making it into Sn, and then, if it makes it there, a (1000+n)/(1000+n+1) chance of making it into Sn+1. (And of course it has no chance of making it into Sn+1 unless it makes it into Sn.) So its total chance of being included in Sn+1 is 1000/(1000+n) × (1000+n)/(1000+n+1) = 1000/(1000+n+1), Q.E.D.
- (Full disclosure: This algorithm is not original with me. I was already familiar with a well-known algorithm for selecting a random line from a file, and all I did was extend it to selecting a fixed-size random subset. I'm certain I'm not the first person to do that; in fact, Google suggests that even this "extended" version might be in Donald Knuth's The Art of Computer Programming, in which case it too is very well known.)
- —RuakhTALK 02:14, 5 July 2012 (UTC)Reply
- The proof looks correct (and elegant)
:-)
. Sorry for wasting your time.—msh210℠ (talk) 04:31, 5 July 2012 (UTC)Reply
- No worries. :-) —RuakhTALK 11:59, 5 July 2012 (UTC)Reply