Conceptually, this is simple. If we use "clicking" but don't have an entry for it, and the root (stem) of "clicking" is "click", which we do have an entry for, then we probably don't need to list "clicking" as a word we don't have.
As Connel pointed out in the Beer parlour (discussion now moved to the Grease pit), a program for extracting word roots is the "Porter Stemmer". It's not perfect, but it's a place to start.
I don't know what kind of platform you're using, RJFJR. I'm using Unix (well, actually, Mac OS X at the moment), which is well suited to scripts that automate tasks like this one. I don't know if it will be of use to you, but here's a reasonably straightforward script that uses the "PorterStemmer" program to cull derived words from your list:
ifile=$1           # input file (list of missing words)
allw=./allwords    # file of all entries in Wiktionary

# Note: this script assumes single words,
# i.e. it won't work for multi-word entries with spaces in them

tf1=/tmp/tf$$.1
tf2=/tmp/tf$$.2
tf3=/tmp/tf$$.3

# Run "PorterStemmer" program to create list of stems
./PorterStemmer "$ifile" > "$tf1"

# paste words and stems together (side-by-side) to make a 2-column file
# (will be used later to map missing stems back to words)
paste "$ifile" "$tf1" | sort -k 2 > "$tf2"

# extract second column (sorted stems),
# use comm to select those not present in list of all words
awk '{print $2}' "$tf2" | sort -u | comm -23 - "$allw" > "$tf3"

# having list of stems not present, go back and correlate with
# words from which those stems were derived
join -1 1 -2 2 -o 2.1 "$tf3" "$tf2" | sort -u

rm "$tf1" "$tf2" "$tf3"
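This presumes that the PorterStemmer program has been compiled in the current directory (cc -o PorterStemmer PorterStemmer.c) and that the allwords file is sorted (run sort -o allwords allwords if not).
If the script is in the file "RJFJRcull.sh", usage is simply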
sh RJFJRcull.sh missing_words
where missing_words is (obviously enough) the file of missing words. The script prints a culled version of the input list, which you can capture by doing
sh RJFJRcull.sh missing_words > culled_missing_words
or whatever. (Apologies if I've belabored the obvious here; I don't know whether you know anything about Unix sh programming or not.)
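In case it helps to see what the comm and join steps of the script do, here is a minimal sketch on made-up data. The words, the hand-written stem column (standing in for real PorterStemmer output) and the temporary file names are all assumptions for illustration only.

```shell
# Toy data: the stem column is written by hand here instead of being
# produced by ./PorterStemmer (an assumption for illustration).
work=$(mktemp -d)
export LC_ALL=C    # keep sort, comm and join agreeing on collation order

# word<TAB>stem pairs, sorted on the stem column
printf 'clicking\tclick\nfrobbing\tfrob\nfrobs\tfrob\nwalked\twalk\n' |
    sort -k 2 > "$work/pairs"

# sorted list of entries that already exist
printf 'click\nwalk\n' | sort > "$work/allwords"

# stems that have no entry: lines only in the first input of comm
awk '{print $2}' "$work/pairs" | sort -u |
    comm -23 - "$work/allwords" > "$work/missing_stems"

# map each missing stem back to the words it was derived from
culled=$(join -1 1 -2 2 -o 2.1 "$work/missing_stems" "$work/pairs" | sort -u)
echo "$culled"

rm -r "$work"
```

Here "clicking" and "walked" drop off the list because their stems have entries, while "frobbing" and "frobs" stay because "frob" does not.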
I forgot to mention: One small problem with the PorterStemmer program as written is that it seems to convert everything to lower case. So if you've got a candidate undefined word Porter, it stems that to port, which we have, so Porter goes off the "missing words" list. This is obviously fixable, but I'm not too worried about it just now, because the effect is small, and it's not as if we're overpruning the "missing words" list down to nothingness such that there's nothing left to do. –scs 19:41, 3 June 2006 (UTC)
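One possible workaround for the lower-casing problem, short of changing the stemmer itself, is to set capitalized words aside before culling so they always stay on the list. This is only a sketch: split_by_case and the file names caps and rest are hypothetical, and it assumes that "capitalized" simply means the word starts with an upper-case letter.

```shell
# Workaround sketch (not a change to PorterStemmer itself): set capitalized
# words aside so that only the remaining words are stemmed and culled.
# split_by_case and its three file arguments are hypothetical names.
split_by_case() {
    # $1 = input word list, $2 = capitalized words, $3 = all other words
    grep '^[[:upper:]]' "$1" > "$2"
    grep -v '^[[:upper:]]' "$1" > "$3"
    return 0    # grep exits non-zero when nothing matches; not an error here
}

# Intended use together with the culling script (shown as comments only):
#   split_by_case missing_words caps rest
#   sh RJFJRcull.sh rest > culled_rest
#   sort -u caps culled_rest > culled_missing_words
```

The trade-off is that a capitalized word is never culled, even when it really is just a sentence-initial form of a word we already have.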
Can we get a similar concord for Wikipedia? Also, is it possible to get a list of all single-word entries for Wikipedia that do not have corresponding Wiktionary entries? Cheers! bd2412 T 03:25, 30 July 2006 (UTC)
Hey, should we delete blue links, and can I knock out obvious junk like mackenziebot, Cerealkiller, male-X, male-Y, male-Z? Perhaps you could make a separate list for your program to review, so that it can avoid picking those up in the future. bd2412 T 04:36, 3 March 2007 (UTC)
The Google search links next to each word are great (I just used them to get rid of several common misspellings). They return, however, lots of non-article pages. One easy way to get rid of a lot of them would be to append +-Talk%3A to the search line so that it doesn't search User_talk: and Talk: pages.
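A minimal sketch of that suggestion, assuming the links are ordinary search URLs that end with the query string. The helper name exclude_talk and the example URL are hypothetical; the real links on the page may be built differently.

```shell
# exclude_talk is a hypothetical helper name; it appends the suffix
# suggested above to whatever search URL it is given.
exclude_talk() {
    printf '%s+-Talk%%3A\n' "$1"
}

# The URL below is only an illustrative assumption about what a
# search link might look like.
exclude_talk 'https://www.google.com/search?q=clicking'
```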