This page may be out of date, but it should accurately reflect the status at the time it was last updated.
Languages
- On multiple pages: Hungarian, Irish, Italian, Spanish, Galician, Ancient Greek, English, Lithuanian
- On one page: Mapudungun, Hiligaynon
Overview
create_indices.sh
Downloads the latest XML dump from http://devtionary.info/w/dump/xmlu and then runs the following programs.
nicen.dump.awk
Normalize the XML dump, removing entries I am not interested in and formatting those I am interested in more readably.
extract_words.awk
Scan through the dump and add to a list every entry that contains at least one definition that doesn't look like a "form of" definition. This step also stores any audio files it finds, and notes whether the link will need a #Language anchor because the language is not the first section on the page.
- Entries whose only definition line consists entirely of a template (except {{SI unit}} and {{given name}}) are excluded.
- Definitions that start with "compound of" are excluded.
- Definitions that contain variations on "X form of", where X is present/perfect/plural/singular/past historic/preterite/compound/ending in ive, are excluded.
- This is of course guesswork. If you notice words that should be in the index but aren't, or words that shouldn't be in the index but are, let me know.
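The exclusion rules above could be approximated as follows. This is a minimal sketch in Python rather than the original awk. The names `looks_like_form_of` and `entry_is_indexable` and the exact regular expressions are assumptions for illustration, not the script's actual patterns. The "ending in ive" case is left out because its exact meaning is not spelled out here, and the template rule is applied per line rather than per entry for simplicity.

```python
import re
from typing import List

# Templates that may make up a whole definition line (from the rules above).
ALLOWED_TEMPLATES = ("SI unit", "given name")

# Assumed pattern for "X form of" definitions; the real wording may differ.
FORM_OF = re.compile(
    r"\b(present|perfect|plural|singular|past historic|preterite|compound)"
    r"\b.*\bform of\b",
    re.IGNORECASE,
)

# A line that is nothing but a single template, e.g. "{{plural of|gato}}".
ONLY_TEMPLATE = re.compile(r"^\{\{([^{}|]+)(\|[^{}]*)?\}\}$")


def looks_like_form_of(definition: str) -> bool:
    """Return True if a definition line should not count towards indexing."""
    text = definition.lstrip("# ").strip()
    match = ONLY_TEMPLATE.match(text)
    if match:
        return match.group(1).strip() not in ALLOWED_TEMPLATES
    if text.lower().startswith("compound of"):
        return True
    return bool(FORM_OF.search(text))


def entry_is_indexable(definitions: List[str]) -> bool:
    """An entry is kept if at least one definition is not a "form of" line."""
    return any(not looks_like_form_of(d) for d in definitions)
```

With this sketch, an entry whose only definition is `# {{plural of|gato}}` is dropped, while one that also has `# A domestic cat.` is kept.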
get_trans.py
Scan through the dump, find every translation into a language that is being indexed, and add it to the lists created in step 2 (extract_words.awk).
- This looks for any line starting with "*<Language name>:"
- It discards everything in (brackets).
- It will include anything in a {{t}} template or {{l}} template.
- It will include any remaining links.
- If the entire line looks like a valid term, then it will include the whole line.
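The translation rules above might look roughly like this in Python. It is a sketch only: the function name `extract_translations` is hypothetical, the templates are assumed to have the shape `{{t|code|term|...}}`, and the test for a "valid term" (letters separated by single spaces, hyphens or apostrophes) is a guess at what "looks like a valid term" means.

```python
import re
from typing import List

# Assumed template shape: {{t|<lang code>|<term>|...}} or {{l|<lang code>|<term>|...}}
TEMPLATE = re.compile(r"\{\{(?:t|l)\|[^|{}]*\|([^|{}]+)[^{}]*\}\}")
# [[term]], [[term|label]] or [[term#Section|label]]; only the target is kept.
LINK = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")
# Anything in (round brackets) is discarded.
BRACKETS = re.compile(r"\([^()]*\)")
# Guess at a "valid term": words made of letters, joined by space, ' or -.
PLAIN_TERM = re.compile(r"^[^\W\d_]+(?:[ '-][^\W\d_]+)*$")


def extract_translations(line: str, language: str) -> List[str]:
    """Collect candidate terms from one "*<Language name>:" line."""
    prefix = f"*{language}:"
    if not line.startswith(prefix):
        return []
    rest = BRACKETS.sub("", line[len(prefix):]).strip()
    terms = TEMPLATE.findall(rest)       # terms inside {{t}} / {{l}}
    rest = TEMPLATE.sub("", rest)
    terms += LINK.findall(rest)          # any remaining links
    if not terms and PLAIN_TERM.match(rest):
        terms.append(rest)               # the whole line is the term
    return terms
```

For example, `extract_translations("*Spanish: {{t|es|gato|m}} (animal), [[felino]]", "Spanish")` yields the template term followed by the linked term, and a bare line such as `*Irish: cat` yields the whole remainder.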
get_missing.py
(For some languages) scan through the current index for that language and add all words there to the list as "missing".
split_index.<language name>.pl
Split the list for each language into files for each starting letter, corresponding to the list of entries on each page, and (for newly added languages) sort them, and divide them by second letter.
format_index.<language name>.pl
Format the per-letter lists into wikitext (for the older few languages, the sorting and splitting by second letter happens here).
indexupload.py
Upload each formatted output file
Sorting and splitting
For all languages, the strings are first normalized to lowercase. As I get round to it, I intend to rewrite the old-style ones as new style ones.
Ancient Greek
- Remove all space and punctuation.
- Treat any remaining non-alphabetic characters and 𐠀ϝϻϡϙ as 0.
- Remove diacritics.
- Split on first two characters.
- Use el_EL.utf-8 to sort original strings.
Galician
- (old style)
- Remove all diacritics (except ñ).
- Treat non-alphabetic characters as 0.
- Split on first two characters.
- Sort on normalised form.
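The Galician steps can be sketched in Python as below. The original is a Perl script, so this is an illustration of the idea rather than the actual code; `normalise` and `split_and_sort` are hypothetical names.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List


def normalise(term: str) -> str:
    """Lowercase, strip diacritics except on ñ, and map non-letters to '0'."""
    out = []
    for ch in unicodedata.normalize("NFC", term.lower()):
        if ch == "ñ":
            out.append(ch)
            continue
        # Decompose the character and drop its combining marks.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        out.append(base if base.isalpha() else "0")
    return "".join(out)


def split_and_sort(terms: List[str]) -> Dict[str, List[str]]:
    """Group terms by the first two normalised characters, sorted within each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for term in terms:
        groups[normalise(term)[:2]].append(term)
    for key in groups:
        groups[key].sort(key=normalise)
    return dict(sorted(groups.items()))
```

Here the original spelling is what ends up in the output; the normalised form is only used as the grouping and sort key.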
Hungarian
- (old style)
- Replace á é í ó ú ő ű with a e i o u ö ü.
- Treat non-alphanumeric characters as 0.
- Split on first two (cs|gy|ly|ny|sz|ty|zs|0])
- Sort on normalised form.
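One way to read the Hungarian split is that the listed digraphs count as single letters when taking the first two. The Python sketch below works under that assumption; the names `normalise_hu` and `split_key_hu` are hypothetical, and the original is a Perl script.

```python
import re

# Fold long vowels onto their short counterparts, as listed above.
ACCENT_MAP = str.maketrans("áéíóúőű", "aeiouöü")

# Digraphs come first so that "cs" is consumed as one letter, not "c" + "s".
LETTER = re.compile(r"cs|gy|ly|ny|sz|ty|zs|.")


def normalise_hu(term: str) -> str:
    """Lowercase, fold accents, and map non-alphanumeric characters to '0'."""
    folded = term.lower().translate(ACCENT_MAP)
    return "".join(ch if ch.isalnum() else "0" for ch in folded)


def split_key_hu(term: str) -> str:
    """Return the first two letters of the normalised form, digraph-aware."""
    letters = LETTER.findall(normalise_hu(term))
    return "".join(letters[:2])
```

Under this reading "család" falls under "csa" rather than "cs", because "cs" is treated as the first letter and "a" as the second.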
Irish
- Remove all space and punctuation.
- Treat any remaining non-alphabetic characters as 0.
- Remove diacritics.
- Remove any leading an.
- Split on first two characters.
- Use gl_GL.utf-8 to sort original string.
Italian
- (old style)
- Remove all diacritics.
- Remove any leading a.
- Treat non-alphabetic characters as 0.
- Split on first two characters.
- Sort on normalised form.
Spanish
- Remove all space and punctuation.
- Treat any remaining non-alphabetic characters as 0.
- Split on first two (ñ|ll|ch|0])
- Use es_ES.utf-8 to sort original string.
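For the new-style languages, sorting the original strings with a named locale could be done along these lines in Python. This is a sketch, not the Perl script itself; `sort_with_locale` is a hypothetical name, and the fallback to plain code point order when the locale is not installed is my own addition.

```python
import locale
from typing import List


def sort_with_locale(terms: List[str], locale_name: str = "es_ES.utf-8") -> List[str]:
    """Sort terms with the collation rules of the given locale, if available."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Locale not installed on this machine: fall back to code point order.
        return sorted(terms)
    try:
        return sorted(terms, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

The resulting order depends on the locale data installed on the machine, so the same input can sort differently on different systems.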
Currently all languages are treated about the same:
- Strike through links that were added as "missing" from the indexes.
- Add an {{audio-list}} for an audio file, if one was found.
- Abbreviate PoS and add that in italics.
- Add an * linked to any entries which contained the word.
- Add #<language name> to links that were not the first on the page.
- Put the lists (#-lists) into a <div class="index"></div> separated by ===-headings and a table of contents.
- This means that the lists run horizontally, so they can change width to fill the space available to them, and users can continue scrolling downwards without having to go back up to find the next column.
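A few of the formatting rules above can be sketched in Python as follows. The function name `format_entry` and its parameters are hypothetical, and the exact output layout is an assumption. The {{audio-list}} and * links are left out because their exact form is not described here.

```python
from typing import Optional


def format_entry(
    word: str,
    language: str,
    pos_abbrev: Optional[str] = None,  # abbreviated part of speech, e.g. "n"
    missing: bool = False,             # word only known from the old index
    first_on_page: bool = True,        # language is the first section on the page
) -> str:
    """Format one index entry as a #-list item in wikitext."""
    if first_on_page:
        link = f"[[{word}]]"
    else:
        # Point the link at the language section, keeping the word as the label.
        link = f"[[{word}#{language}|{word}]]"
    if missing:
        link = f"<s>{link}</s>"
    line = f"# {link}"
    if pos_abbrev:
        line += f" ''{pos_abbrev}''"   # two apostrophes give italics in wikitext
    return line
```

Calling `format_entry("gato", "Spanish", pos_abbrev="n")` gives a plain link followed by the italic abbreviation, while a missing entry that is not first on its page gets both the section anchor and the strikethrough.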