A command-line application that parses an enwiktionary dump file and reports an error each time it encounters a format incongruence (in either layout or content).
Well, at the moment it is just on my HDD, but if you are interested in looking at the code I'll upload it to GitHub (it is written in Go).
Some time ago I bought an electronic translator (I needed something portable and offline). It was supposed to be good (no names), but it left a lot to be desired. While looking for an alternative I found Wiktionary (I knew Wikipedia, of course, but not Wiktionary itself) and I thought that was what I was looking for: why buy a better translator when, on Wiktionary, anything missing can simply be added once and for all? It sounds way more efficient! Moreover, I could extract the data I needed from the dump file to have something usable offline. So I wrote a short command-line application to extract translations from enwiktionary (something like User:Matthias_Buchmeier's effort), but I soon realized that formatting is not exactly a constant here, so my original idea turned into coding an application to find problems and, at the same time, writing a unified formatting guide.
The application (command line) requires a dump file (I'm currently testing on enwiktionary-20120505-pages-articles.xml.bz2) and returns a list of errors and miscellaneous statistics.
Errors, and output in general, include:
The list is itself a WIP and will be more detailed in the future; anyway, it should give an idea of the direction of the project.
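The dump-scanning step described above might be sketched roughly as follows. This is only an illustration, not the actual application: the `page` struct, `scanPages`, `checkDump`, `checkDumpString` and the single "level-2 language heading" check are all hypothetical names and rules chosen for the example. It streams `<page>` elements with Go's `encoding/xml`, so the whole dump never has to be held in memory.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// page holds the subset of a MediaWiki export <page> element used here.
type page struct {
	Title string `xml:"title"`
	Text  string `xml:"revision>text"`
}

// scanPages streams <page> elements from r and calls visit for each one.
func scanPages(r io.Reader, visit func(page)) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
			var p page
			if err := dec.DecodeElement(&p, &se); err != nil {
				return err
			}
			visit(p)
		}
	}
}

// hasLanguageHeading reports whether text contains a level-2 heading such as
// "==English==". This check is an assumption made for the example only.
func hasLanguageHeading(text string) bool {
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if len(line) > 4 && strings.HasPrefix(line, "==") &&
			!strings.HasPrefix(line, "===") && strings.HasSuffix(line, "==") {
			return true
		}
	}
	return false
}

// checkDump returns one message per page that fails the example check.
func checkDump(r io.Reader) ([]string, error) {
	var errs []string
	err := scanPages(r, func(p page) {
		if !hasLanguageHeading(p.Text) {
			errs = append(errs, p.Title+": no level-2 language heading")
		}
	})
	return errs, err
}

// checkDumpString is a convenience wrapper for in-memory samples.
func checkDumpString(xmlData string) ([]string, error) {
	return checkDump(strings.NewReader(xmlData))
}

// sample is a tiny made-up stand-in for a real dump file.
const sample = `<mediawiki>
<page><title>cat</title><revision><text>==English==
===Noun===
# An animal.</text></revision></page>
<page><title>dog</title><revision><text>===Noun===
# An animal.</text></revision></page>
</mediawiki>`

func main() {
	errs, err := checkDumpString(sample)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for _, e := range errs {
		fmt.Println(e)
	}
}
```

For a real run, the reader passed to `checkDump` would be the opened dump file wrapped in `bzip2.NewReader` from the standard `compress/bzip2` package, which decompresses on the fly.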
(online and pdf version) TODO
All sections under this heading are comments, doubts that need clarification, and wild brainstorming, so feel free to comment. Thank you! The application is currently tested on enwiktionary-20120505-pages-articles.xml.bz2, so its results may differ from the current state of Wiktionary.
Wiktionary:Beer_parlour#A_question_on_redirects
(NTS) Misuse of context labels (October 2012): context usage.
I'm trying to define a format for entries under the Alternative forms heading (nothing is defined in Wiktionary:ELE). Currently, out of a total of 59863 entries, the formatting breakdown is:
* [[term]] {{qualifier}}?
* {{l}} {{qualifier}}?
The rest are templates, wikified terms and plain text combined in various ways: {{term}}, {{sense}}, {{l-nn}}, {{l-nb}}, {{l}}, {{forms}}, {{nn-inf}}, {{pedlink}}, {{zh-ts}}, {{R:Webster 1913}}, {{alternative form of}}, {{seeCites}}, {{soplink}}
I think only formats that produce the same output (one wikified term per line with an optional qualifier) should be allowed, i.e. the two main ones (which should be the preferred one?) and similar cases such as {{l-nn}} and {{l-nb}}, which are the Norwegian versions of {{l}}.
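To make the breakdown above concrete, here is a minimal sketch of how a line under Alternative forms could be classified into the two main formats. The regular expressions and the name `classifyAltForm` are assumptions made for this example; they treat the first format as a plain wikilink and do not cover {{l-nn}} or {{l-nb}}.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// "* [[term]]" optionally followed by one {{qualifier|...}}.
	wikilinkForm = regexp.MustCompile(
		`^\*\s*\[\[[^\[\]{}|]+\]\](\s*\{\{qualifier\|[^{}]+\}\})?\s*$`)
	// "* {{l|lang|term}}" optionally followed by one {{qualifier|...}}.
	lTemplateForm = regexp.MustCompile(
		`^\*\s*\{\{l\|[^{}|]+\|[^{}]+\}\}(\s*\{\{qualifier\|[^{}]+\}\})?\s*$`)
)

// classifyAltForm returns "wikilink", "l-template" or "other" for one line.
func classifyAltForm(line string) string {
	switch {
	case wikilinkForm.MatchString(line):
		return "wikilink"
	case lTemplateForm.MatchString(line):
		return "l-template"
	default:
		return "other"
	}
}

func main() {
	// Made-up sample lines, not taken from the dump.
	lines := []string{
		"* [[colour]] {{qualifier|UK}}",
		"* {{l|en|colour}}",
		"* ''colour'' (UK)",
	}
	counts := map[string]int{}
	for _, l := range lines {
		counts[classifyAltForm(l)]++
	}
	fmt.Println("wikilink:", counts["wikilink"])
	fmt.Println("l-template:", counts["l-template"])
	fmt.Println("other:", counts["other"])
}
```

Anything classified as "other" would be reported for manual review rather than rejected outright, since the preferred format is still undecided.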
translations should be short and "template only" (IMHO)
It would be great (from a validator's point of view) to have only combinations of {{t}} and {{qualifier}} templates, with commas, semicolons and newlines (*::) to separate/group them. However, some cases would be excluded (such as the example in http://en.wiktionary.org/wiki/Template:t), so the idea needs some work:
* Arabic: {{t|ar|فراشة|sc=Arab|f|tr=fará:sha}}; (fertito) ''(Morocco)''; (fartattu) {{italbrac|Tunisia}}
(Fedso (talk) 20:49, 27 June 2012 (UTC))
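A minimal sketch of the "template only" check proposed above: strip the language prefix, remove every allowed template, and accept the line only if nothing but separators is left. The function name `isTemplateOnly` is hypothetical, and accepting {{t+}} and {{t-}} alongside {{t}} is an assumption of this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Leading "* Language: " or "*: Sublanguage: " part of a translation line.
	langPrefix = regexp.MustCompile(`^\*:*\s*[^:{}]+:\s*`)
	// Allowed templates: {{t|...}}, {{t+|...}}, {{t-|...}}, {{qualifier|...}}.
	allowedTpl = regexp.MustCompile(`\{\{(t[+-]?|qualifier)\|[^{}]*\}\}`)
	// What may remain once the allowed templates are removed.
	onlySeparators = regexp.MustCompile(`^[\s,;]*$`)
)

// isTemplateOnly reports whether a translation line consists solely of the
// allowed templates separated by commas, semicolons and whitespace.
func isTemplateOnly(line string) bool {
	loc := langPrefix.FindStringIndex(line)
	if loc == nil {
		return false
	}
	rest := allowedTpl.ReplaceAllString(line[loc[1]:], "")
	return onlySeparators.MatchString(rest)
}

func main() {
	lines := []string{
		"* Italian: {{t|it|farfalla|f}}, {{t|it|parpaglione|m}} {{qualifier|rare}}",
		"* Arabic: {{t|ar|فراشة|sc=Arab|f|tr=fará:sha}}; (fertito) ''(Morocco)''; (fartattu) {{italbrac|Tunisia}}",
	}
	for _, l := range lines {
		fmt.Println(isTemplateOnly(l), l)
	}
}
```

The Arabic line from the example is rejected because of the bracketed transliterations and the {{italbrac}} template, which is exactly the kind of case the proposal would have to deal with.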
{{el-p}} (only Greek), some {{onym}}, others have individually wiki-linked words. Anyhow, IMHO the practice of putting transliterations in round brackets is not a good idea, as in that case there is no easy way to (automatically) identify them as transliterations. Matthias Buchmeier (talk) 11:14, 28 June 2012 (UTC)
If multiline is not necessary (or not to be used, since it gets messy if {{trans-mid}} ends up in the wrong place), a semicolon could be used instead of a newline (as in yours#Translations -> possessive pronoun -> Italian).
I'd like to see a single way to format translations. I like the "* language, *: sublanguage, *:: newline" format, as in cousin#Translations -> "nephew or niece of a parent" -> Chinese. Only the language line uses a bullet, and definitions are one per line for complex cases (as in Mandarin) and on a single line for simple cases (as in Min Nan). (Fedso (talk) 20:49, 27 June 2012 (UTC))
{{proto}}
(NTS) Grease pit: Template:recons or Template:proto? (proto deprecated)
There are about 15000 uses of {{DEFAULTSORT}}. Isn't it possible to automatically index terms in categories with diacritics removed, instead of having to add {{DEFAULTSORT}} to each page?
Are (''text''), {{italbrac|text}} and {{qualifier|text}} equivalent? Wouldn't it be better to use only one? (Fedso (talk) 20:49, 27 June 2012 (UTC))
{{qualifier}} should be used, yes. {{italbrac}} has been deleted. - -sche (discuss) 00:44, 28 June 2012 (UTC)
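Given that answer, the two other spellings could in principle be rewritten mechanically. The sketch below is only an illustration under the assumption that (''text'') and {{italbrac|text}} always carry the same meaning as {{qualifier|text}}; the name `normalizeQualifiers` is hypothetical, and real entries would need review before any automated change.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// {{italbrac|text}} with a single unnamed parameter.
	italbracTpl = regexp.MustCompile(`\{\{italbrac\|([^{}|]+)\}\}`)
	// (''text'') written by hand with wiki italics inside round brackets.
	italicParen = regexp.MustCompile(`\(''([^'()]+)''\)`)
)

// normalizeQualifiers rewrites both legacy spellings to {{qualifier|text}}.
func normalizeQualifiers(s string) string {
	s = italbracTpl.ReplaceAllString(s, "{{qualifier|${1}}}")
	s = italicParen.ReplaceAllString(s, "{{qualifier|${1}}}")
	return s
}

func main() {
	fmt.Println(normalizeQualifiers("(fartattu) {{italbrac|Tunisia}}"))
	fmt.Println(normalizeQualifiers("[[colour]] (''UK'')"))
}
```

Plain bracketed text such as "(fartattu)" is deliberately left alone, because without italics there is no way to tell a qualifier from a transliteration.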