This is a Wiktionary policy, guideline or common practices page. Specifically, it is a policy think tank, working to develop a formal policy.
Wiktionary includes many words in many languages. This page details the conventions and practices relating to the variety of languages on Wiktionary.
To distinguish languages, Wiktionary gives each a unique name and a unique code. It also records other information about each language, such as its family and its scripts.
Wiktionary calls each language it includes by a distinct name. This name is used in headers, translation tables, categories, appendices, and some other places. Most languages only have one name, but some may be known by multiple names. In this case, one of the language's names is chosen for use in Wiktionary. This name is referred to as the canonical name of the language. Canonical language names are chosen by consensus. Whenever possible, common English names of languages are used, and diacritics are avoided. Attested names (names which meet WT:CFI) are strongly preferred.
Canonical names must be unique: each name may refer to at most one language. When two or more languages are commonly known by the same name, Wiktionary distinguishes them by choosing a different canonical name for each one, using a variety of means:
… (code: pyx) on Wiktionary, to distinguish it from the language of Papua New Guinea which is called "Pyu" (code: pby).
… (code: ria) goes by the name "Reang" on Wiktionary, to distinguish it from the "Riang" of Burma/Myanmar (code: ril).
… (code: bwu) and "Buli (Indonesia)" (code: bzq).
… (code: mhz) and "Mor (Papuan)" (code: moq), both of which are spoken in Indonesia.
Each language on Wiktionary also has a unique code assigned to it, usually consisting of two or three letters. This code is used to identify languages when including templates in entries. Language names are not used in this case because they are longer and less precise, as the above section illustrates. Topical categories also use the language code as part of their names.
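The uniqueness requirement for canonical names can be illustrated with a small check. This is only a sketch in Python, not how Wiktionary's own modules enforce the rule: the table holds a handful of code and name pairs taken from the examples above, and the helper name find_name_clashes is hypothetical.

```python
from collections import defaultdict

# Illustrative sample only: code -> canonical name pairs quoted on this page.
# A dict already guarantees that each code appears once.
CANONICAL_NAMES = {
    "pby": "Pyu",
    "ria": "Reang",
    "ril": "Riang",
    "bzq": "Buli (Indonesia)",
    "moq": "Mor (Papuan)",
}


def find_name_clashes(names):
    """Return {name: sorted codes} for each canonical name shared by several codes."""
    codes_by_name = defaultdict(list)
    for code, name in names.items():
        codes_by_name[name].append(code)
    return {
        name: sorted(codes)
        for name, codes in codes_by_name.items()
        if len(codes) > 1
    }


# The sample is clash-free.
print(find_name_clashes(CANONICAL_NAMES))

# Hypothetical mistake: calling "ria" by the name "Riang" would collide with "ril".
print(find_name_clashes({**CANONICAL_NAMES, "ria": "Riang"}))
```

An empty result means every canonical name in the table refers to exactly one language.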
The list of standard language codes can be found at Wiktionary:List of languages and the list of special language codes, including etymology-only languages, can be found at the subpage Wiktionary:List of languages/special.
Wiktionary chooses codes for languages as follows, in order of priority:
… sh.
… mul is used.
… roa-gal: "roa" is the ISO 639-5 code for Romance languages, "gal" abbreviates "Gallo".
… map-bms for Banyumasan (the Banyumasan Wikipedia is map-bms.wikipedia.org), so Wiktionary also represents Banyumasan using this code. If the Wikimedia code is of a different form, it is not used by Wiktionary; for example, Tarantino has the Wikimedia code roa-tara, but the Wiktionary code roa-tar.
… mis is used: for example, Kassite is represented by the code mis-kas.
… qsb is used rather than qfa-sub.
… "-pro" added to the end: Proto-Germanic, for example, is represented by the code gem-pro. Because the entire family code is used as the first part of the code, the code may be longer than seven characters: for example, Proto-Mixe-Zoque is nai-miz-pro.
Not all lects which have been assigned codes by the ISO are assigned codes or included by Wiktionary. This is the case for some constructed languages, for example. There are also many lects to which the ISO has assigned codes but which are not treated as distinct languages on Wiktionary. For example, the ISO assigned Moldovan/Moldavian the 639-1 code mo, but Wiktionary regards it as a form of Romanian and represents it and Romanian by the same code ro. See Wiktionary:Language treatment for more information.
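The proto-language convention mentioned above (the whole family code followed by "-pro") is simple enough to sketch. The following Python is a minimal illustration, not Wiktionary's implementation; the function names are hypothetical, and the family codes gem and nai-miz are read off the examples gem-pro and nai-miz-pro.

```python
PROTO_SUFFIX = "-pro"


def proto_language_code(family_code: str) -> str:
    """Build a proto-language code from the full family code."""
    return family_code + PROTO_SUFFIX


def family_of_proto_code(code: str) -> str:
    """Recover the family code from a proto-language code."""
    if not code.endswith(PROTO_SUFFIX):
        raise ValueError(f"{code!r} does not end in {PROTO_SUFFIX!r}")
    return code[: -len(PROTO_SUFFIX)]


print(proto_language_code("gem"))           # Proto-Germanic
print(proto_language_code("nai-miz"))       # Proto-Mixe-Zoque
print(family_of_proto_code("nai-miz-pro"))  # back to the family code
```

Because the entire family code is kept, the resulting code has no fixed maximum length, which is why nai-miz-pro exceeds seven characters.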
In a small number of cases, there is a mismatch between the (typically ISO-derived) code used by Wiktionary to represent a language and the code used by the Wikimedia Foundation. For example, Aromanian is represented on Wiktionary and in ISO 639-3 by the code rup, but the WMF uses the code roa-rup and locates the Aromanian Wikipedia at roa-rup.wikipedia.org. Templates such as Template:wikipedia, which Wiktionary uses to link to its sister projects, accept only Wiktionary codes. To enable linking to projects (such as the Aromanian Wikipedia) for which the WMF uses special codes, Module:wikimedia languages maps Wiktionary codes to Wikimedia codes, and Module:languages performs the reverse mapping.
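The code translation described above amounts to a lookup with a fall-through for codes that are the same in both systems. The sketch below is written in Python for illustration and does not reproduce the actual contents or interface of Module:wikimedia languages; the table and function names are hypothetical, and only the code pairs mentioned on this page are included.

```python
# Hypothetical exception table: Wiktionary code -> Wikimedia code.
# Only pairs quoted on this page are listed.
WIKTIONARY_TO_WIKIMEDIA = {
    "rup": "roa-rup",      # Aromanian
    "roa-tar": "roa-tara",  # Tarantino
}


def wikimedia_code(wiktionary_code: str) -> str:
    """Return the Wikimedia code, falling back to the Wiktionary code itself."""
    return WIKTIONARY_TO_WIKIMEDIA.get(wiktionary_code, wiktionary_code)


def wikipedia_host(wiktionary_code: str) -> str:
    """Build the Wikipedia host name for a language given its Wiktionary code."""
    return f"{wikimedia_code(wiktionary_code)}.wikipedia.org"


print(wikipedia_host("rup"))      # mismatched code is translated
print(wikipedia_host("map-bms"))  # identical code passes through unchanged
```

Keeping only the exceptions in the table means that the common case, where both projects use the same code, needs no entry at all.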
Wiktionary sorts languages into families. Most families are related through descent from a common ancestor, but a few are merely categories, such as "creoles and pidgins". Wiktionary records which family a language belongs to in the data modules of Module:languages. Like languages, families are represented by unique codes and have unique canonical names.
… (gmw).
… (zls).
… (alg).
… (azc-nah).
Some languages are not naturally descended from other languages, but show other origins. These use special types of families:
… (art).
… (crp).
Wiktionary records which script(s) (writing systems) a language is written in as well. This information is primarily used by modules to be able to automatically detect and format non-Latin-alphabet text appropriately. Scripts, too, have unique codes and canonical names.
… (Latn).
… (Latn and Cyrl).
Every language has a main category which contains all terms that the English Wiktionary has for that language. This category is named using the canonical name of the language, followed by the word "language". For example, the main category for English is Category:English language. If the canonical name of the language already ends in the word "language", nothing is added (hence Category:American Sign Language).
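The naming rule for main categories can be written down as a short function. This is a sketch in Python under the assumption that the "already ends in the word language" test ignores letter case, which the American Sign Language example suggests; the function name main_category is hypothetical.

```python
def main_category(canonical_name: str) -> str:
    """Return the main category title for a language's canonical name."""
    words = canonical_name.split()
    # Assumption: the check on the final word is case-insensitive.
    if words and words[-1].lower() == "language":
        return f"Category:{canonical_name}"
    return f"Category:{canonical_name} language"


print(main_category("English"))
print(main_category("American Sign Language"))
```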
The main category for a language will have a variety of subcategories, which organise terms in various ways. The most important is the "lemma" category tree, which organises all lemmas in a language by their part of speech. As Wiktionary is always being expanded and improved upon, not all languages have their own categories yet, and certain subcategories may still be empty or missing. Categories are created as needed, when new entries are added to them. When content is added in a language lacking a category, the category can simply be created using the {{auto cat}} template, as long as its name follows the standard format used by other languages.
Languages generally also have a page which contains information that is useful to users who want to create or edit entries in that language. This page is named "Wiktionary:About (canonical name of language)", for example Wiktionary:English entry guidelines or Wiktionary:About Spanish. These pages contain a wide variety of information, depending on what other editors have found useful to note. They may explain which templates to use, specific conventions regarding spelling, pronunciation or transliteration, and more. By convention, a shortcut redirect is created to these pages for easy access, named WT:A(language code). For example, WT:AEN redirects to Wiktionary:About English (for which the code is en).
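The two naming conventions above can be sketched as follows. This Python is illustrative only: the function names are hypothetical, and upper-casing the code in the shortcut is an assumption drawn from the single WT:AEN example. Some pages, such as Wiktionary:English entry guidelines, do not follow the "About" pattern, so the first function gives only the conventional title.

```python
def about_page(canonical_name: str) -> str:
    """Conventional title of a language's information page."""
    return f"Wiktionary:About {canonical_name}"


def about_shortcut(language_code: str) -> str:
    """Conventional shortcut redirect, WT:A followed by the language code."""
    # Assumption: the code is upper-cased, as in WT:AEN for "en".
    return f"WT:A{language_code.upper()}"


print(about_page("Spanish"))
print(about_shortcut("en"))
```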
Templates and modules use a system for storing and retrieving the various pieces of information that may be associated with a language. The module Module:languages is used to retrieve all language-related information from other modules. This module cannot be used directly in a template, so instead there is another module named Module:languages/templates, which allows templates to access the information.
An overview of all basic information about a language, such as its canonical name, alternative names, code, family or scripts, can be looked up at Wiktionary:List of languages (or WT:LL for short). This is useful if you need to look up the code for a particular language, or need to know what the canonical name of a language is.
The data itself is not stored in Module:languages, but instead is contained in a number of data modules (see Category:Language data modules).
For instructions on how to edit this information, see the documentation of any of the data modules.
Some lects (e.g. dialects, chronolects and topolects) are given their own language codes, which can be used in many types of templates in place of full language codes, but don't have their own L2 language entries. An example is Classical Persian, which is given the code fa-cls, but whose entries are listed under the ==Persian== header (corresponding to language code fa). The term "etymology-only language" was originally appropriate, as these lects could generally only be used in etymology templates such as {{inh}}, {{bor}} and {{der}}, but their use has now been expanded well beyond these templates, and the term "etymology-only language" is now considered a misnomer. There is consensus (established in the Beer Parlour) to rename them to "language varieties", but this has not been done yet (as of June 2024).
The full list of etymology-only codes can be found in Wiktionary:List of languages/special#Etymology-only languages, and the source module that describes them is Module:etymology languages/data.
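The relationship between an etymology-only code and the L2 header its entries appear under can be pictured as a two-step lookup. The Python below is a sketch only and does not reflect the real layout of Module:etymology languages/data; both tables and the function name l2_header are hypothetical, and they contain only the Classical Persian example given above.

```python
# Hypothetical tables holding only the example from this page.
ETYMOLOGY_ONLY_PARENT = {"fa-cls": "fa"}  # Classical Persian -> Persian
CANONICAL_NAMES = {"fa": "Persian"}


def l2_header(code: str) -> str:
    """Return the L2 header under which entries for this code are listed."""
    # Etymology-only codes resolve to their full language; full codes map to themselves.
    full_code = ETYMOLOGY_ONLY_PARENT.get(code, code)
    return f"=={CANONICAL_NAMES[full_code]}=="


print(l2_header("fa-cls"))
print(l2_header("fa"))
```

Both calls yield the same header, which is the sense in which such a lect has a code of its own but no L2 entry of its own.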