This module contains definitions and metadata for two-letter language codes. See Wiktionary:Languages for more information.
This module must not be used directly in other modules or templates. The data should be accessed through Module:languages. For the corresponding extra data, see Module:languages/data/2/extra.
The following errors were detected by Module:data consistency check:
nb: has Middle Norwegian language (gmq-mno) set as an ancestor, but is not in the West Scandinavian family (gmq-wes).
nb: has Danish language (da) set as an ancestor, but is not in the East Scandinavian family (gmq-eas).
hns: has Bhojpuri language (bho) set as an ancestor, but is not in the Bihari family (inc-bih).
hns: has Awadhi language (awa) set as an ancestor, but is not in the Eastern Hindi family (inc-hie).
alv-gtm-pro: does not have the expected name "Proto-Ghana-Togo Mountain", even though it is the proto-language of the Ghana-Togo Mountain languages (alv-gtm).
auf-pro: does not have the expected name "Proto-Arauan", even though it is the proto-language of the Arauan languages (auf).
awd-amc-pro: has a proto-language code associated with the invalid code awd-amc.
awd-kmp-pro: has a proto-language code associated with the invalid code awd-kmp.
awd-pro: does not have the expected name "Proto-Arawakan", even though it is the proto-language of the Arawakan languages (awd).
awd-prw-pro: has a proto-language code associated with the invalid code awd-prw.
awd-taa-pro: does not have the expected name "Proto-Ta-Arawakan", even though it is the proto-language of the Ta-Arawakan languages (awd-taa).
dru-pro: has a proto-language code associated with Rukai (dru), which is not a family.
euq-pro: does not have the expected name "Proto-Vasconic", even though it is the proto-language of the Vasconic languages (euq).
gmq-pro: does not have the expected name "Proto-North Germanic", even though it is the proto-language of the North Germanic languages (gmq).
inc-krn-pro: does not have the expected name "Proto-KRNB lects", even though it is the proto-language of the KRNB lects (inc-krn).
mis-hkl: … is repeated in the table of aliases.
nai-chu-pro: does not have the expected name "Proto-Chumashan", even though it is the proto-language of the Chumashan languages (nai-chu).
nai-mdu-pro: does not have the expected name "Proto-Maiduan", even though it is the proto-language of the Maiduan languages (nai-mdu).
nai-miz-pro: does not have the expected name "Proto-Mixe-Zoquean", even though it is the proto-language of the Mixe-Zoquean languages (nai-miz).
nai-pom-pro: does not have the expected name "Proto-Pomoan", even though it is the proto-language of the Pomoan languages (nai-pom).
omq-maz-pro: does not have the expected name "Proto-Mazatecan", even though it is the proto-language of the Mazatecan languages (omq-maz).
os-pro: has a proto-language code associated with Ossetian (os), which is not a family.
poz-swa-pro: does not have the expected name "Proto-North Sarawakan", even though it is the proto-language of the North Sarawakan languages (poz-swa).
sal-pro: does not have the expected name "Proto-Salishan", even though it is the proto-language of the Salishan languages (sal).
smi-pro: does not have the expected name "Proto-Sami", even though it is the proto-language of the Sami languages (smi).
tbq-kuk-pro: does not have the expected name "Proto-Kukish", even though it is the proto-language of the Kukish languages (tbq-kuk).
xsc-sak-pro: does not have the expected name "Proto-Sakan", even though it is the proto-language of the Sakan languages (xsc-sak).
xsc-sar-pro: has a proto-language code associated with the invalid code xsc-sar.
lzh-lit: has a canonical name that is not unique; it is also used by the code lzh.
preprocess_links for ??? (th-new) is invalid.
inc-old: has no child families or languages.
lzh-lit: … is wrong; it should be Literary Chinese.
ira-mid: … and the canonical name Middle Iranian should be removed; they are not found in Module:families/data.
ira-old: … and the canonical name Old Iranian should be removed; they are not found in Module:families/data.

Every entry in the table must contain the following indexed fields:
1 - The canonical name of the language.
2 - The Wikidata item ID for the language, given as a number without the leading Q: a Wikidata item code begins with Q and ends with decimal digits. Set to nil if not known/present. This replaces the older wikipedia_article property, which can still be used to link to specific sections or language editions.
3 - The code of the family the language belongs to.
4 - The script codes for the language, as a comma-separated list. The list is used by the Language:findBestScript method in Module:languages. This function goes down the list of scripts and counts how many characters in the text belong to each script. If all the characters belong to one script, that script will be returned; otherwise, the script with the most characters will be returned. Thus, script detection will be faster if the most frequently used scripts are first in the list. If none of the characters match any of the listed scripts, then the None script is returned (even if the characters would match a script not listed). Translingual (mul) and Undetermined (und) have the special value "All", which means they are treated as having every script. This value should not be set for any other language codes. Example: "Latn, Brai, Shaw, Dsrt".
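The script-detection procedure described for field 4 can be sketched in plain Lua. This is only an illustration under stated assumptions, not the code of Language:findBestScript: the names SCRIPT_RANGES, codepoints, in_script and find_best_script are hypothetical, the script ranges are deliberately tiny stand-ins for the real data in Module:scripts/data, and the special value "All" is not handled.

```lua
-- Hypothetical, heavily simplified script table: each script is a list of
-- inclusive code point ranges (an assumption made for this sketch only).
local SCRIPT_RANGES = {
	Latn = { {0x41, 0x5A}, {0x61, 0x7A} }, -- basic Latin letters only
	Cyrl = { {0x400, 0x4FF} },             -- Cyrillic block
	Grek = { {0x370, 0x3FF} },             -- Greek and Coptic block
}

-- Decode a UTF-8 string into a list of code points (plain Lua, no mw.ustring).
local function codepoints(text)
	local cps = {}
	for ch in text:gmatch("[\1-\127\194-\244][\128-\191]*") do
		local b1, b2, b3, b4 = ch:byte(1, -1)
		local cp
		if b1 < 0x80 then
			cp = b1
		elseif b1 < 0xE0 then
			cp = (b1 - 0xC0) * 0x40 + (b2 - 0x80)
		elseif b1 < 0xF0 then
			cp = (b1 - 0xE0) * 0x1000 + (b2 - 0x80) * 0x40 + (b3 - 0x80)
		else
			cp = (b1 - 0xF0) * 0x40000 + (b2 - 0x80) * 0x1000
				+ (b3 - 0x80) * 0x40 + (b4 - 0x80)
		end
		cps[#cps + 1] = cp
	end
	return cps
end

-- True if the code point falls inside any of the given ranges.
local function in_script(cp, ranges)
	for _, r in ipairs(ranges) do
		if cp >= r[1] and cp <= r[2] then
			return true
		end
	end
	return false
end

-- scripts: comma-separated script codes, in the same form as indexed field 4.
local function find_best_script(text, scripts)
	local cps = codepoints(text)
	local best, best_count = "None", 0
	for code in scripts:gmatch("[^,%s]+") do
		local count = 0
		for _, cp in ipairs(cps) do
			if in_script(cp, SCRIPT_RANGES[code] or {}) then
				count = count + 1
			end
		end
		if count > 0 and count == #cps then
			return code -- every character matched, so later scripts are skipped
		end
		if count > best_count then
			best, best_count = code, count
		end
	end
	return best
end

print(find_best_script("hello", "Latn, Cyrl"))
print(find_best_script("привет", "Latn, Cyrl"))
```

The early return is what makes the order of the list matter for speed: when the first listed script covers the whole text, no other script is examined.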
type - One of the following values:
  regular - This value is the default, so it doesn't need to be specified. It indicates that the language is attested according to WT:CFI and therefore permitted in the main namespace. There may also be reconstructed terms for the language, which are placed in the Reconstruction namespace and must be prefixed with * to indicate a reconstruction.
  reconstructed - This language is not attested according to CFI, and therefore is allowed only in the Reconstruction namespace. All terms in this language are reconstructed, and must be prefixed with *.
  appendix-constructed - This language is attested but does not meet the additional requirements set out for constructed languages (WT:CFI#Constructed languages). Its entries must therefore be in the Appendix namespace, but they are not reconstructed and therefore should not have * prefixed in links.
ancestors - The codes of the language's direct ancestors. Only direct ancestors are given: for example, an entry whose direct ancestor is Middle English lists enm (Middle English); ang (Old English, the ancestor of Middle English), gem-pro (Proto-Germanic, the ancestor of Old English), and ine-pro (Proto-Indo-European, the ancestor of Proto-Germanic) are not listed. The field can be omitted when the direct ancestor is the proto-language of the language's family: Proto-Germanic (gem-pro) belongs to the Indo-European (ine) family, and its direct ancestor is Proto-Indo-European (ine-pro). Because Proto-Indo-European is the proto-language of the Indo-European languages, Proto-Germanic does not need an ancestors table; Proto-Indo-European will be automatically returned as its ancestor by the getAncestors function. The value is a comma-separated string of language codes, for example "cr, fr".
wikimedia_codes - A comma-separated list of Wikimedia language codes, for example "en, simple". Similar data is held in interwiki_langs in Module:translations/data, and in the wiktprefix field of the `metadata` variable in MediaWiki:Gadget-TranslationAdder-Data.js. FIXME: Unify this data.
wikipedia_article - The older property superseded by the Wikidata item ID (indexed field 2); it can still be used to link to specific sections or language editions.
translit - … isTransliterated value set to false in Module:scripts/data. This is used by transliterate in Module:languages.
link_tr - Set to true to link the language's transliteration. For instance, Gothic has entries in Gothic script and entries for transliterations: 𐌷𐌻𐌰𐌹𐌱𐍃 (hlaibs). Otherwise, this can be a comma-separated list of script codes, which means that links are only applied to terms using those scripts.
override_translit - Set to true to make the automatic transliteration override any given manual transliteration. Otherwise, this can be a comma-separated list of script codes, which means that the override is only applied to terms using those scripts.
display_text - Rules for automatically correcting the displayed form of text. For instance, ӏ, used in Cyrillic in many Caucasian languages, is frequently entered as I, or even Latin l or I. As this is an ongoing issue (even among native speakers), the easiest way to solve the problem is to automatically correct the display form for those languages. This is used by makeDisplayText in Module:languages.
entry_name - Rules for converting displayed text into a page name, for example by removing acute accents from Russian words (ру́сский → русский), or macrons from Latin or Old English words (ōs → os), as these are not used in the normal written form of these languages. This is used by makeEntryName in Module:languages.
sort_key - Rules for generating sort keys. A character can be sorted immediately after another one, for example after у, by replacing it with "у" .. p[1]. Another character could be inserted straight after by using "у" .. p[2] (and so on). This is used by makeSortKey in Module:languages.
dotted_dotless_i - Set to true for languages that distinguish between the dotted and dotless I (such as some Turkic languages).

translit, display_text, entry_name and sort_key all use the same syntax, which is designed to be as flexible as possible:
A string is a module name: for example, "sa-translit" refers to Module:sa-translit.
In a table, the keys from, to, remove_diacritics and remove_exceptions relate to text substitution (see below).
A table can also be keyed by script code to give script-specific behaviour; the key 1 can be used as a fallback, which will be used if no specific behaviour is defined for that script.
… 1 if you want to avoid this. It is not possible to process the output of a script-specific module with another module, however: this should be done (for example) with a tail call in the first module.
Modules are called with the arguments text, lang, sc, where text is the input text (usually the page name or input by the user), lang is the language code (not the language object), and sc is the script code (not the script object). For performance reasons, they should only be used when it is not possible to achieve the desired result via text substitution.

Text substitution is specified with:
the from and to keys.
remove_diacritics (and optionally remove_exceptions).

from is paired with to, and both of them must be tables that are organised pairwise: each element in from is a pattern to identify which characters in the term to replace, while the corresponding element in to defines what to replace them with (as arguments to mw.ustring.gsub).
If the corresponding element in to is empty (or false or nil), then any matching characters are removed altogether. This means that the from list can be longer than the to list, and an empty replacement will be assumed for any elements in from that have no counterpart in to.
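The pairwise from/to behaviour can be sketched as follows. This is a minimal illustration, not the code in Module:languages: the helper name substitute is hypothetical, and it uses the standard string.gsub in place of mw.ustring.gsub so that it runs outside Scribunto. That swap is safe here only because the patterns are literal strings; string.gsub is byte-based, so multi-byte characters inside character classes or before quantifiers would need mw.ustring.gsub.

```lua
-- Apply each from[i] pattern in order, replacing matches with to[i].
-- A missing, nil or false replacement removes the match altogether.
local function substitute(text, from, to)
	for i, pattern in ipairs(from) do
		text = text:gsub(pattern, to[i] or "")
	end
	return text
end

-- The from/to pair of the German sort_key entry in this module.
local german = {
	from = {"æ", "œ", "ß"},
	to = {"ae", "oe", "ss"},
}
print(substitute("Straße", german.from, german.to))

-- from may be longer than to: the hyphen pattern has no counterpart,
-- so hyphens are simply deleted.
print(substitute("anti-war", {"%-"}, {}))
```

Because the pairs are applied in order, longer sequences have to come before their prefixes, which is why several sort keys in this module list 3-character sequences before 2-character ones.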
Patterns use the syntax of the mw.ustring.gsub function. See the Scribunto reference manual for more information. Note that patterns make double substitutions a viable way to achieve more complex results. See the Latin sortkey for Mandarin (cmn) as an example of this.
remove_diacritics is a string which contains characters that will be removed after the text is decomposed. For instance, if remove_diacritics is a combining acute accent, all acute accents will be stripped, even if they are part of precomposed characters (such as á or ά). Despite the name, the characters to be stripped need not be diacritics: for instance, including an apostrophe would remove all apostrophes (though be careful with hyphens, which must be escaped as %-).
If remove_diacritics is given, then it is possible to specify a remove_exceptions table, which prevents specific characters from having their diacritics stripped. For instance, if remove_diacritics is a combining diaeresis, but remove_exceptions contains "ё", then any instances of ё will remain unchanged. On the other hand, an instance of ӱ would still become у (unless "ӱ" is also added to remove_exceptions).

aliases, varieties, otherNames
family - See indexed field 3.
scripts - See indexed field 4.
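The interaction between remove_diacritics and remove_exceptions described above can be sketched in plain Lua. Everything here is an assumption made for illustration: the name strip_diacritics is hypothetical, the DECOMPOSE table is a three-character stand-in for real Unicode decomposition, and marks are looked up in a set rather than through a pattern (so the %- escaping rule does not apply to this sketch). Unlike a full implementation, it does not recompose whatever marks remain.

```lua
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"
local ACUTE = "\204\129"     -- U+0301 combining acute accent
local DIAERESIS = "\204\136" -- U+0308 combining diaeresis

-- Tiny stand-in for canonical decomposition (assumption: only these three
-- precomposed characters are known to the sketch).
local DECOMPOSE = {
	["\195\161"] = "a" .. ACUTE,            -- á = a + acute
	["\209\145"] = "\208\181" .. DIAERESIS, -- ё = е + diaeresis
	["\211\177"] = "\209\131" .. DIAERESIS, -- ӱ = у + diaeresis
}

local function strip_diacritics(text, remove_diacritics, remove_exceptions)
	local strip, keep = {}, {}
	for mark in remove_diacritics:gmatch(UTF8_CHAR) do
		strip[mark] = true
	end
	for _, ch in ipairs(remove_exceptions or {}) do
		keep[ch] = true
	end
	local out = {}
	for ch in text:gmatch(UTF8_CHAR) do
		if keep[ch] then
			-- Exceptions are passed through without being decomposed.
			out[#out + 1] = ch
		else
			-- Decompose, then drop every character listed for removal.
			for part in (DECOMPOSE[ch] or ch):gmatch(UTF8_CHAR) do
				if not strip[part] then
					out[#out + 1] = part
				end
			end
		end
	end
	return table.concat(out)
end

-- ё is protected by the exception list, while ӱ still becomes у.
print(strip_diacritics("\209\145\211\177", DIAERESIS, {"\209\145"}))
```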
local m_lang = require("Module:languages")
local m_langdata = require("Module:languages/data")
local u = require("Module:string utilities").char
local c = m_langdata.chars
local p = m_langdata.puaChars
local s = m_langdata.shared
-- Ideally, we want to move these into Module:languages/data, but because (a) it's necessary to use require on that module, and (b) they're only used in this data module, it's less memory-efficient to do that at the moment. If it becomes possible to use mw.loadData, then these should be moved there.
s = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.macron .. c.dacute .. c.caron .. c.cedilla,
remove_exceptions = {"å"},
from = {"æ", "ø", "å"},
to = {"z" .. p, "z" .. p, "z" .. p}
}
s = "AaBbDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsTtUuVvYyÆæØøÅå" .. c.punc
local m = {}
m["aa"] = {
"Afar",
27811,
"cus-eas",
"Latn, Ethi",
entry_name = {Latn = {remove_diacritics = c.acute}},
}
m["ab"] = {
"Abkhaz",
5111,
"cau-abz",
"Cyrl, Geor, Latn",
translit = {
Cyrl = "ab-translit",
Geor = "Geor-translit",
},
override_translit = true,
display_text = {Cyrl = s},
entry_name = {
Cyrl = {
remove_diacritics = c.acute,
from = {"^а%-"},
to = {"а"},
},
Latn = s,
},
sort_key = {
Cyrl = {
from = {
"х'ә", -- 3 chars
"гь", "гә", "ӷь", "ҕь", "ӷә", "ҕә", "дә", "ё", "жь", "жә", "ҙә", "ӡә", "ӡ'", "кь", "кә", "қь", "қә", "ҟь", "ҟә", "ҫә", "тә", "ҭә", "ф'", "хь", "хә", "х'", "ҳә", "ць", "цә", "ц'", "ҵә", "ҵ'", "шь", "шә", "џь", -- 2 chars
"ӷ", "ҕ", "ҙ", "ӡ", "қ", "ҟ", "ԥ", "ҧ", "ҫ", "ҭ", "ҳ", "ҵ", "ҷ", "ҽ", "ҿ", "ҩ", "џ", "ә", -- 1 char
"^а",
},
to = {
"х" .. p,
"г" .. p, "г" .. p, "г" .. p, "г" .. p, "г" .. p, "г" .. p, "д" .. p, "е" .. p, "ж" .. p, "ж" .. p, "з" .. p, "з" .. p, "з" .. p, "к" .. p, "к" .. p, "к" .. p, "к" .. p, "к" .. p, "к" .. p, "с" .. p, "т" .. p, "т" .. p, "ф" .. p, "х" .. p, "х" .. p, "х" .. p, "х" .. p, "ц" .. p, "ц" .. p, "ц" .. p, "ц" .. p, "ц" .. p, "ш" .. p, "ш" .. p, "ы" .. p,
"г" .. p, "г" .. p, "з" .. p, "з" .. p, "к" .. p, "к" .. p, "п" .. p, "п" .. p, "с" .. p, "т" .. p, "х" .. p, "ц" .. p, "ч" .. p, "ч" .. p, "ч" .. p, "ы" .. p, "ы" .. p, "ь" .. p,
"",
}
},
},
}
m["ae"] = {
"Avestan",
29572,
"ira-cen",
"Avst, Gujr",
translit = {Avst = "Avst-translit"},
}
m["af"] = {
"Afrikaans",
14196,
"gmw-frk",
"Latn, Arab",
ancestors = "nl",
sort_key = {
Latn = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.diaer .. c.ringabove .. c.cedilla .. "'",
from = {"n"},
to = {"n" .. p}
}
},
}
m["ak"] = {
"Akan",
28026,
"alv-ctn",
"Latn",
}
m["am"] = {
"Amharic",
28244,
"sem-eth",
"Ethi",
translit = "Ethi-translit",
}
m["an"] = {
"Aragonese",
8765,
"roa-ibe",
"Latn",
ancestors = "roa-oan",
}
m["ar"] = {
"Arabic",
13955,
"sem-arb",
"Arab, Hebr, Syrc, Brai",
translit = {Arab = "ar-translit"},
entry_name = {Arab = "ar-entryname"},
-- put Judeo-Arabic (Hebrew-script Arabic) under the category header
-- U+FB21 HEBREW LETTER WIDE ALEF so that it sorts after Arabic script titles
sort_key = {
Hebr = {
from = {"^%f"},
to = {u(0xFB21)},
},
},
}
m["as"] = {
"Assamese",
29401,
"inc-bas",
"as-Beng",
ancestors = "inc-mas",
translit = "as-translit",
}
m["av"] = {
"Avar",
29561,
"cau-ava",
"Cyrl, Latn, Arab",
ancestors = "oav",
translit = {
Cyrl = "cau-nec-translit",
Arab = "ar-translit",
},
override_translit = true,
display_text = {Cyrl = s},
entry_name = {
Cyrl = s,
Latn = s,
},
sort_key = {
Cyrl = {
from = {"гъ", "гь", "гӏ", "ё", "кк", "къ", "кь", "кӏ", "лъ", "лӏ", "тӏ", "хх", "хъ", "хь", "хӏ", "цӏ", "чӏ"},
to = {"г" .. p, "г" .. p, "г" .. p, "е" .. p, "к" .. p, "к" .. p, "к" .. p, "к" .. p, "л" .. p, "л" .. p, "т" .. p, "х" .. p, "х" .. p, "х" .. p, "х" .. p, "ц" .. p, "ч" .. p}
},
},
}
m["ay"] = {
"Aymara",
4627,
"sai-aym",
"Latn",
}
m["az"] = {
"Azerbaijani",
9292,
"trk-ogz",
"Latn, Cyrl, fa-Arab",
ancestors = "trk-oat",
dotted_dotless_i = true,
entry_name = {
Latn = {
from = {"ʼ"},
to = {"'"},
},
["fa-Arab"] = {
module = "ar-entryname",
from = {
"ۆ",
"ۇ",
"وْ",
"ڲ",
"ؽ",
},
to = {
"و",
"و",
"و",
"گ",
"ی",
},
},
},
display_text = {
Latn = {
from = {"'"},
to = {"ʼ"}
}
},
sort_key = {
Latn = {
from = {
"i", -- Ensure "i" comes after "ı".
"ç", "ə", "ğ", "x", "ı", "q", "ö", "ş", "ü", "w"
},
to = {
"i" .. p,
"c" .. p, "e" .. p, "g" .. p, "h" .. p, "i", "k" .. p, "o" .. p, "s" .. p, "u" .. p, "z" .. p
}
},
Cyrl = {
from = {"ғ", "ә", "ы", "ј", "ҝ", "ө", "ү", "һ", "ҹ"},
to = {"г" .. p, "е" .. p, "и" .. p, "и" .. p, "к" .. p, "о" .. p, "у" .. p, "х" .. p, "ч" .. p}
},
},
}
m["ba"] = {
"Bashkir",
13389,
"trk-kbu",
"Cyrl",
translit = "ba-translit",
override_translit = true,
sort_key = {
from = {"ғ", "ҙ", "ё", "ҡ", "ң", "ө", "ҫ", "ү", "һ", "ә"},
to = {"г" .. p, "д" .. p, "е" .. p, "к" .. p, "н" .. p, "о" .. p, "с" .. p, "у" .. p, "х" .. p, "э" .. p}
},
}
m["be"] = {
"Belarusian",
9091,
"zle",
"Cyrl, Latn",
ancestors = "zle-obe",
translit = {Cyrl = "be-translit"},
entry_name = {
Cyrl = {
remove_diacritics = c.grave .. c.acute,
},
Latn = {
remove_diacritics = c.grave .. c.acute,
remove_exceptions = {"Ć", "ć", "Ń", "ń", "Ś", "ś", "Ź", "ź"},
},
},
sort_key = {
Cyrl = {
remove_diacritics = c.grave .. c.acute,
from = {"ґ", "ё", "і", "ў"},
to = {"г" .. p, "е" .. p, "и" .. p, "у" .. p}
},
Latn = {
remove_diacritics = c.grave .. c.acute,
remove_exceptions = {"Ć", "ć", "Ń", "ń", "Ś", "ś", "Ź", "ź"},
from = {"ć", "č", "dz", "dź", "dž", "ch", "ł", "ń", "ś", "š", "ŭ", "ź", "ž"},
to = {"c" .. p, "c" .. p, "d" .. p, "d" .. p, "d" .. p, "h" .. p, "l" .. p, "n" .. p, "s" .. p, "s" .. p, "u" .. p, "z" .. p, "z" .. p}
},
},
standardChars = {
Cyrl = "АаБбВвГгДдЕеЁёЖжЗзІіЙйКкЛлМмНнОоПпРрСсТтУуЎўФфХхЦцЧчШшЫыЬьЭэЮюЯя",
Latn = "AaBbCcĆćČčDdEeFfGgHhIiJjKkLlŁłMmNnŃńOoPpRrSsŚśŠšTtUuŬŭVvYyZzŹźŽž",
(c.punc:gsub("'", "")) -- Exclude apostrophe.
},
}
m["bg"] = {
"Bulgarian",
7918,
"zls",
"Cyrl",
ancestors = "cu-bgm",
translit = "bg-translit",
entry_name = {
remove_diacritics = c.grave .. c.acute,
remove_exceptions = {"%fѝ%f"},
},
standardChars = "АаБбВвГгДдЕеЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЬьЮюЯя" .. c.punc,
}
m["bh"] = {
"Bihari",
135305,
"inc-eas",
"Deva",
}
m["bi"] = {
"Bislama",
35452,
"crp",
"Latn",
ancestors = "en",
}
m["bm"] = {
"Bambara",
33243,
"dmn-emn",
"Latn, Nkoo",
sort_key = {
Latn = {
from = {"ɛ", "ɲ", "ŋ", "ɔ"},
to = {"e" .. p, "n" .. p, "n" .. p, "o" .. p}
},
},
}
m["bn"] = {
"Bengali",
9610,
"inc-bas",
"Beng, Newa",
ancestors = "inc-mbn",
translit = {Beng = "bn-translit"},
}
m["bo"] = {
"Tibetan",
34271,
"sit-tib",
"Tibt", -- sometimes Deva?
ancestors = "xct",
translit = "Tibt-translit",
override_translit = true,
display_text = s,
entry_name = s,
sort_key = "Tibt-sortkey",
}
m["br"] = {
"Breton",
12107,
"cel-brs",
"Latn",
ancestors = "xbm",
sort_key = {
from = {"ch", "ch"},
to = {"c" .. p, "c" .. p}
},
}
m["ca"] = {
"Catalan",
7026,
"roa-ocr",
"Latn",
ancestors = "roa-oca",
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.diaer .. c.cedilla,
from = {"l·l"},
to = {"ll"}
},
standardChars = "AaÀàBbCcÇçDdEeÉéÈèFfGgHhIiÍíÏïJjLlMmNnOoÓóÒòPpQqRrSsTtUuÚúÜüVvXxYyZz·" .. c.punc,
}
m["ce"] = {
"Chechen",
33350,
"cau-vay",
"Cyrl, Latn, Arab",
translit = {
Cyrl = "cau-nec-translit",
Arab = "ar-translit",
},
override_translit = true,
display_text = {Cyrl = s},
entry_name = {
Cyrl = s,
Latn = s,
},
sort_key = {
Cyrl = {
from = {"аь", "гӏ", "ё", "кх", "къ", "кӏ", "оь", "пӏ", "тӏ", "уь", "хь", "хӏ", "цӏ", "чӏ", "юь", "яь"},
to = {"а" .. p, "г" .. p, "е" .. p, "к" .. p, "к" .. p, "к" .. p, "о" .. p, "п" .. p, "т" .. p, "у" .. p, "х" .. p, "х" .. p, "ц" .. p, "ч" .. p, "ю" .. p, "я" .. p}
},
},
}
m["ch"] = {
"Chamorro",
33262,
"poz",
"Latn",
sort_key = {
remove_diacritics = "'",
from = {"å", "ch", "ñ", "ng"},
to = {"a" .. p, "c" .. p, "n" .. p, "n" .. p}
},
}
m["co"] = {
"Corsican",
33111,
"roa-itd",
"Latn",
sort_key = {
from = {"chj", "ghj", "sc", "sg"},
to = {"c" .. p, "g" .. p, "s" .. p, "s" .. p}
},
standardChars = "AaÀàBbCcDdEeÈèFfGgHhIiÌìÏïJjLlMmNnOoÒòPpQqRrSsTtUuÙùÜüVvZz" .. c.punc,
}
m["cr"] = {
"Cree",
33390,
"alg",
"Latn, Cans",
translit = {Cans = "cr-translit"},
}
m["cs"] = {
"Czech",
9056,
"zlw",
"Latn",
ancestors = "cs-ear",
sort_key = {
from = {"á", "č", "ď", "é", "ě", "ch", "í", "ň", "ó", "ř", "š", "ť", "ú", "ů", "ý", "ž"},
to = {"a" .. p, "c" .. p, "d" .. p, "e" .. p, "e" .. p, "h" .. p, "i" .. p, "n" .. p, "o" .. p, "r" .. p, "s" .. p, "t" .. p, "u" .. p, "u" .. p, "y" .. p, "z" .. p}
},
standardChars = "AaÁáBbCcČčDdĎďEeÉéĚěFfGgHhIiÍíJjKkLlMmNnŇňOoÓóPpRrŘřSsŠšTtŤťUuÚúŮůVvYyÝýZzŽž" .. c.punc,
}
m["cu"] = {
"Old Church Slavonic",
35499,
"zls",
"Cyrs, Glag",
translit = {Cyrs = "Cyrs-translit", Glag = "Glag-translit"},
entry_name = {Cyrs = s},
sort_key = {Cyrs = s},
}
m["cv"] = {
"Chuvash",
33348,
"trk-ogr",
"Cyrl",
ancestors = "cv-mid",
translit = "cv-translit",
override_translit = true,
sort_key = {
from = {"ӑ", "ё", "ӗ", "ҫ", "ӳ"},
to = {"а" .. p, "е" .. p, "е" .. p, "с" .. p, "у" .. p}
},
}
m["cy"] = {
"Welsh",
9309,
"cel-brw",
"Latn",
ancestors = "wlm",
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.diaer .. "'",
from = {"ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"},
to = {"c" .. p, "d" .. p, "f" .. p, "g" .. p, "l" .. p, "p" .. p, "r" .. p, "t" .. p}
},
standardChars = "ÂâAaBbCcDdEeÊêFfGgHhIiÎîLlMmNnOoÔôPpRrSsTtUuÛûWwŴŵYyŶŷ" .. c.punc,
}
m["da"] = {
"Danish",
9035,
"gmq-eas",
"Latn",
ancestors = "gmq-oda",
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.macron .. c.dacute .. c.caron .. c.cedilla,
remove_exceptions = {"å"},
from = {"æ", "ø", "å"},
to = {"z" .. p, "z" .. p, "z" .. p}
},
standardChars = "AaBbDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsTtUuVvYyÆæØøÅå" .. c.punc,
}
m["de"] = {
"German",
188,
"gmw-hgm",
"Latn, Latf",
ancestors = "gmh",
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.diaer .. c.ringabove,
from = {"æ", "œ", "ß"},
to = {"ae", "oe", "ss"}
},
standardChars = "AaÄäBbCcDdEeFfGgHhIiJjKkLlMmNnOoÖöPpQqRrSsẞßTtUuÜüVvWwXxYyZz" .. c.punc,
}
m["dv"] = {
"Dhivehi",
32656,
"inc-ins",
"Thaa, Diak",
translit = {
Thaa = "dv-translit",
Diak = "Diak-translit",
},
override_translit = true,
}
m["dz"] = {
"Dzongkha",
33081,
"sit-tib",
"Tibt",
ancestors = "xct",
translit = "Tibt-translit",
override_translit = true,
display_text = s,
entry_name = s,
sort_key = "Tibt-sortkey",
}
m["ee"] = {
"Ewe",
30005,
"alv-gbe",
"Latn",
sort_key = {
remove_diacritics = c.tilde,
from = {"ɖ", "dz", "ɛ", "ƒ", "gb", "ɣ", "kp", "ny", "ŋ", "ɔ", "ts", "ʋ"},
to = {"d" .. p, "d" .. p, "e" .. p, "f" .. p, "g" .. p, "g" .. p, "k" .. p, "n" .. p, "n" .. p, "o" .. p, "t" .. p, "v" .. p}
},
}
m["el"] = {
"Greek",
9129,
"grk",
"Grek, Polyt, Brai",
ancestors = "el-kth",
translit = {
Grek = "el-translit",
Polyt = "grc-translit",
},
override_translit = true,
entry_name = {
Grek = {remove_diacritics = c.caron .. c.diaerbelow .. c.brevebelow},
Polyt = s,
},
sort_key = {
Grek = s,
Polyt = s,
},
standardChars = {
Grek = "΅·ͺ΄ΑαΆάΒβΓγΔδΕεέΈΖζΗηΉήΘθΙιΊίΪϊΐΚκΛλΜμΝνΞξΟοΌόΠπΡρΣσςΤτΥυΎύΫϋΰΦφΧχΨψΩωΏώ",
Brai = c.braille,
c.punc
},
}
m["en"] = {
"English",
1860,
"gmw-ang",
"Latn, Brai, Shaw, Dsrt", -- entries in Shaw or Dsrt might require prior discussion
wikimedia_codes = "en, simple",
ancestors = "en-ear",
sort_key = {
Latn = {
-- Many of these are needed for sorting language names.
remove_diacritics = "'\"%-%.,%sʻʼ" .. c.diacritics,
-- These are found in entry names.
from = {"æ", "🅱", "", "", "", "", "ɨ", "ł", "", "", "œ", "ꝓ", "ß", "ʋ"},
to = {"ae", "b", "c", "d", "e", "h", "i", "l", "n", "o", "oe", "p", "ss", "v"}
},
},
standardChars = {
Latn = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz",
Brai = c.braille,
c.punc
},
}
m["eo"] = {
"Esperanto",
143,
"art",
"Latn",
sort_key = {
remove_diacritics = c.grave .. c.acute,
from = {"ĉ", "ĝ", "ĥ", "ĵ", "ŝ", "ŭ"},
to = {"c" .. p, "g" .. p, "h" .. p, "j" .. p, "s" .. p, "u" .. p}
},
standardChars = "AaBbCcĈĉDdEeFfGgĜĝHhĤĥIiJjĴĵKkLlMmNnOoPpRrSsŜŝTtUuŬŭVvZz" .. c.punc,
}
m["es"] = {
"Spanish",
1321,
"roa-ibe",
"Latn, Brai",
ancestors = "es-ear",
sort_key = {
Latn = {
remove_exceptions = {"ñ"},
remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.macron .. c.diaer .. c.cedilla,
from = {"ª", "æ", "ñ", "º", "œ"},
to = {"a", "ae", "n" .. p, "o", "oe"}
},
},
standardChars = {
Latn = "AaÁáBbCcDdEeÉéFfGgHhIiÍíJjLlMmNnÑñOoÓóPpQqRrSsTtUuÚúÜüVvXxYyZz",
Brai = c.braille,
c.punc
},
}
m["et"] = {
"Estonian",
9072,
"urj-fin",
"Latn",
sort_key = {
from = {
"š", "ž", "õ", "ä", "ö", "ü", -- 2 chars
"z" -- 1 char
},
to = {
"s" .. p, "s" .. p, "w" .. p, "w" .. p, "w" .. p, "w" .. p,
"s" .. p
}
},
standardChars = "AaBbDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsTtUuVvÕõÄäÖöÜü" .. c.punc,
}
m["eu"] = {
"Basque",
8752,
"euq",
"Latn",
sort_key = {
from = {"ç", "ñ"},
to = {"c" .. p, "n" .. p}
},
standardChars = "AaBbDdEeFfGgHhIiJjKkLlMmNnÑñOoPpRrSsTtUuXxZz" .. c.punc,
}
m["fa"] = {
"Persian",
9168,
"ira-swi",
"fa-Arab, Hebr",
ancestors = "fa-cls",
entry_name = {
from = {"هٔ", "ٱ"}, -- character "ۂ" code U+06C2 to "ه"; hamzatu l-waṣli to a regular alif
to = {"ه", "ا"},
remove_diacritics = c.fathatan .. c.dammatan .. c.kasratan .. c.fatha .. c.damma .. c.kasra .. c.shadda .. c.sukun .. c.superalef,
},
-- put Judeo-Persian (Hebrew-script Persian) under the category header
-- U+FB21 HEBREW LETTER WIDE ALEF so that it sorts after Arabic script titles
sort_key = {
Hebr = {
from = {"^%f"},
to = {u(0xFB21)},
},
},
}
m["ff"] = {
"Fula",
33454,
"alv-fwo",
"Latn, Adlm",
}
m["fi"] = {
"Finnish",
1412,
"urj-fin",
"Latn",
display_text = {
from = {"'"},
to = {"’"}
},
entry_name = { -- used to indicate gemination of the next consonant
remove_diacritics = "ˣ",
from = {"’"},
to = {"'"},
},
sort_key = { -- ] + "aͤ" and "oͤ" as historical variants of "ä" and "ö".
remove_diacritics = "':" .. c.diacritics,
remove_exceptions = {
"a", -- åäaͤ
"o", -- öõőoͤ
"u" -- üű
},
from = {"æ", "", "ł", "ŋ", "œ", "ß", "þ", "u", "å", "aͤ", "o", "ø", "(.)"},
to = {"ae", "d", "l", "n", "oe", "ss", "th", "y", "z" .. p, "ä", "ö", "ö", "%1"}
},
standardChars = "AaBbDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsTtUuVvYyÄäÖö" .. c.punc,
}
m["fj"] = {
"Fijian",
33295,
"poz-pcc",
"Latn",
}
m["fo"] = {
"Faroese",
25258,
"gmq-ins",
"Latn",
sort_key = {
from = {"á", "ð", "í", "ó", "ú", "ý", "æ", "ø"},
to = {"a" .. p, "d" .. p, "i" .. p, "o" .. p, "u" .. p, "y" .. p, "z" .. p, "z" .. p}
},
standardChars = "AaÁáBbDdÐðEeFfGgHhIiÍíJjKkLlMmNnOoÓóPpRrSsTtUuÚúVvYyÝýÆæØø" .. c.punc,
}
m["fr"] = {
"French",
150,
"roa-oil",
"Latn, Brai",
display_text = {
from = {"'"},
to = {"’"}
},
entry_name = {
from = {"’"},
to = {"'"},
},
ancestors = "frm",
sort_key = {Latn = s},
standardChars = {
Latn = "AaÀàÂâBbCcÇçDdEeÉéÈèÊêËëFfGgHhIiÎîÏïJjLlMmNnOoÔôŒœPpQqRrSsTtUuÙùÛûÜüVvXxYyZz",
Brai = c.braille,
c.punc
},
}
m["fy"] = {
"West Frisian",
27175,
"gmw-fri",
"Latn",
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.diaer,
from = {"y"},
to = {"i"}
},
standardChars = "AaâäàÆæBbCcDdEeéêëèFfGgHhIiïìYyỳJjKkLlMmNnOoôöòPpRrSsTtUuúûüùVvWwZz" .. c.punc,
}
m["ga"] = {
"Irish",
9142,
"cel-gae",
"Latn, Latg",
ancestors = "mga",
sort_key = {
remove_diacritics = c.acute,
from = {"ḃ", "ċ", "ḋ", "ḟ", "ġ", "ṁ", "ṗ", "ṡ", "ṫ"},
to = {"bh", "ch", "dh", "fh", "gh", "mh", "ph", "sh", "th"}
},
standardChars = "AaÁáBbCcDdEeÉéFfGgHhIiÍíLlMmNnOoÓóPpRrSsTtUuÚúVv" .. c.punc,
}
m["gd"] = {
"Scottish Gaelic",
9314,
"cel-gae",
"Latn, Latg",
ancestors = "mga",
sort_key = {remove_diacritics = c.grave .. c.acute},
standardChars = "AaÀàBbCcDdEeÈèFfGgHhIiÌìLlMmNnOoÒòPpRrSsTtUuÙù" .. c.punc,
}
m["gl"] = {
"Galician",
9307,
"roa-ibe",
"Latn",
ancestors = "roa-opt",
sort_key = {
remove_diacritics = c.acute,
from = {"ñ"},
to = {"n" .. p}
},
standardChars = "AaÁáBbCcDdEeÉéFfGgHhIiÍíÏïLlMmNnÑñOoÓóPpQqRrSsTtUuÚúÜüVvXxZz" .. c.punc,
}
m["gn"] = {
"Guaraní",
35876,
"tup-gua",
"Latn",
}
m["gu"] = {
"Gujarati",
5137,
"inc-wes",
"Arab, Gujr",
ancestors = "inc-mgu",
translit = {
Gujr = "gu-translit",
},
entry_name = {
remove_diacritics = c.fathatan .. c.dammatan .. c.kasratan .. c.fatha .. c.damma .. c.kasra .. c.kasra .. c.shadda .. c.sukun .. "઼"
},
}
m["gv"] = {
"Manx",
12175,
"cel-gae",
"Latn",
ancestors = "mga",
sort_key = {remove_diacritics = c.cedilla .. "-"},
standardChars = "AaBbCcÇçDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwYy" .. c.punc,
}
m["ha"] = {
"Hausa",
56475,
"cdc-wst",
"Latn, Arab",
entry_name = {Latn = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.macron}},
sort_key = {
Latn = {
from = {"ɓ", "b'", "ɗ", "d'", "ƙ", "k'", "sh", "ƴ", "'y"},
to = {"b" .. p, "b" .. p, "d" .. p, "d" .. p, "k" .. p, "k" .. p, "s" .. p, "y" .. p, "y" .. p}
},
},
}
m["he"] = {
"Hebrew",
9288,
"sem-can",
"Hebr, Phnx, Brai",
ancestors = "he-med",
entry_name = {Hebr = {remove_diacritics = u(0x0591) .. "-" .. u(0x05BD) .. u(0x05BF) .. "-" .. u(0x05C5) .. u(0x05C7) .. c.CGJ}},
}
m["hi"] = {
"Hindi",
1568,
"inc-hnd",
"Deva, Kthi, Newa",
translit = {Deva = "hi-translit"},
standardChars = {
Deva = "अआइईउऊएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहत्रज्ञक्षक़ख़ग़ज़झ़ड़ढ़फ़काखागाघाङाचाछाजाझाञाटाठाडाढाणाताथादाधानापाफाबाभामायारालावाशाषासाहात्राज्ञाक्षाक़ाख़ाग़ाज़ाझ़ाड़ाढ़ाफ़ाकिखिगिघिङिचिछिजिझिञिटिठिडिढिणितिथिदिधिनिपिफिबिभिमियिरिलिविशिषिसिहित्रिज्ञिक्षिक़िख़िग़िज़िझ़िड़िढ़िफ़िकीखीगीघीङीचीछीजीझीञीटीठीडीढीणीतीथीदीधीनीपीफीबीभीमीयीरीलीवीशीषीसीहीत्रीज्ञीक्षीक़ीख़ीग़ीज़ीझ़ीड़ीढ़ीफ़ीकुखुगुघुङुचुछुजुझुञुटुठुडुढुणुतुथुदुधुनुपुफुबुभुमुयुरुलुवुशुषुसुहुत्रुज्ञुक्षुक़ुख़ुग़ुज़ुझ़ुड़ुढ़ुफ़ुकूखूगूघूङूचूछूजूझूञूटूठूडूढूणूतूथूदूधूनूपूफूबूभूमूयूरूलूवूशूषूसूहूत्रूज्ञूक्षूक़ूख़ूग़ूज़ूझ़ूड़ूढ़ूफ़ूकेखेगेघेङेचेछेजेझेञेटेठेडेढेणेतेथेदेधेनेपेफेबेभेमेयेरेलेवेशेषेसेहेत्रेज्ञेक्षेक़ेख़ेग़ेज़ेझ़ेड़ेढ़ेफ़ेकैखैगैघैङैचैछैजैझैञैटैठैडैढैणैतैथैदैधैनैपैफैबैभैमैयैरैलैवैशैषैसैहैत्रैज्ञैक्षैक़ैख़ैग़ैज़ैझ़ैड़ैढ़ैफ़ैकोखोगोघोङोचोछोजोझोञोटोठोडोढोणोतोथोदोधोनोपोफोबोभोमोयोरोलोवोशोषोसोहोत्रोज्ञोक्षोक़ोख़ोग़ोज़ोझ़ोड़ोढ़ोफ़ोकौखौगौघौङौचौछौजौझौञौटौठौडौढौणौतौथौदौधौनौपौफौबौभौमौयौरौलौवौशौषौसौहौत्रौज्ञौक्षौक़ौख़ौग़ौज़ौझ़ौड़ौढ़ौफ़ौक्ख्ग्घ्ङ्च्छ्ज्झ्ञ्ट्ठ्ड्ढ्ण्त्थ्द्ध्न्प्फ्ब्भ्म्य्र्ल्व्श्ष्स्ह्त्र्ज्ञ्क्ष्क़्ख़्ग़्ज़्झ़्ड़्ढ़्फ़्।॥०१२३४५६७८९॰",
c.punc
},
}
m["ho"] = {
"Hiri Motu",
33617,
"crp",
"Latn",
ancestors = "meu",
}
m["ht"] = {
"Haitian Creole",
33491,
"crp",
"Latn",
ancestors = "ht-sdm",
sort_key = {
from = {
"oun", -- 3 chars
"an", "ch", "è", "en", "ng", "ò", "on", "ou", "ui" -- 2 chars
},
to = {
"o" .. p,
"a" .. p, "c" .. p, "e" .. p, "e" .. p, "n" .. p, "o" .. p, "o" .. p, "o" .. p, "u" .. p
}
},
}
m["hu"] = {
"Hungarian",
9067,
"urj-ugr",
"Latn, Hung",
ancestors = "ohu",
sort_key = {
Latn = {
from = {
"dzs", -- 3 chars
"á", "cs", "dz", "é", "gy", "í", "ly", "ny", "ó", "ö", "ő", "sz", "ty", "ú", "ü", "ű", "zs", -- 2 chars
},
to = {
"d" .. p,
"a" .. p, "c" .. p, "d" .. p, "e" .. p, "g" .. p, "i" .. p, "l" .. p, "n" .. p, "o" .. p, "o" .. p, "o" .. p, "s" .. p, "t" .. p, "u" .. p, "u" .. p, "u" .. p, "z" .. p,
}
},
},
standardChars = {
Latn = "AaÁáBbCcDdEeÉéFfGgHhIiÍíJjKkLlMmNnOoÓóÖöŐőPpQqRrSsTtUuÚúÜüŰűVvWwXxYyZz",
c.punc
},
}
m["hy"] = {
"Armenian",
8785,
"hyx",
"Armn, Brai",
ancestors = "axm",
translit = {Armn = "Armn-translit"},
override_translit = true,
entry_name = {
Armn = {
remove_diacritics = "՛՜՞՟",
from = {"եւ", "<sup>յ</sup>", "<sup>ի</sup>", "<sup>է</sup>", "յ̵", "ՙ", "՚"},
to = {"և", "յ", "ի", "է", "ֈ", "ʻ", "’"}
},
},
sort_key = {
Armn = {
from = {
"ու", "եւ", -- 2 chars
"և" -- 1 char
},
to = {
"ւ", "եվ",
"եվ"
}
},
},
}
m["hz"] = {
"Herero",
33315,
"bnt-swb",
"Latn",
}
m["ia"] = {
"Interlingua",
35934,
"art",
"Latn",
}
m["id"] = {
"Indonesian",
9240,
"poz-mly",
"Latn",
ancestors = "ms",
standardChars = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz" .. c.punc,
}
m["ie"] = {
"Interlingue",
35850,
"art",
"Latn",
type = "appendix-constructed",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.circ},
}
m["ig"] = {
"Igbo",
33578,
"alv-igb",
"Latn",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.macron},
sort_key = {
from = {"gb", "gh", "gw", "ị", "kp", "kw", "ṅ", "nw", "ny", "ọ", "sh", "ụ"},
to = {"g" .. p, "g" .. p, "g" .. p, "i" .. p, "k" .. p, "k" .. p, "n" .. p, "n" .. p, "n" .. p, "o" .. p, "s" .. p, "u" .. p}
},
}
m["ii"] = {
"Nuosu",
34235,
"tbq-nlo",
"Yiii",
translit = "ii-translit",
}
m["ik"] = {
"Inupiaq",
27183,
"esx-inu",
"Latn",
sort_key = {
from = {
"ch", "ġ", "dj", "ḷ", "ł̣", "ñ", "ng", "r̂", "sr", "zr", -- 2 chars
"ł", "ŋ", "ʼ" -- 1 char
},
to = {
"c" .. p, "g" .. p, "h" .. p, "l" .. p, "l" .. p, "n" .. p, "n" .. p, "r" .. p, "s" .. p, "z" .. p,
"l" .. p, "n" .. p, "z" .. p
}
},
}
m["io"] = {
"Ido",
35224,
"art",
"Latn",
}
m["is"] = {
"Icelandic",
294,
"gmq-ins",
"Latn",
sort_key = {
from = {"á", "ð", "é", "í", "ó", "ú", "ý", "þ", "æ", "ö"},
to = {"a" .. p, "d" .. p, "e" .. p, "i" .. p, "o" .. p, "u" .. p, "y" .. p, "z" .. p, "z" .. p, "z" .. p}
},
standardChars = "AaÁáBbDdÐðEeÉéFfGgHhIiÍíJjKkLlMmNnOoÓóPpRrSsTtUuÚúVvXxYyÝýÞþÆæÖö" .. c.punc,
}
m["it"] = {
"Italian",
652,
"roa-itd",
"Latn",
ancestors = "roa-oit",
sort_key = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.diaer .. c.ringabove},
standardChars = "AaÀàBbCcDdEeÈèÉéFfGgHhIiÌìLlMmNnOoÒòPpQqRrSsTtUuÙùVvZz" .. c.punc,
}
m["iu"] = {
"Inuktitut",
29921,
"esx-inu",
"Cans, Latn",
translit = {Cans = "cr-translit"},
override_translit = true,
}
m["ja"] = {
"Japanese",
5287,
"jpx",
"Jpan, Latn, Brai",
ancestors = "ja-ear",
translit = s,
link_tr = true,
display_text = s,
entry_name = s,
sort_key = s,
}
m["jv"] = {
"Javanese",
33549,
"poz",
"Latn, Java",
ancestors = "kaw",
translit = {Java = "jv-translit"},
link_tr = true,
entry_name = {remove_diacritics = c.circ}, -- Modern jv doesn't use ê
sort_key = {
Latn = {
from = {"å", "dh", "é", "è", "ng", "ny", "th"},
to = {"a" .. p, "d" .. p, "e" .. p, "e" .. p, "n" .. p, "n" .. p, "t" .. p}
},
},
}
m["ka"] = {
"Georgian",
8108,
"ccs-gzn",
"Geor, Geok, Hebr", -- Hebr is used to write Judeo-Georgian
ancestors = "ka-mid",
translit = {
Geor = "Geor-translit",
Geok = "Geok-translit",
},
override_translit = true,
entry_name = {remove_diacritics = c.circ},
}
m["kg"] = {
"Kongo",
33702,
"bnt-kng",
"Latn",
}
m["ki"] = {
"Kikuyu",
33587,
"bnt-kka",
"Latn",
}
m["kj"] = {
"Kwanyama",
1405077,
"bnt-ova",
"Latn",
}
m["kk"] = {
"Kazakh",
9252,
"trk-kno",
"Cyrl, Latn, kk-Arab",
translit = {
Cyrl = {
from = {
"Ё", "ё", "Й", "й", "Нг", "нг", "Ӯ", "ӯ", -- 2 chars; are "Ӯ" and "ӯ" actually used?
"А", "а", "Ә", "ә", "Б", "б", "В", "в", "Г", "г", "Ғ", "ғ", "Д", "д", "Е", "е", "Ж", "ж", "З", "з", "И", "и", "К", "к", "Қ", "қ", "Л", "л", "М", "м", "Н", "н", "Ң", "ң", "О", "о", "Ө", "ө", "П", "п", "Р", "р", "С", "с", "Т", "т", "У", "у", "Ұ", "ұ", "Ү", "ү", "Ф", "ф", "Х", "х", "Һ", "һ", "Ц", "ц", "Ч", "ч", "Ш", "ш", "Щ", "щ", "Ъ", "ъ", "Ы", "ы", "І", "і", "Ь", "ь", "Э", "э", "Ю", "ю", "Я", "я", -- 1 char
},
to = {
"E", "e", "İ", "i", "Ñ", "ñ", "U", "u",
"A", "a", "Ä", "ä", "B", "b", "V", "v", "G", "g", "Ğ", "ğ", "D", "d", "E", "e", "J", "j", "Z", "z", "İ", "i", "K", "k", "Q", "q", "L", "l", "M", "m", "N", "n", "Ñ", "ñ", "O", "o", "Ö", "ö", "P", "p", "R", "r", "S", "s", "T", "t", "U", "u", "Ū", "ū", "Ü", "ü", "F", "f", "X", "x", "H", "h", "S", "s", "Ç", "ç", "Ş", "ş", "Ş", "ş", "", "", "Y", "y", "I", "ı", "", "", "É", "é", "Ü", "ü", "Ä", "ä",
}
}
},
-- override_translit = true,
sort_key = {
Cyrl = {
from = {"ә", "ғ", "ё", "қ", "ң", "ө", "ұ", "ү", "һ", "і"},
to = {"а" .. p, "г" .. p, "е" .. p, "к" .. p, "н" .. p, "о" .. p, "у" .. p, "у" .. p, "х" .. p, "ы" .. p}
},
},
standardChars = {
Cyrl = "АаӘәБбВвГгҒғДдЕеЁёЖжЗзИиЙйКкҚқЛлМмНнҢңОоӨөПпРрСсТтУуҰұҮүФфХхҺһЦцЧчШшЩщЪъЫыІіЬьЭэЮюЯя",
c.punc
},
}
m["kl"] = {
"Greenlandic",
25355,
"esx-inu",
"Latn",
sort_key = {
from = {"æ", "ø", "å"},
to = {"z" .. p, "z" .. p, "z" .. p}
}
}
m["km"] = {
"Khmer",
9205,
"mkh-kmr",
"Khmr",
ancestors = "xhm",
translit = "km-translit",
}
m["kn"] = {
"Kannada",
33673,
"dra-kan",
"Knda, Tutg",
ancestors = "dra-mkn",
translit = "kn-translit",
}
m["ko"] = {
"Korean",
9176,
"qfa-kor",
"Kore, Brai",
ancestors = "ko-ear",
translit = {Kore = "ko-translit"},
entry_name = {Kore = s},
}
m["kr"] = {
"Kanuri",
36094,
"ssa-sah",
"Latn, Arab",
entry_name = {Latn = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.breve}}, -- the sortkey and entry_name are only for standard Kanuri; when dialectal entries get added, someone will have to work out how the dialects should be represented orthographically
sort_key = {
Latn = {
from = {"ǝ", "ny", "ɍ", "sh"},
to = {"e" .. p, "n" .. p, "r" .. p, "s" .. p}
},
},
}
m["ks"] = {
"Kashmiri",
33552,
"inc-kas",
"ks-Arab, Deva, Shrd, Latn",
translit = {
["ks-Arab"] = "ks-Arab-translit",
Deva = "ks-Deva-translit",
Shrd = "Shrd-translit",
},
}
-- "kv" IS TREATED AS "koi", "kpv", SEE WT:LT
m["kw"] = {
"Cornish",
25289,
"cel-brs",
"Latn",
ancestors = "cnx",
sort_key = {
from = {"ch"},
to = {"c" .. p}
},
}
m["ky"] = {
"Kyrgyz",
9255,
"trk-kkp",
"Cyrl, Latn, Arab",
translit = {Cyrl = "ky-translit"},
override_translit = true,
sort_key = {
Cyrl = {
from = {"ё", "ң", "ө", "ү"},
to = {"е" .. p, "н" .. p, "о" .. p, "у" .. p}
},
},
}
m["la"] = {
"Latin",
397,
"itc",
"Latn, Ital",
ancestors = "itc-ola",
entry_name = {Latn = {remove_diacritics = c.macron .. c.breve .. c.diaer .. c.dinvbreve}},
sort_key = {
remove_diacritics = c.circ .. c.tilde .. c.macron .. c.diaer .. c.zigzag .. c.dmacron .. c.dtilde .. c.small_a .. c.small_e .. c.small_i .. c.small_o .. c.small_u, -- Medieval abbreviations.
Latn = {
from = {"æ", "œ", ""},
to = {"ae", "oe", "p"}
},
},
standardChars = {
Latn = "AaBbCcDdEeFfGgHhIiLlMmNnOoPpQqRrSsTtUuVvXxZz",
c.punc
},
}
m["lb"] = {
"Luxembourgish",
9051,
"gmw-hgm",
"Latn",
ancestors = "gmw-cfr",
sort_key = {
from = {"ä", "ë", "é"},
to = {"z" .. p, "z" .. p, "z" .. p}
},
}
m["lg"] = {
"Luganda",
33368,
"bnt-nyg",
"Latn",
entry_name = {remove_diacritics = c.acute .. c.circ},
sort_key = {
from = {"ŋ"},
to = {"n" .. p}
},
}
m["li"] = {
"Limburgish",
102172,
"gmw-frk",
"Latn",
ancestors = "dum",
}
m["ln"] = {
"Lingala",
36217,
"bnt-bmo",
"Latn",
sort_key = {
remove_diacritics = c.acute .. c.circ .. c.caron,
from = {"ɛ", "gb", "mb", "mp", "nd", "ng", "nk", "ns", "nt", "ny", "nz", "ɔ"},
to = {"e" .. p, "g" .. p, "m" .. p, "m" .. p, "n" .. p, "n" .. p, "n" .. p, "n" .. p, "n" .. p, "n" .. p, "n" .. p, "o" .. p}
},
}
m["lo"] = {
"Lao",
9211,
"tai-swe",
"Laoo",
translit = "lo-translit",
sort_key = "Laoo-sortkey",
standardChars = "0-9ກຂຄງຈຊຍດຕຖທນບປຜຝພຟມຢຣລວສຫອຮຯ-ໝ" .. c.punc,
}
m["lt"] = {
"Lithuanian",
9083,
"bat-eas",
"Latn",
ancestors = "olt",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.tilde},
sort_key = {
from = {"ą", "č", "ę", "ė", "į", "y", "š", "ų", "ū", "ž"},
to = {"a" .. p, "c" .. p, "e" .. p, "e" .. p, "i" .. p, "i" .. p, "s" .. p, "u" .. p, "u" .. p, "z" .. p}
},
standardChars = "AaĄąBbCcČčDdEeĘęĖėFfGgHhIiĮįYyJjKkLlMmNnOoPpRrSsŠšTtUuŲųŪūVvZzŽž" .. c.punc,
}
m["lu"] = {
"Luba-Katanga",
36157,
"bnt-lub",
"Latn",
}
m["lv"] = {
"Latvian",
9078,
"bat-eas",
"Latn",
entry_name = {
-- This attempts to convert vowels with tone marks to vowels either with or without macrons. Specifically, there should be no macrons if the vowel is part of a diphthong (including resonant diphthongs such as pìrksts -> pirksts not #pīrksts). What we do is first convert the vowel + tone mark to a vowel + tilde in a decomposed fashion, then remove the tilde in diphthongs, then convert the remaining vowel + tilde sequences to macroned vowels, then delete any other tilde. We leave already-macroned vowels alone: both e.g. ar and ār occur before consonants. FIXME: This still might not be sufficient.
from = {"()" .. c.cedilla, "", "()" .. c.tilde .."?()" .. c.tilde .. "?()", "()" .. c.tilde .."?()" .. c.tilde .."?$", "()" .. c.tilde .. "?()" .. c.tilde .. "?", "()" .. c.tilde, c.tilde},
to = {"%1", c.tilde, "%1%2%3", "%1%2", "%1%2", "%1" .. c.macron}
},
sort_key = {
from = {"ā", "č", "ē", "ģ", "ī", "ķ", "ļ", "ņ", "š", "ū", "ž"},
to = {"a" .. p, "c" .. p, "e" .. p, "g" .. p, "i" .. p, "k" .. p, "l" .. p, "n" .. p, "s" .. p, "u" .. p, "z" .. p}
},
standardChars = "AaĀāBbCcČčDdEeĒēFfGgĢģHhIiĪīJjKkĶķLlĻļMmNnŅņOoPpRrSsŠšTtUuŪūVvZzŽž" .. c.punc,
}
m["mg"] = {
"Malagasy",
7930,
"poz-bre",
"Latn",
}
m["mh"] = {
"Marshallese",
36280,
"poz-mic",
"Latn",
sort_key = {
from = {"ā", "ļ", "m̧", "ņ", "n̄", "o̧", "ō", "ū"},
to = {"a" .. p, "l" .. p, "m" .. p, "n" .. p, "n" .. p, "o" .. p, "o" .. p, "u" .. p}
},
}
m["mi"] = {
"Maori",
36451,
"poz-pep",
"Latn",
sort_key = {
remove_diacritics = c.macron,
from = {"ng", "wh"},
to = {"z" .. p, "z" .. p}
},
}
m["mk"] = {
"Macedonian",
9296,
"zls",
"Cyrl, Grek",
ancestors = "cu",
translit = {Cyrl = "mk-translit"},
entry_name = {Cyrl = {
remove_diacritics = c.acute,
remove_exceptions = {"Ѓ", "ѓ", "Ќ", "ќ"}
}},
sort_key = {Cyrl = {
remove_diacritics = c.grave,
remove_exceptions = {"ѓ", "ќ"},
from = {"ѓ", "ѕ", "ј", "љ", "њ", "ќ", "џ"},
to = {"д" .. p, "з" .. p, "и" .. p, "л" .. p, "н" .. p, "т" .. p, "ч" .. p}
}},
standardChars = {
Cyrl = "АаБбВвГгДдЃѓЕеЖжЗзЅѕИиЈјКкЛлЉљМмНнЊњОоПпРрСсТтЌќУуФфХхЦцЧчЏџШш",
c.punc
},
}
m["ml"] = {
"Malayalam",
36236,
"dra-mal",
"Mlym",
translit = "ml-translit",
override_translit = true,
}
m["mn"] = {
"Mongolian",
9246,
"xgn-cen",
"Cyrl, Mong, Latn, Brai",
ancestors = "cmg",
translit = {
Cyrl = "mn-translit",
Mong = "Mong-translit",
},
override_translit = true,
display_text = {Mong = s},
entry_name = {
Cyrl = {remove_diacritics = c.grave .. c.acute},
Mong = s,
},
sort_key = {
Cyrl = {
remove_diacritics = c.grave,
from = {"ё", "ө", "ү"},
to = {"е" .. p, "о" .. p, "у" .. p}
},
},
standardChars = {
Cyrl = "АаБбВвГгДдЕеЁёЖжЗзИиЙйЛлМмНнОоӨөРрСсТтУуҮүХхЦцЧчШшЫыЬьЭэЮюЯя—",
Brai = c.braille,
c.punc
},
}
-- "mo" IS TREATED AS "ro", SEE WT:LT
m["mr"] = {
"Marathi",
1571,
"inc-sou",
"Deva, Modi",
ancestors = "omr",
translit = {
Deva = "mr-translit",
Modi = "mr-Modi-translit",
},
entry_name = {
Deva = {
from = {"च़", "ज़", "झ़"},
to = {"च", "ज", "झ"}
},
},
}
m["ms"] = {
"Malay",
9237,
"poz-mly",
"Latn, ms-Arab",
ancestors = "ms-cla",
standardChars = {
Latn = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz",
c.punc
},
}
m["mt"] = {
"Maltese",
9166,
"sem-arb",
"Latn",
display_text = {
from = {"'"},
to = {"’"}
},
entry_name = {
from = {"’"},
to = {"'"},
},
ancestors = "sqr",
sort_key = {
from = {
"ċ", "ġ", "ż", -- Convert into PUA so that decomposed form does not get caught by the next step.
"([cgz])", -- Ensure "c" comes after "ċ", "g" comes after "ġ" and "z" comes after "ż".
"g" .. p .. "ħ", -- "għ" after initial conversion of "g".
p, p, "ħ", "ie", p -- Convert "ċ", "ġ", "ħ", "ie", "ż" into final output.
},
to = {
p, p, p,
"%1" .. p,
"g" .. p,
"c", "g", "h" .. p, "i" .. p, "z"
}
},
}
m["my"] = {
"Burmese",
9228,
"tbq-brm",
"Mymr",
ancestors = "obr",
translit = "my-translit",
override_translit = true,
sort_key = {
from = {"ျ", "ြ", "ွ", "ှ", "ဿ"},
to = {"္ယ", "္ရ", "္ဝ", "္ဟ", "သ္သ"}
},
}
m["na"] = {
"Nauruan",
13307,
"poz-mic",
"Latn",
}
m["nb"] = {
"Norwegian Bokmål",
25167,
"gmq",
"Latn",
wikimedia_codes = "no",
ancestors = "gmq-mno, da", -- da as an (but not the) ancestor of nb was agreed on - do not change without discussion
sort_key = s,
standardChars = s,
}
m["nd"] = {
"Northern Ndebele",
35613,
"bnt-ngu",
"Latn",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.macron .. c.caron},
}
m["ne"] = {
"Nepali",
33823,
"inc-pah",
"Deva, Newa",
translit = {Deva = "ne-translit"},
}
m["ng"] = {
"Ndonga",
33900,
"bnt-ova",
"Latn",
}
m["nl"] = {
"Dutch",
7411,
"gmw-frk",
"Latn, Brai",
ancestors = "dum",
sort_key = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.diaer .. c.ringabove .. c.cedilla .. "'"},
standardChars = {
Latn = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz",
Brai = c.braille,
c.punc
},
}
m["nn"] = {
"Norwegian Nynorsk",
25164,
"gmq-wes",
"Latn",
ancestors = "gmq-mno",
entry_name = {
remove_diacritics = c.grave .. c.acute,
},
sort_key = s,
standardChars = s,
}
m["no"] = {
"Norwegian",
9043,
"gmq-wes",
"Latn",
ancestors = "gmq-mno",
sort_key = s,
standardChars = s,
}
m["nr"] = {
"Southern Ndebele",
36785,
"bnt-ngu",
"Latn",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.macron .. c.caron},
}
m["nv"] = {
"Navajo",
13310,
"apa",
"Latn",
sort_key = {
remove_diacritics = c.acute .. c.ogonek,
from = {
"chʼ", "tłʼ", "tsʼ", -- 3 chars
"ch", "dl", "dz", "gh", "hw", "kʼ", "kw", "sh", "tł", "ts", "zh", -- 2 chars
"ł", "ʼ" -- 1 char
},
to = {
"c" .. p, "t" .. p, "t" .. p,
"c" .. p, "d" .. p, "d" .. p, "g" .. p, "h" .. p, "k" .. p, "k" .. p, "s" .. p, "t" .. p, "t" .. p, "z" .. p,
"l" .. p, "z" .. p
}
},
}
m["ny"] = {
"Chichewa",
33273,
"bnt-nys",
"Latn",
entry_name = {remove_diacritics = c.acute .. c.circ},
sort_key = {
from = {"ng'"},
to = {"ng"}
},
}
m["oc"] = {
"Occitan",
14185,
"roa-ocr",
"Latn, Hebr",
ancestors = "pro",
sort_key = {
Latn = {
remove_diacritics = c.grave .. c.acute .. c.diaer .. c.cedilla,
from = {"()·h"},
to = {"%1h"}
},
},
}
m["oj"] = {
"Ojibwe",
33875,
"alg",
"Cans, Latn",
sort_key = {
Latn = {
from = {"aa", "ʼ", "ii", "oo", "sh", "zh"},
to = {"a" .. p, "h" .. p, "i" .. p, "o" .. p, "s" .. p, "z" .. p}
},
},
}
m["om"] = {
"Oromo",
33864,
"cus-eas",
"Latn, Ethi",
}
m["or"] = {
"Odia",
33810,
"inc-eas",
"Orya",
ancestors = "inc-mor",
translit = "or-translit",
}
m["os"] = {
"Ossetian",
33968,
"xsc",
"Cyrl, Geor, Latn",
ancestors = "oos",
translit = {
Cyrl = "os-translit",
Geor = "Geor-translit",
},
override_translit = true,
display_text = {
Cyrl = {
from = {"æ"},
to = {"ӕ"}
},
Latn = {
from = {"ӕ"},
to = {"æ"}
},
},
entry_name = {
Cyrl = {
remove_diacritics = c.grave .. c.acute,
from = {"æ"},
to = {"ӕ"}
},
Latn = {
from = {"ӕ"},
to = {"æ"}
},
},
sort_key = {
Cyrl = {
from = {"ӕ", "гъ", "дж", "дз", "ё", "къ", "пъ", "тъ", "хъ", "цъ", "чъ"},
to = {"а" .. p, "г" .. p, "д" .. p, "д" .. p, "е" .. p, "к" .. p, "п" .. p, "т" .. p, "х" .. p, "ц" .. p, "ч" .. p}
},
},
}
m["pa"] = {
"Punjabi",
58635,
"inc-pan",
"Guru, pa-Arab",
ancestors = "inc-opa",
translit = {
Guru = "Guru-translit",
["pa-Arab"] = "pa-Arab-translit",
},
entry_name = {
["pa-Arab"] = {
remove_diacritics = c.fathatan .. c.dammatan .. c.kasratan .. c.fatha .. c.damma .. c.kasra .. c.shadda .. c.sukun .. c.nunghunna,
from = {"ݨ", "ࣇ"},
to = {"ن", "ل"}
},
},
}
m["pi"] = {
"Pali",
36727,
"inc-mid",
"Latn, Brah, Deva, Beng, Sinh, Mymr, Thai, Lana, Laoo, Khmr, Cakm", --and also Khom
ancestors = "sa",
translit = {
Brah = "Brah-translit",
Deva = "sa-translit",
Beng = "pi-translit",
Sinh = "si-translit",
Mymr = "pi-translit",
Thai = "pi-translit",
Lana = "pi-translit",
Laoo = "pi-translit",
Khmr = "pi-translit",
Cakm = "Cakm-translit",
},
entry_name = {
Thai = {
from = {"ึ", u(0xF700), u(0xF70F)}, -- FIXME: Not clear what's going on with the PUA characters here.
to = {"ิํ", "ฐ", "ญ"}
},
remove_diacritics = c.VS01
},
sort_key = { -- FIXME: This needs to be converted into the current standardized format.
from = {"ā", "ī", "ū", "ḍ", "ḷ", "m", "ṅ", "ñ", "ṇ", "ṭ", "()()", "()()", "ᩔ", "ᩕ", "ᩖ", "ᩘ", "()ᩛ", "()ᩛ", "ᩤ", u(0xFE00), u(0x200D)},
to = {"a~", "i~", "u~", "d~", "l~", "m~", "n~", "n~~", "n~~~", "t~", "%2%1", "%2%1", "ᩈ᩠ᩈ", "᩠ᩁ", "᩠ᩃ", "ᨦ᩠", "%1᩠ᨮ", "%1᩠ᨻ", "ᩣ"}
},
}
m["pl"] = {
"Polish",
809,
"zlw-lch",
"Latn",
ancestors = "zlw-mpl",
sort_key = {
from = {"ą", "ć", "ę", "ł", "ń", "ó", "ś", "ź", "ż"},
to = {"a" .. p, "c" .. p, "e" .. p, "l" .. p, "n" .. p, "o" .. p, "s" .. p, "z" .. p, "z" .. p}
},
standardChars = "AaĄąBbCcĆćDdEeĘęFfGgHhIiJjKkLlŁłMmNnŃńOoÓóPpRrSsŚśTtUuWwYyZzŹźŻż" .. c.punc,
}
m["ps"] = {
"Pashto",
58680,
"ira-pat",
"ps-Arab",
entry_name = {remove_diacritics = c.fathatan .. c.dammatan .. c.kasratan .. c.fatha .. c.damma .. c.kasra .. c.shadda .. c.sukun .. c.zwarakay .. c.superalef},
}
m["pt"] = {
"Portuguese",
5146,
"roa-ibe",
"Latn, Brai",
ancestors = "roa-opt",
sort_key = {
Latn = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.macron .. c.diaer .. c.cedilla,
from = {"ª", "æ", "º", "œ"},
to = {"a", "ae", "o", "oe"}
},
},
standardChars = {
Latn = "AaÁáÂâÃãBbCcÇçDdEeÉéÊêFfGgHhIiÍíJjLlMmNnOoÓóÔôÕõPpQqRrSsTtUuÚúVvXxZz",
Brai = c.braille,
c.punc
},
}
m["qu"] = {
"Quechua",
5218,
"qwe",
"Latn",
}
m["rm"] = {
"Romansch",
13199,
"roa-rhe",
"Latn",
sort_key = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.diaer .. c.small_e},
}
m["ro"] = {
"Romanian",
7913,
"roa-eas",
"Latn, Cyrl, Cyrs",
translit = {Cyrl = "ro-translit"},
sort_key = {
Latn = {
remove_diacritics = c.grave .. c.acute,
from = {"ă", "â", "î", "ș", "ț"},
to = {"a" .. p, "a" .. p, "i" .. p, "s" .. p, "t" .. p}
},
Cyrl = {
from = {"ӂ"},
to = {"ж" .. p}
},
},
standardChars = {
Latn = "AaĂăÂâBbCcDdEeFfGgHhIiÎîJjLlMmNnOoPpRrSsȘșTtȚțUuVvXxZz",
Cyrl = "АаБбВвГгДдЕеЖжӁӂЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЫыЬьЭэЮюЯя",
c.punc
},
}
m["ru"] = {
"Russian",
7737,
"zle",
"Cyrl, Brai",
ancestors = "zle-mru",
translit = {Cyrl = "ru-translit"},
display_text = {
from = {"'"},
to = {"’"}
},
entry_name = {
remove_diacritics = c.grave .. c.acute .. c.diaer,
remove_exceptions = {"Ё", "ё", "Ѣ̈", "ѣ̈", "Я̈", "я̈"},
from = {"’"},
to = {"'"},
},
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.diaer,
remove_exceptions = {"ё", "ѣ̈", "я̈"},
from = {
"ё", "ѣ̈", "я̈", -- 2 chars
"і", "ѣ", "ѳ", "ѵ" -- 1 char
},
to = {
"е" .. p, "ь" .. p, "я" .. p,
"и" .. p, "ь" .. p, "я" .. p, "я" .. p
}
},
standardChars = {
Cyrl = "АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя—",
Brai = c.braille,
(c.punc:gsub("'", "")) -- Exclude apostrophe.
},
}
m["rw"] = {
"Rwanda-Rundi",
3217514,
"bnt-glb",
"Latn",
entry_name = {remove_diacritics = c.acute .. c.circ .. c.macron .. c.caron},
}
m["sa"] = {
"Sanskrit",
11059,
"inc",
"as-Beng, Bali, Beng, Bhks, Brah, Mymr, xwo-Mong, Deva, Gujr, Guru, Gran, Hani, Java, Kthi, Knda, Kawi, Khar, Khmr, Laoo, Mlym, mnc-Mong, Marc, Modi, Mong, Nand, Newa, Orya, Phag, Ranj, Saur, Shrd, Sidd, Sinh, Soyo, Lana, Takr, Taml, Tang, Telu, Thai, Tibt, Tutg, Tirh, Zanb", --and also Khom; script codes sorted by canonical name rather than code
translit = {
Beng = "sa-Beng-translit",
["as-Beng"] = "sa-Beng-translit",
Brah = "Brah-translit",
Deva = "sa-translit",
Gujr = "sa-Gujr-translit",
Guru = "sa-Guru-translit",
Java = "sa-Java-translit",
Kthi = "sa-Kthi-translit",
Khmr = "pi-translit",
Knda = "sa-Knda-translit",
Lana = "pi-translit",
Laoo = "pi-translit",
Mlym = "sa-Mlym-translit",
Modi = "sa-Modi-translit",
Mong = "Mong-translit",
["mnc-Mong"] = "mnc-translit",
["xwo-Mong"] = "xal-translit",
Mymr = "pi-translit",
Orya = "sa-Orya-translit",
Shrd = "Shrd-translit",
Sidd = "Sidd-translit",
Sinh = "si-translit",
Taml = "sa-Taml-translit",
Telu = "sa-Telu-translit",
Thai = "pi-translit",
Tibt = "Tibt-translit",
},
display_text = {
Mong = s,
Tibt = s,
},
entry_name = {
Mong = s,
Tibt = s,
Thai = {
from = {"ึ", u(0xF700), u(0xF70F)}, -- FIXME: Not clear what's going on with the PUA characters here.
to = {"ิํ", "ฐ", "ญ"}
},
remove_diacritics = c.VS01 .. c.udatta .. c.anudatta
},
sort_key = {
Tibt = "Tibt-sortkey",
{ -- FIXME: This needs to be converted into the current standardized format.
from = {"ā", "ī", "ū", "ḍ", "ḷ", "ḹ", "m", "ṅ", "ñ", "ṇ", "ṛ", "ṝ", "ś", "ṣ", "ṭ", "()()", "()()", "ᩔ", "ᩕ", "ᩖ", "ᩘ", "()ᩛ", "()ᩛ", "ᩤ", u(0xFE00), u(0x200D)},
to = {"a~", "i~", "u~", "d~", "l~", "l~~", "m~", "n~", "n~~", "n~~~", "r~", "r~~", "s~", "s~~", "t~", "%2%1", "%2%1", "ᩈ᩠ᩈ", "᩠ᩁ", "᩠ᩃ", "ᨦ᩠", "%1᩠ᨮ", "%1᩠ᨻ", "ᩣ"},
},
},
}
m["sc"] = {
"Sardinian",
33976,
"roa",
"Latn",
}
m["sd"] = {
"Sindhi",
33997,
"inc-snd",
"sd-Arab, Deva, Sind, Khoj",
translit = {Sind = "Sind-translit"},
entry_name = {
["sd-Arab"] = {
remove_diacritics = c.kashida .. c.fathatan .. c.dammatan .. c.kasratan .. c.fatha .. c.damma .. c.kasra .. c.shadda .. c.sukun .. c.superalef,
from = {"ٱ"},
to = {"ا"}
},
},
}
m["se"] = {
"Northern Sami",
33947,
"smi",
"Latn",
display_text = {
from = {"'"},
to = {"ˈ"}
},
entry_name = {remove_diacritics = c.macron .. c.dotbelow .. "'ˈ"},
sort_key = {
from = {"á", "č", "đ", "ŋ", "š", "ŧ", "ž"},
to = {"a" .. p, "c" .. p, "d" .. p, "n" .. p, "s" .. p, "t" .. p, "z" .. p}
},
standardChars = "AaÁáBbCcČčDdĐđEeFfGgHhIiJjKkLlMmNnŊŋOoPpRrSsŠšTtŦŧUuVvZzŽž" .. c.punc,
}
m["sg"] = {
"Sango",
33954,
"crp",
"Latn",
ancestors = "ngb",
}
m["sh"] = {
"Serbo-Croatian",
9301,
"zls",
"Latn, Cyrl, Glag",
ietf_subtag = "hbs", -- ISO 639-3 code, since "sh" is deprecated from ISO 639-1
wikimedia_codes = "sh, bs, hr, sr",
entry_name = {
Latn = {
remove_diacritics = c.grave .. c.acute .. c.tilde .. c.macron .. c.dgrave .. c.invbreve,
remove_exceptions = {"Ć", "ć", "Ś", "ś", "Ź", "ź"}
},
Cyrl = {
remove_diacritics = c.grave .. c.acute .. c.tilde .. c.macron .. c.dgrave .. c.invbreve,
remove_exceptions = {"З́", "з́", "С́", "с́"}
},
},
sort_key = {
Latn = {
remove_diacritics = c.grave .. c.acute .. c.tilde .. c.macron .. c.dgrave .. c.invbreve,
remove_exceptions = {"ć", "ś", "ź"},
from = {"č", "ć", "dž", "đ", "lj", "nj", "š", "ś", "ž", "ź"},
to = {"c" .. p, "c" .. p, "d" .. p, "d" .. p, "l" .. p, "n" .. p, "s" .. p, "s" .. p, "z" .. p, "z" .. p}
},
Cyrl = {
remove_diacritics = c.grave .. c.acute .. c.tilde .. c.macron .. c.dgrave .. c.invbreve,
remove_exceptions = {"з́", "с́"},
from = {"ђ", "з́", "ј", "љ", "њ", "с́", "ћ", "џ"},
to = {"д" .. p, "з" .. p, "и" .. p, "л" .. p, "н" .. p, "с" .. p, "т" .. p, "ч" .. p}
},
},
standardChars = {
Latn = "AaBbCcČčĆćDdĐđEeFfGgHhIiJjKkLlMmNnOoPpRrSsŠšTtUuVvZzŽž",
Cyrl = "АаБбВвГгДдЂђЕеЖжЗзИиЈјКкЛлЉљМмНнЊњОоПпРрСсТтЋћУуФфХхЦцЧчЏџШш",
c.punc
},
}
m["si"] = {
"Sinhalese",
13267,
"inc-ins",
"Sinh",
translit = "si-translit",
override_translit = true,
}
m["sk"] = {
"Slovak",
9058,
"zlw",
"Latn",
ancestors = "zlw-osk",
sort_key = {remove_diacritics = c.acute .. c.circ .. c.diaer .. c.caron},
standardChars = "AaÁáÄäBbCcČčDdĎďEeFfGgHhIiÍíJjKkLlĹ弾MmNnŇňOoÔôPpRrŔŕSsŠšTtŤťUuÚúVvYyÝýZzŽž" .. c.punc,
}
m["sl"] = {
"Slovene",
9063,
"zls",
"Latn",
entry_name = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.macron .. c.dgrave .. c.invbreve .. c.dotbelow,
remove_exceptions = {"Ć", "ć", "Ǵ", "ǵ", "Ś", "ś", "Ź", "ź"},
from = {"Ə", "ə", "Ł", "ł"},
to = {"E", "e", "L", "l"},
},
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.macron .. c.dotabove .. c.ringabove .. c.dgrave .. c.invbreve .. c.dotbelow .. c.ringbelow .. c.ogonek,
remove_exceptions = {"ć", "ǵ", "ś", "ź"},
from = {"ä", "č", "ć", "đ", "ə", "ë", "ǧ", "ǵ", "ï", "ł", "ö", "š", "ś", "ü", "ž", "ź"},
to = {"a" .. p, "c" .. p, "c" .. p, "d" .. p, "e", "e" .. p, "g" .. p, "g" .. p, "i" .. p, "l", "o" .. p, "s" .. p, "s" .. p, "u" .. p, "z" .. p, "z" .. p},
},
standardChars = "AaBbCcČčDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsŠšTtUuVvZzŽž" .. c.punc,
}
m["sm"] = {
"Samoan",
34011,
"poz-pnp",
"Latn",
}
m["sn"] = {
"Shona",
34004,
"bnt-sho",
"Latn",
entry_name = {remove_diacritics = c.acute},
}
m["so"] = {
"Somali",
13275,
"cus-som",
"Latn, Arab, Osma",
entry_name = {Latn = {remove_diacritics = c.grave .. c.acute .. c.circ}},
}
m["sq"] = {
"Albanian",
8748,
"sqj",
"Latn, Grek, ota-Arab, Elba, Todr, Vith",
translit = {Elba = "Elba-translit"},
entry_name = {Latn = {
remove_diacritics = c.acute,
from = {'^ (%w)', '^të (%w)'}, to = {'%1', '%1'},
}},
sort_key = {Latn = {
remove_diacritics = c.acute .. c.circ .. c.tilde .. c.breve .. c.caron,
from = {'ç', 'dh', 'ë', 'gj', 'll', 'nj', 'rr', 'sh', 'th', 'xh', 'zh'},
to = {'c'..p, 'd'..p, 'e'..p, 'g'..p, 'l'..p, 'n'..p, 'r'..p, 's'..p, 't'..p, 'x'..p, 'z'..p},
}},
standardChars = {
Latn = "AaBbCcÇçDdEeËëFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvXxYyZz",
c.punc
},
}
m["ss"] = {
"Swazi",
34014,
"bnt-ngu",
"Latn",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.macron .. c.caron},
}
m["st"] = {
"Sotho",
34340,
"bnt-sts",
"Latn",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.macron .. c.caron},
}
m["su"] = {
"Sundanese",
34002,
"poz-msa",
"Latn, Sund",
ancestors = "osn",
translit = {Sund = "su-translit"},
}
m["sv"] = {
"Swedish",
9027,
"gmq-eas",
"Latn",
ancestors = "gmq-osw-lat",
sort_key = {
remove_diacritics = c.grave .. c.acute .. c.circ .. c.tilde .. c.macron .. c.dacute .. c.caron .. c.cedilla .. "':",
remove_exceptions = {"å"},
from = {"ø", "æ", "œ", "ß", "å", "aͤ", "oͤ"},
to = {"o", "ae", "oe", "ss", "z" .. p, "ä", "ö"}
},
standardChars = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsTtUuVvXxYyÅåÄäÖö" .. c.punc,
}
m["sw"] = {
"Swahili",
7838,
"bnt-swh",
"Latn, Arab",
sort_key = {
Latn = {
from = {"ng'"},
to = {"ng" .. p}
},
},
}
m["ta"] = {
"Tamil",
5885,
"dra-tam",
"Taml",
ancestors = "ta-mid",
translit = "ta-translit",
override_translit = true,
}
m["te"] = {
"Telugu",
8097,
"dra-tel",
"Telu",
translit = "te-translit",
override_translit = true,
}
m["tg"] = {
"Tajik",
9260,
"ira-swi",
"Cyrl, fa-Arab, Latn",
ancestors = "fa-cls",
translit = {Cyrl = "tg-translit"},
override_translit = true,
entry_name = {remove_diacritics = c.grave .. c.acute},
sort_key = {
Cyrl = {
from = {"ғ", "ё", "ӣ", "қ", "ӯ", "ҳ", "ҷ"},
to = {"г" .. p, "е" .. p, "и" .. p, "к" .. p, "у" .. p, "х" .. p, "ч" .. p}
},
},
}
m["th"] = {
"Thai",
9217,
"tai-swe",
"Thai, Brai", --and also Khom
translit = {Thai = "th-translit"},
sort_key = {Thai = "Thai-sortkey"},
}
m["ti"] = {
"Tigrinya",
34124,
"sem-eth",
"Ethi",
translit = "Ethi-translit",
}
m["tk"] = {
"Turkmen",
9267,
"trk-ogz",
"Latn, Cyrl, Arab",
entry_name = {remove_diacritics = c.macron},
sort_key = {
Latn = {
from = {"ç", "ä", "ž", "ň", "ö", "ş", "ü", "ý"},
to = {"c" .. p, "e" .. p, "j" .. p, "n" .. p, "o" .. p, "s" .. p, "u" .. p, "y" .. p}
},
Cyrl = {
from = {"ё", "җ", "ң", "ө", "ү", "ә"},
to = {"е" .. p, "ж" .. p, "н" .. p, "о" .. p, "у" .. p, "э" .. p}
},
},
}
m["tl"] = {
"Tagalog",
34057,
"phi",
"Latn, Tglg",
translit = {Tglg = "tl-translit"},
override_translit = true,
entry_name = {Latn = {remove_diacritics = c.grave .. c.acute .. c.circ}},
standardChars = {
Latn = "AaBbKkDdEeGgHhIiLlMmNnOoPpRrSsTtUuWwYy",
c.punc
},
sort_key = {
Latn = "tl-sortkey",
},
}
m["tn"] = {
"Tswana",
34137,
"bnt-sts",
"Latn",
}
m["to"] = {
"Tongan",
34094,
"poz-ton",
"Latn",
entry_name = {remove_diacritics = c.acute},
sort_key = {remove_diacritics = c.macron},
}
m["tr"] = {
"Turkish",
256,
"trk-ogz",
"Latn",
ancestors = "ota",
dotted_dotless_i = true,
sort_key = {
from = {
-- Ignore circumflex, but account for capital Î wrongly becoming ı + circ due to dotted dotless I logic.
"ı" .. c.circ, c.circ,
"i", -- Ensure "i" comes after "ı".
"ç", "ğ", "ı", "ö", "ş", "ü"
},
to = {
"i", "",
"i" .. p,
"c" .. p, "g" .. p, "i", "o" .. p, "s" .. p, "u" .. p
}
},
standardChars = "AaÂâBbCcÇçDdEeFfGgĞğHhIıİiÎîJjKkLlMmNnOoÖöPpRrSsŞşTtUuÛûÜüVvYyZz" .. c.punc,
}
m["ts"] = {
"Tsonga",
34327,
"bnt-tsr",
"Latn",
}
m["tt"] = {
"Tatar",
25285,
"trk-kbu",
"Cyrl, Latn, tt-Arab",
translit = {Cyrl = "tt-translit"},
override_translit = true,
dotted_dotless_i = true,
sort_key = {
Cyrl = {
from = {"ә", "ў", "ғ", "ё", "җ", "қ", "ң", "ө", "ү", "һ"},
to = {"а" .. p, "в" .. p, "г" .. p, "е" .. p, "ж" .. p, "к" .. p, "н" .. p, "о" .. p, "у" .. p, "х" .. p}
},
Latn = {
from = {
"i", -- Ensure "i" comes after "ı".
"ä", "ə", "ç", "ğ", "ı", "ñ", "ŋ", "ö", "ɵ", "ş", "ü"
},
to = {
"i" .. p,
"a" .. p, "a" .. p, "c" .. p, "g" .. p, "i", "n" .. p, "n" .. p, "o" .. p, "o" .. p, "s" .. p, "u" .. p
}
},
},
}
-- "tw" IS TREATED AS "ak", SEE WT:LT
m["ty"] = {
"Tahitian",
34128,
"poz-pep",
"Latn",
}
m["ug"] = {
"Uyghur",
13263,
"trk-kar",
"ug-Arab, Latn, Cyrl",
ancestors = "chg",
translit = {
["ug-Arab"] = "ug-translit",
Cyrl = "ug-translit",
},
override_translit = true,
}
m["uk"] = {
"Ukrainian",
8798,
"zle",
"Cyrl",
ancestors = "zle-ouk",
translit = "uk-translit",
entry_name = {remove_diacritics = c.grave .. c.acute},
sort_key = {
remove_diacritics = c.grave .. c.acute,
from = {
"ї", -- 2 chars
"ґ", "є", "і" -- 1 char
},
to = {
"и" .. p,
"г" .. p, "е" .. p, "и" .. p
}
},
standardChars = "АаБбВвГгДдЕеЄєЖжЗзИиІіЇїЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЬьЮюЯя" .. c.punc:gsub("'", ""), -- Exclude apostrophe.
}
m["ur"] = {
"Urdu",
1617,
"inc-hnd",
"ur-Arab,Hebr",
translit = {["ur-Arab"] = "ur-translit"},
entry_name = {
-- character "ۂ" code U+06C2 to "ه" and "هٔ" (U+0647 + U+0654) to "ه"; hamzatu l-waṣli to a regular alif
from = {"هٔ", "ۂ", "ٱ"},
to = {"ہ", "ہ", "ا"},
remove_diacritics = c.fathatan .. c.dammatan .. c.kasratan .. c.fatha .. c.damma .. c.kasra .. c.shadda .. c.sukun .. c.nunghunna .. c.superalef
},
-- put Judeo-Urdu (Hebrew-script Urdu) under the category header
-- U+FB21 HEBREW LETTER WIDE ALEF so that it sorts after Arabic script titles
sort_key = {
from = {"^%f"},
to = {u(0xFB21)},
},
standardChars = "ایببپتثجچحخدذرزژسشصضطظعغفقکگلࣇڷمنݨوؤہھئٹڈڑآے" .. c.punc,
}
m["uz"] = {
"Uzbek",
9264,
"trk-kar",
"Latn, Cyrl, fa-Arab",
ancestors = "chg",
translit = {Cyrl = "uz-translit"},
sort_key = {
Latn = {
from = {"oʻ", "gʻ", "sh", "ch", "ng"},
to = {"z" .. p, "z" .. p, "z" .. p, "z" .. p, "z" .. p}
},
Cyrl = {
from = {"ё", "ў", "қ", "ғ", "ҳ"},
to = {"е" .. p, "я" .. p, "я" .. p, "я" .. p, "я" .. p}
},
},
}
m["ve"] = {
"Venda",
32704,
"bnt-bso",
"Latn",
}
m["vi"] = {
"Vietnamese",
9199,
"mkh-vie",
"Latn, Hani",
ancestors = "mkh-mvi",
sort_key = {
Latn = "vi-sortkey",
Hani = "Hani-sortkey",
},
}
m["vo"] = {
"Volapük",
36986,
"art",
"Latn",
}
m["wa"] = {
"Walloon",
34219,
"roa-oil",
"Latn",
sort_key = s,
}
m["wo"] = {
"Wolof",
34257,
"alv-fwo",
"Latn, Arab, Gara",
}
m["xh"] = {
"Xhosa",
13218,
"bnt-ngu",
"Latn",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.macron .. c.caron},
}
m["yi"] = {
"Yiddish",
8641,
"gmw-hgm",
"Hebr",
ancestors = "gmh",
translit = "yi-translit",
sort_key = {
from = {"א", "בּ", "ו", "יִ", "ײַ", "פֿ"},
to = {"א", "ב", "ו", "י", "יי", "פ"}
},
}
m["yo"] = {
"Yoruba",
34311,
"alv-yor",
"Latn, Arab",
entry_name = {Latn = {remove_diacritics = c.grave .. c.acute .. c.macron}},
sort_key = {
Latn = {
from = {"ẹ", "ɛ", "gb", "ị", "kp", "ọ", "ɔ", "ṣ", "sh", "ụ"},
to = {"e" .. p, "e" .. p, "g" .. p, "i" .. p, "k" .. p, "o" .. p, "o" .. p, "s" .. p, "s" .. p, "u" .. p}
},
},
}
m["za"] = {
"Zhuang",
13216,
"tai",
"Latn, Hani",
sort_key = {
Latn = "za-sortkey",
Hani = "Hani-sortkey",
},
}
m["zh"] = {
"Chinese",
7850,
"zhx",
"Hants, Latn, Bopo, Nshu, Brai",
ancestors = "ltc",
generate_forms = "zh-generateforms",
translit = {
Hani = "zh-translit",
Bopo = "zh-translit",
},
sort_key = {Hani = "Hani-sortkey"},
}
m["zu"] = {
"Zulu",
10179,
"bnt-ngu",
"Latn",
entry_name = {remove_diacritics = c.grave .. c.acute .. c.circ .. c.macron .. c.caron},
}
return m_lang.finalizeLanguageData(m_lang.addDefaultTypes(m, true))