Wiktionary:Beer parlour/2023/September


Nuqtaless forms in Hindi

Nuqtaless terms like अर्ज are treated only as alternative spellings here. Those words without nuqta do not exist merely because of poor typesetting; they are also pronounced without the nuqta sounds: 'arz' is also pronounced 'arj'. So in those entries the native pronunciation should be given preference, and in the declension sections the transliteration reflecting the non-nuqta variant should be used, or perhaps both variants. कालमैत्री (talk) 02:38, 1 September 2023 (UTC)

No. The transliterations should be distinguished by the spelling. Where it may make sense to automatically include and prioritise the nuktaless forms is the pronunciation sections. --RichardW57m (talk) 10:44, 1 September 2023 (UTC)
This is what I said: अर्ज should be transliterated as arj, which it isn't, just like in other nuqtaless entries. कालमैत्री (talk) 11:35, 1 September 2023 (UTC)
@RichardW57m कालमैत्री (talk) 11:35, 1 September 2023 (UTC)
@कालमैत्री I am inclined to agree with Richard here if I understand what you say correctly. I think the way it's currently done is correct; dictionaries should show the forms with nuqta except in the pronunciation sections (where the pronunciation as 'arj' is already given as an alternative). The only case I think it makes sense not to have the nuqtaless form be a soft redirect is if it's taken on meanings other than the nuqta-full form. Benwing2 (talk) 20:19, 2 September 2023 (UTC)
@Benwing2 I agree with you, but there are already many entries without nuqta. So should they not show the transliteration of the non-nuqta form? Perhaps there is a misunderstanding: I am talking about the non-nuqta entries, not the nuqta ones; the former currently show the transliteration of the nuqta form. कालमैत्री (talk) 02:29, 3 September 2023 (UTC)
@कालमैत्री Since it appears from the pronunciation that the nuqtaless forms are mere spelling variants of the forms with nuqta, I don't agree that the translit should be based on the nuqtaless form. Unless the pronunciation is consistently different between nuqta-full and nuqtaless forms, the translits should be the same. This is analogous to how we handle Russian written forms with е in place of ё. Benwing2 (talk) 02:43, 3 September 2023 (UTC)
@Benwing2 They are not mere spelling variants but pronunciation variants too, as regional Hindi speakers use the pronunciation of the nuqtaless variant. So both transliterations can be used in the non-nuqta entry. Or is this unnecessary? कालमैत्री (talk) 02:52, 3 September 2023 (UTC)
@कालमैत्री I don't think it's necessary to include both, as the nuqtaless pronunciation is optional. Benwing2 (talk) 02:56, 3 September 2023 (UTC)
@Benwing2 Well, the nuqta pronunciation is similarly optional in those entries. The अंग्रेज entry uses pronunciation audio without nuqta sounds. कालमैत्री (talk) 03:11, 3 September 2023 (UTC)
@कालमैत्री, @Benwing2, @RichardW57m: The situation with nuqta and nuqta-less forms is indeed very similar to Russian ё (jo) / е (je) words. Regardless of the pronunciation, the spelling with е (je) is more common in regular running Russian texts for native speakers.
  1. ё is standard, е is non-standard or just a relaxed spelling of ё: свёкла (svjókla, beetroot) and свекла́ (sveklá)
  2. е is standard, афе́ра (aféra, shady deal) and афёра (afjóra)
What can be done for Hindi is to provide alternative entry lines where both the transliteration and the spelling are nuqtaless. Please take a look at this revision with my new changes of अर्ज (arj) with nuqtaless and alt. form handling. Also drawing attention of @AryamanA. Anatoli T. (обсудить/вклад) 06:28, 3 September 2023 (UTC)
@Atitarev, Benwing2, कालमैत्री: IMO this format is a bit cluttered. I would prefer just giving nuqtaless form as the definition, both pronunciations in IPA, but only the nuqtaless transliteration in the headword. This is just a special case of alt form so having both alt form and nuqtaless form as defns is redundant. —AryamanA (मुझसे बात करेंयोगदान) 21:03, 3 September 2023 (UTC)
@AryamanA, @Benwing2, @कालमैत्री: Thanks for your response, Aryaman. I can revert my edit later but I've got some obvious questions:
In the case of अर्ज़ (arz) vs अर्ज (arj), the nuqtaless form is not only an alternative spelling but a spelling that matches the pronunciation. Is that always the case? And are there words or specific nuqta letters where this is not true? For example, is फिल्म (philm) ever pronounced as /pʰɪlm/, not /fɪlm/, as opposed to फ़िल्म (film)?
Since I don't know enough Hindi to judge, I'll use another analogy in Russian "ё" vs "е" spellings.
Unlike свёкла/свекла, афера/афёра where one pronunciation is proscribed but is acceptable, in case of самолёт (samoljót), it's ALWAYS pronounced as if it's spelled so, even if it's spelled самолет (samolet) (not to be confused with spellings and pronunciations in other languages, such as Bulgarian).
So, in the case of свёкла/свекла, афера/афёра - two definition lines with two distinct pronunciations are appropriate.
In case of самолёт/самолет, only a soft-redirect is used.
Hope it's not confusing, please advise your thoughts. Anatoli T. (обсудить/вклад) 01:52, 4 September 2023 (UTC)
@Atitarev Yes, film is pronounced as philm in villages and also by those who speak a different dialect (however, adding a separate entry for the other might be worthless). As for whether it should have the nuqta or the nuqtaless transliteration, I don't know. कालमैत्री (talk) 04:20, 4 September 2023 (UTC)
@कालमैत्री @Atitarev @AryamanA Correct me if I'm wrong but I don't think फिल्म vs. फ़िल्म ever really represent distinct pronunciations. As the last comment says, the word film (spelled either way) can be pronounced philm in villages and some dialects. So it is correct to indicate one as an alt form of the other. Benwing2 (talk) 04:50, 4 September 2023 (UTC)
  1. @Benwing2. The question was, should nuqtaless फिल्म (philm) be {{hi-noun|g=f|tr=film}} or just {{hi-noun|g=f}} (automatically transliterated as "philm"), or should it have two definition lines, which AryamanA opposes? AryamanA's simpler suggestion to have it both ways in the pronunciation section and no manual translit in the headword will work for me as well. I've made अर्ज (arj) simpler in this revision.
Anatoli T. (обсудить/вклад) 05:04, 4 September 2023 (UTC)
@Atitarev I see, yes I agree with not having two POS headers or definition lines. I would probably rather include the manual translit since the pronunciation is not determined by whether there's a nuqta or not, and having a difference in translit could wrongly lead someone to believe this. Benwing2 (talk) 05:22, 4 September 2023 (UTC)
@Benwing2: But having the same transliteration for two different spellings could lead people to believe there was a speck of dust on the screen. The punctilious would use the different spellings to indicate whether /f/ was permitted or not. --RichardW57m (talk) 10:39, 4 September 2023 (UTC)
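As an illustration of the relationship being discussed, the following is a minimal Python sketch (not anything Wiktionary's modules are confirmed to do) of deriving the nuqtaless spelling from a nuqta spelling. It assumes that removing U+093C DEVANAGARI SIGN NUKTA after canonical decomposition is an adequate definition of "nuqtaless"; the function names `strip_nuqta` and `is_nuqtaless_variant` are hypothetical. Note that it would also strip the nuqta from ड़ and ढ़, which may not be desirable for native Hindi words.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA (combining dot below)


def strip_nuqta(text: str) -> str:
    """Return text with every nuqta removed (hypothetical helper).

    NFD first splits precomposed letters such as U+095B (za) into
    base letter + nuqta, so both encodings are handled the same way.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


def is_nuqtaless_variant(plain: str, dotted: str) -> bool:
    """True if `plain` is exactly `dotted` with its nuqtas removed."""
    same_spelling = unicodedata.normalize("NFD", plain) == unicodedata.normalize("NFD", dotted)
    return not same_spelling and strip_nuqta(dotted) == unicodedata.normalize("NFC", plain)


if __name__ == "__main__":
    print(strip_nuqta("अर्ज़"))                      # the arz spelling without its nuqta
    print(is_nuqtaless_variant("फिल्म", "फ़िल्म"))
```

A check like this only says that two spellings differ by a nuqta; it says nothing about which pronunciation or transliteration an entry should show, which is the actual point under discussion.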

Automatic transliteration of katakana and hiragana

(Notifying Eirikr, TAKASUGI Shinji, Atitarev, Fish bowl, Poketalker, Cnilep, Marlin Setia1, Huhu9001, 荒巻モロゾフ, 片割れ靴下, Onionbar, Shen233, Alves9, Cpt.Guapo, Sartma, Lugria, LittleWhole, Chuterix, Mcph2): Is there any reason we don't have automatic transliteration of katakana and hiragana? It seems silly that we have to add manual transliterations to things like {{l|ja|アメリカ}} and {{l|ja|すし}}. —Mahāgaja · talk 07:44, 1 September 2023 (UTC)

Probably because of potential word boundaries / spacing. —Fish bowl (talk) 07:45, 1 September 2023 (UTC)
Is that more of an issue for katakana/hiragana than it is for hangeul, which does have automatic transliteration? —Mahāgaja · talk 07:48, 1 September 2023 (UTC)
Hangeul has spacing and word boundaries are clearer. AG202 (talk) 05:29, 3 September 2023 (UTC)
Instead of {{l|ja|アメリカ}}, I just use {{ja-r|アメリカ}}, which gives アメリカ (amerika). Mcph2 (talk) 07:48, 1 September 2023 (UTC)
OK, but in translation tables we (have to?) use {{t}}, which also doesn't support automatic transliteration. —Mahāgaja · talk 07:50, 1 September 2023 (UTC)
Oh, @Theknightwho has been working on automatic Japanese transliteration these days, he might have a solution. Mcph2 (talk) 07:59, 1 September 2023 (UTC)
Spacing can be manually added, e.g. {{ja-r|あい うえお}}: あいうえお (ai ueo). Mcph2 (talk) 08:03, 1 September 2023 (UTC)
Well, exactly. Are there any issues to having automatic transliteration in {{t}} (for example) that {{ja-r}} hasn't already solved? And it could work for the hiragana transliteration of kanji terms in {{t}} as well. At the moment, we have to write {{t+|ja|子猫|tr=こねこ, koneko}} at kitten#Translations, but surely it should be doable to just write {{t+|ja|子猫|tr=こねこ}} and have a module generate the romaji koneko automatically. —Mahāgaja · talk 08:13, 1 September 2023 (UTC)
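To make the request concrete, here is a rough Python sketch of what "generate the romaji from the kana" could look like. It is not the code of {{ja-r}} or of any Wiktionary module; the table, the function name `kana_to_romaji`, and the choice to return `None` for anything that is not plain kana are all assumptions for illustration. It ignores word spacing, capitalisation, macrons for long vowels, and particle readings, which are the hard parts raised elsewhere in this thread.

```python
VOWELS = "aiueo"
ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ",
}
# Regular consonant + vowel syllables, then the Hepburn irregulars.
TABLE = {kana: cons + v for cons, row in ROWS.items() for kana, v in zip(row, VOWELS)}
TABLE.update({"し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu", "じ": "ji",
              "ぢ": "ji", "づ": "zu", "や": "ya", "ゆ": "yu", "よ": "yo",
              "わ": "wa", "を": "o", "ん": "n"})
SMALL_Y = {"ゃ": "a", "ゅ": "u", "ょ": "o"}


def kana_to_romaji(text):
    """Romanize a kana-only string; return None if anything else is present."""
    out = []
    double_next = False
    for ch in text:
        if "\u30a1" <= ch <= "\u30f6":        # katakana -> matching hiragana
            ch = chr(ord(ch) - 0x60)
        if ch == "っ":                         # sokuon doubles the next consonant
            double_next = True
            continue
        if ch in SMALL_Y and out and len(out[-1]) > 1 and out[-1].endswith("i"):
            stem = out[-1][:-1]                # ki + small yo -> kyo, shi + small ya -> sha
            out[-1] = stem + ("" if stem in ("sh", "ch", "j") else "y") + SMALL_Y[ch]
            continue
        if ch == "ー" and out and out[-1][-1] in VOWELS:
            out.append(out[-1][-1])            # long mark: repeat the vowel (no macron)
            continue
        if ch not in TABLE:                    # kanji, Latin, punctuation, ...
            return None
        syllable = TABLE[ch]
        if double_next:
            syllable = ("t" if syllable.startswith("ch") else syllable[0]) + syllable
            double_next = False
        out.append(syllable)
    return "".join(out)


if __name__ == "__main__":
    for word in ("こねこ", "アメリカ", "すし", "子猫"):
        print(word, kana_to_romaji(word))
```

With something along these lines behind {{t+}}, |tr=こねこ could in principle be turned into koneko without the editor typing it, while 子猫 on its own would still need a reading supplied from somewhere.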
@Mahāgaja -- Please don't use both kana and romaji in translation tables. The layout is already very tight, kana is unusable and potentially confusing to much of our readership, and kana text adds nothing useful anyway that we can't get from romanization.
If we don't have automatic kana → romaji conversion, please use {{t+|ja|子猫|tr=koneko}} instead.
If we do have automatic kana → romaji conversion, @User:Theknightwho, for translation tables especially, please don't use ruby -- again, the layout of translation tables is very tight and ruby text above kanji pushes things around in unhappy ways, much of our readership cannot read kana and would find this confusing, there are other usability problems (such as cut-and-paste issues discussed elsewhere), and ruby doesn't add any useful information anyway that cannot be gleaned from the romanization. ‑‑ Eiríkr Útlendi │Tala við mig 17:52, 1 September 2023 (UTC)
@Eirikr There's no ruby there at the moment, and your concerns are exactly why I think we need to discuss things first before implementing big changes like that. I don't think it's insurmountable, but I haven't had the time to look into it yet. That being said, I'm not sure I completely agree with you re the value of rubytext, but that's a separate issue that we've already talked about before. Theknightwho (talk) 17:59, 1 September 2023 (UTC)
@Eirikr: I almost never add Japanese translations myself, but the facts on the ground are that almost all Japanese lines in translation boxes that involve kanji have both hiragana and romaji in the transliteration field. —Mahāgaja · talk 20:36, 1 September 2023 (UTC)
I haven't made any comprehensive effort to check EN entries for JA translations. Those that I've encountered have been scattershot, with kana present more frequently in what appeared to be older edits.
At any rate, I am strongly opposed to including kana in the parens in translation tables -- these are not useful for most readers, and the romanization suffices. I am baffled that people add the kana; it seems editors get lost in the "cool" factor of another script, and don't consider usability / usefulness. By way of counterexample, we don't include bopomofo for Chinese, for instance. ‑‑ Eiríkr Útlendi │Tala við mig 21:31, 1 September 2023 (UTC)
They're pretty useful to me... AG202 (talk) 05:31, 3 September 2023 (UTC)
(Belated reply, but hey...)
@AG202, how are kana useful in translation tables? This would be in cases such as:
The kana only reproduces (some of) the same information as in the romanization. The only differences are that 1) the kana obscures the word boundaries, and 2) the romanization obscures whether long-"o" is from おう or おお. I honestly don't see how the kana string here is at all useful.
Could you expand on how you find kana transliterations useful in translation tables? ‑‑ Eiríkr Útlendi │Tala við mig 20:07, 1 November 2023 (UTC)
I admit I'm probably not "most users" in this case, but for me, it's much much clearer than the romanization in terms of reading speed, and it helps if I want to understand how the Kanji maps to the hiragana. It especially helps with Okinawan and other non-Japanese Japonic languages, since for those, the romanization is less clear and it also shows how they might be written in hiragana.
Also, if Korean were still written entirely in Hanja, I'd expect the same thing (which we kind of already do with parentheses like at Korea#Translations). AG202 (talk) 21:08, 1 November 2023 (UTC)
@Eirikr Nihil sub sóle noví. For some reason, people who study Japanese often get lost in the "cool" factor and forget to be logical or practical. I hate that too, but I'm tired of talking to people who are bewitched and don't want to listen anyway. Japanese entries are a mess on so many levels, they make simple things more confusing and complicated for no good reason, but people like them that way. Thank god I'm fluent in Japanese and don't need Wiktionary. — Sartma 𒁾𒁉𒊭 𒌑𒊑𒀉𒁲 11:00, 17 October 2023 (UTC)
@Mahagaja I would wait till User:Theknightwho comes back on line, he is in the middle of implementing this. I think it works already if you explicitly specify the script as Hrkt. Benwing2 (talk) 08:49, 1 September 2023 (UTC)
@Fish bowl @Mahagaja @Mcph2 @Benwing2 It is actually already enabled if you manually specify the script as Hira, Kana or Hrkt: {{l|ja|^アメリカ|sc=Kana}} gives アメリカ (Amerika). However, this is a stopgap measure and I would prefer if we don't use it generally in entries, as adding script codes to everything would clutter up entries; it's only there so that non-Lua templates can use {{xlit}}. For now, it's best to stick with {{ja-r}} - not least because spaces aren't supported yet.
The reason for this is because I recently split Module:ja-translit in two: the old kana_to_romaji function has been replaced by Module:Hrkt-translit (Hrkt being the ISO code for all kana combined). I then moved the old module to Module:Jpan-translit, which works by scraping pages for readings in a similar fashion to the way Chinese transliteration works. The reading it generates is then given to Module:Hrkt-translit. Jpan-translit is not currently enabled, because it's pending further discussion about how we handle terms with multiple readings.
The reason for this new system is because the two modules work in very different ways, and it means we can avoid wasting resources if we know for certain that a given term is going to be in kana. There's also the fact that some languages (e.g. Ainu) don't use kanji at all, and so it makes sense to have kana transliteration be handled in a standalone way.
Just as a word of caution: don't confuse Kana (the code) with Kana (the script name). Unfortunately, the ISO picked Kana as the script code for katakana, and Hrkt for what they call "Japanese syllabaries" (i.e. hiragana + katakana, with hentaigana grouped under hiragana). I've given Hrkt the name "Kana" because it's the most accurate name for what it actually refers to, and I don't . It won't make any difference 99% of the time, but it's good to be aware just in case. Theknightwho (talk) 09:01, 1 September 2023 (UTC)
@Theknightwho Thanks for the summary. Can you answer the question of when we can expect {{l|ja|^アメリカ}} to work right without explicitly specifying the script code, and what needs to be done and what issues resolved in order for this to happen? Can't you just either rely on the autodetection of the script or make the translit module check the contents of the text being transliterated, so that if it sees it's all Kana (Hiragana or Katakana), it goes ahead and transliterates, and otherwise fails? Benwing2 (talk) 09:08, 1 September 2023 (UTC)
And is it possible to add romaji automatically to the hiragana transliteration in cases like {{t+|ja|子猫|tr=こねこ}} that I mentioned above? BTW, I had never heard the term hentaigana before, and I have to say it doesn't mean what I was expecting it to mean! Mahāgaja · talk 09:11, 1 September 2023 (UTC)
@Benwing2 My intention was that it'd be as soon as Module:Jpan-translit is enabled. At the moment, anything entered as hiragana, katakana (or a mix of the two) will always be detected as Jpan. We could use Module:Hrkt-translit for Jpan as a stopgap, and any incomplete transliterations should return nothing. Alternatively, we could make a specific code override a general code if there's a tiebreak, which would have the same result. That may be preferable, as it means script codes will be more accurate in general.
@Mahāgaja Not yet - the transliteration module can't override manual transliterations. We should be able to integrate the features of {{ja-r}} into the general link modules pretty soon, though, and at that point we should be able to update everything via bot (like we did with Mandarin). There'll need to be a few minor changes to make the syntax compatible, though, which is the main barrier at the moment. That will need to wait until I've finished my major rewrite of Module:languages and Module:links, and I don't want to add any new features to the current versions because they're already too complicated/messy as it is. That should hopefully be done by the end of the month, if not sooner, and at that point I can start working on this. No promises, though. Theknightwho (talk) 09:27, 1 September 2023 (UTC)
@Theknightwho I thought about this a bit. Changing the script detection to return Hrkt or something else other than Jpan is likely to break people's .CSS files that customize based on the Jpan script code. I would use Module:Hrkt-translit as Module:Jpan-translit and have it fail for now if it encounters Kanji. That puts a placeholder for when you resolve the issue of how to handle cases with more than one pronunciation. Benwing2 (talk) 20:15, 2 September 2023 (UTC)
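The "check the contents and fail on kanji" idea can be sketched roughly as follows in Python. This is only an illustration of the logic, not the behaviour of Module:scripts or Module:Hrkt-translit; the function names are hypothetical, and the assumption that the long-vowel mark ー and the middle dot ・ count as script-neutral is mine. The returned codes follow the usage described above: Hira, Kana (katakana), Hrkt (a mix of the two), and Jpan for anything containing other characters.

```python
NEUTRAL = {"ー", "・", " "}  # assumed to be compatible with either syllabary


def detect_kana_script(text):
    """Classify text as 'Hira', 'Kana', 'Hrkt' or 'Jpan' (hypothetical helper)."""
    has_hira = has_kata = False
    for ch in text:
        if ch in NEUTRAL:
            continue
        if "\u3041" <= ch <= "\u309f":      # Hiragana block
            has_hira = True
        elif "\u30a1" <= ch <= "\u30ff":    # Katakana block
            has_kata = True
        else:
            return "Jpan"                   # kanji or anything else
    if has_hira and has_kata:
        return "Hrkt"
    if has_hira:
        return "Hira"
    if has_kata:
        return "Kana"
    return "Jpan"                           # nothing but neutral characters


def can_transliterate_directly(text):
    """True when a kana-only transliterator could run without readings."""
    return detect_kana_script(text) != "Jpan"


if __name__ == "__main__":
    for sample in ("すし", "アメリカ", "あいうエオ", "子猫"):
        print(sample, detect_kana_script(sample), can_transliterate_directly(sample))
```

Under this approach the script code reported to the rest of the system could stay as Jpan, as suggested above, while the transliteration step makes its own decision from the text.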
Transliterating Japanese kana (both hiragana and katakana) is long overdue. It is, in fact, even simpler than Korean hangeul, but the following considerations should always be made:
  1. Spacing, capitalisation and the irregular readings of the particles は (wa) (spelled as "ha") and へ (e) (spelled as "he"): 東京(とうきょう)は日本(にほん)の首都(しゅと)です (Tōkyō wa Nihon no shuto desu.), どこへ行くの (doko e iku no?). Notice spacing in kana spellings, ^ and separation of particles.
  2. Morpheme boundaries and diphthong readings: 昨日(きのう) (kinō) vs 争(あらそ)う (arasou) and 新潟(にいがた) (Nīgata) vs 新(あたら)しい (atarashii). Notice the use of "." in kana. Please compare "ō" vs "ou" and "ī" vs "ii"; the difference in pronunciation/transliteration mostly depends on morpheme boundaries.
Anatoli T. (обсудить/вклад) 06:40, 3 September 2023 (UTC)
There are many languages (Yiddish is a notable example) where automatic transliteration has to be overridden for some words, so that shouldn't be a problem. We can use |tr=wa with templates like {{l}}, {{m}} and {{t}}, and use |subst=は//わ with {{ux}} and the "cite-" and "quote-" families of templates. —Mahāgaja · talk 19:25, 3 September 2023 (UTC)
@Mahagaja: Sure, that's all doable. Both {{ja-r}} and {{ja-x}} can handle irregular particle readings, as you can see in the examples or it can be done with substitutions as you suggested. Anatoli T. (обсудить/вклад) 23:02, 4 September 2023 (UTC)
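The substitution approach can be illustrated with a small Python sketch. It assumes a comma-separated FROM//TO specification like the |subst=は//わ example above, and, as a simplification of my own, it only replaces whole space-delimited words so that は inside a word such as はな is left alone. The function names are hypothetical and this is not the actual template code.

```python
def parse_subst(spec):
    """Parse 'FROM//TO,FROM//TO' into a list of (from, to) pairs."""
    pairs = []
    for item in spec.split(","):
        source, sep, target = item.partition("//")
        if not sep:
            raise ValueError(f"expected FROM//TO, got {item!r}")
        pairs.append((source.strip(), target.strip()))
    return pairs


def apply_subst(text, spec):
    """Replace whole space-separated words before transliteration.

    Only exact word matches are replaced, so the particle reading does
    not leak into ordinary words that merely contain the same kana.
    """
    mapping = dict(parse_subst(spec))
    return " ".join(mapping.get(word, word) for word in text.split(" "))


if __name__ == "__main__":
    # Spaced kana, as in the particle examples above.
    print(apply_subst("とうきょう は にほん の しゅと です", "は//わ"))
    print(apply_subst("どこ へ いく の", "は//わ,へ//え"))
```

The substituted string would then go to the ordinary kana transliterator, which is why the particle has to be separated by a space in the input.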

Splitting Quechua

Honestly, handling an entire family of mutually unintelligible languages which have their own ISO codes for a while now as four languages (based on the country and historical period they are/were spoken in) doesn't seem like a good idea in general. We'll need most of the codes mentioned here, but probably with slightly different names. If nobody has any fundamental issues with the split itself, I could start drawing up a list of codes and (proposed) names.

Related to this, I also believe we should prohibit the creation of lemmata of Standard Kichwa, as this case is almost identical to Standard Moroccan Amazigh: there are no speakers, and it is an artificially created mix of Ecuadorian Quechua varieties in use that only succeeds in making speakers unconfident in their own language use. Thadh (talk) 10:38, 1 September 2023 (UTC)

But are there not readers and writers? --RichardW57m (talk) 10:50, 1 September 2023 (UTC)
But so are there of Klingon and Na'vi. That doesn't make it a language worthy of inclusion in the mainspace. Thadh (talk) 13:08, 1 September 2023 (UTC)
@Thadh Are there any native speakers of Standard Kichwa per se, or are they all native speakers of one of the languages it aims to standardise? Theknightwho (talk) 16:31, 2 September 2023 (UTC)
@Theknightwho: They are all speakers of the distinct dialects, and according to the literature I've read, the speakers suffer quite a lot from the prescriptive nature of the standard (i.e. think their language 'isn't correct'). Thadh (talk) 01:15, 3 September 2023 (UTC)
Bokmål and MSA have no native speakers either; the Klingon and Na'vi comparison is fatuous. Do you have evidence of the stated effects of Standard(ized) Kichwa? In the meantime, I stand with @AG202's stance. ~ Blansheflur 。・:*:・゚❀,。 21:24, 3 September 2023 (UTC)
See the introduction of Aschmann's A reference grammar of Ecuadorian Quichua. I'll cite a couple of passages:
"“Unified Quichua” is a special form of Quichua which has been devised in recent decades, a certain amount of literature has been produced in it (including a Bible translation called Pachacamacpac Quillcashca Shimi), and educational programs have been carried out in it. Unified Quichua was in its origin an artificial language, a mixture of features from various Quichua languages, with all the Spanish borrowings replaced with old (obsolete) Quichua words which the people do not know or whose meanings have changed. (Many of these obsolete words are still contemporary in other regions, some being used in other Ecuadorian Quichua languages, others being Peruvian Quechua words, and others being coined based on existing Quichua forms.) One unfortunate effect of Unified Quichua has been to make those who speak Quichua as their native language feel like they do not speak it well, because they don’t speak it like the academicians say they should! In reality, the native Quichua speakers represent the continuous, native tradition of the language. Another negative result has been that the Quichua young people, who are in some cases being taught the Unified Quichua in school, feel like their grandparents speak the language incorrectly, whereas in reality their grandparents are the ones who speak the language best!"
Aschmann also references the paper by Grzech et al., which write the following in their conclusion:
" At the same time, the linguistic features of Unified Kichwa fail to adequately represent the language which these speakers – acutely aware of linguistic micro-variation and reliant on it for constructing social belonging – perceive as their own. The standard currently in place is divisive and remains largely unused, mostly due to the purist ideology from which it is derived."
There are more comments of this sort, but I believe this is more than enough to conclude that Unified Kichwa is pretty similar to Standard Moroccan Amazigh and also not something we'd want in our mainspace. Thadh (talk) 22:18, 3 September 2023 (UTC)
If there's been an agreement not to include SMA lemmas (as it sounds), then I stand with you, as those two cases are most comparable. So yes, I support everything you've put forth. ~ Blansheflur 。・:*:・゚❀,。 22:41, 3 September 2023 (UTC)
Support splitting Quechua. Vininn126 (talk) 18:00, 1 September 2023 (UTC)
Support - no reason why Quechua should be handled as a single language. Theknightwho (talk) 16:31, 2 September 2023 (UTC)
For context for others, see: the prior discussion on Kichwa. I support splitting Quechua, but I don't think I'd support prohibiting the creation of Standard Kichwa. Even if it's not necessarily spoken, it's still written and read, and it seems like a similar situation to Modern Standard Arabic or any other standard variety created specifically to try and "unite" other lects, for better or for worse. The fact that it makes speakers unconfident in their own usage is unfortunate, but that shouldn't stop us from including the entries if they are cited in usage (similar to how it's been made clear that we include derogatory terms). At best, we could add some kind of label or usage note to disambiguate "standard" terms. AG202 (talk) 05:43, 3 September 2023 (UTC)
I am convinced by User:AG202's argument that we should not prohibit adding Unified Quichua/Kichwa lemmas. Instead I think we should have a label "Unified Kichwa" or similar to identify them. It reminds me a bit of Rumantsch Grischun and Standard Basque, each of which is somewhat controversial and for which similar complaints have been made to the complaints being made here about Unified Kichwa, yet we don't prohibit them. In general we are a descriptive dictionary, and prohibiting a language because some people don't like it seems very prescriptivist. Benwing2 (talk) 04:38, 4 September 2023 (UTC)
I guess that is fine. I also found Category:Moroccan Amazigh language which makes me think I have misunderstood something of previous discussions? The naming made it difficult to find, and it links to an empty Wikipedia page. I am still not sure if the language can be considered a natural language or even in the same way that MSA or modern Hebrew is, but I guess it's fine to keep it, provided we label it "Unified Kichwa" and keep an eye on new editors adding terms in the other Kichwa varieties. Thadh (talk) 13:02, 4 September 2023 (UTC)
@Thadh The link to Wikipedia is misspelled; the article is at Standard Moroccan Amazigh. Benwing2 (talk) 20:39, 4 September 2023 (UTC)
Fixed the link. Benwing2 (talk) 20:49, 4 September 2023 (UTC)
Support ~ Blansheflur 。・:*:・゚❀,。 22:42, 3 September 2023 (UTC)
I have created this mockup based on the SIL classification. If anyone has any comments, please do tell me. Thadh (talk) 10:06, 20 September 2023 (UTC)
@Thadh How much overlap will there be in entries? Is a "Quechua macrolanguage" approach possible à la Chinese? Not favoring this approach specifically but given the large number of new languages being proposed (44 languages), it's hard for me to imagine editors having the stamina to enter significant numbers of lemmas for all of the languages. If the same spelling occurs for a large fraction of them, it might save a lot of effort to have one entry per spelling rather than duplicating the entry across several Quechua variants. I have been thinking recently we should formalize the concept of "macrolanguage" since this situation exists in several places across the world (cf. Arabic, Romani, Lahnda, Nahuatl, Malay, maybe Akan, etc.). Benwing2 (talk) 18:45, 20 September 2023 (UTC)
I'd strongly oppose any sort of macrolanguage approach such as Chinese. That leads to the issues we're facing now with Chinese where the lects outside of Mandarin and maybe Cantonese are undervalued/brushed-aside in terms of coverage, in addition to the problems in etymology and descendant sections. Presumably, this change was proposed in the first place to avoid that type of situation. There may be spellings that are the same, but inflections, declensions, pronunciations, etc. can be different. We should focus on whether the lects are mutually intelligible or not and group centered on that. As long as we separate languages like Danish/Swedish/Norwegian (don't even get me started on the 3 Norwegian headers) or Spanish/Galician/Portuguese at entries like ser, then we should avoid macrolanguage solutions for minority languages.
I know it's far from the intent, but it makes it feel like we care about underrepresented languages less. I know from experience that separating, for example, Jeju out from Korean has led to more focused attention to the language recently. It would not have gotten the same level of coverage as it does now at entries like (beot) if it were under Korean. It just needs good editing communities as it finally has now. AG202 (talk) 20:38, 20 September 2023 (UTC)
That being said, there may be some merit to how Wikipedia groups them at Template:Quechuan languages. AG202 (talk) 20:46, 20 September 2023 (UTC)
@AG202 Yes I figured you would make this argument, but I am somewhat offended you are implying I don't care about minority languages. It's quite the opposite. The 44 Quechua languages proposed are hardly parallel to the 3 Scandinavian languages or Spanish/Galician/Portuguese or even Jeju/Korean, and there is a very real problem with getting editors to care about minority languages. Do you really think having a situation where you have to enter the same lemma 44 times under 44 headers is going to encourage people to contribute more to Quechua languages? I am trying to think creatively about how to deal with what is a very real issue and your response is to cast aspersions. Yes, you feel strongly about this but please keep the emotions out of the discussion. Benwing2 (talk) 20:59, 20 September 2023 (UTC)
@Benwing2 I specifically said that I know that's not your (or anyone's) intent, and it was a statement of the project as a whole (I used "we" for a reason). I also did not state that we'd need 44 lemmas under the same page, in fact I found it unnecessary. I even gave a possible litmus test that we could follow such as mutual intelligibility. I also gave the example of how Wikipedia groups them. I do feel strongly about this issue, but I gave a solution and purposefully avoided casting judgment on any single person. I'm honestly somewhat offended that you turned what I said in that way (casting aspersions? I made no statements on anyone in particular's reputation or intent). "Yes I figured you would make this argument", also doesn't ring nicely to my ears. I know the work you've put in for minority languages, and I've appreciated it openly. I just don't think that we should have a unified Quechua like we do with Chinese, nor do I think we should repeat the Chinese method or anything like it with any group of languages, along with giving my rationale why and giving a separate solution. That's all. Please please make sure to read what I say thoroughly. AG202 (talk) 21:30, 20 September 2023 (UTC)
The issue is that the phonologies (and thus spelling) of these varieties are very different and, unlike with Chinese, you can't just list pronunciations in a box and be done with it because the morphological differences are substantial.
If at some point a good editor comes around and we find that a couple of varieties are easily treatable under one header, we can always merge them. But I wouldn't know the best way to group forty different lects with 2000 years of divergent evolution. Thadh (talk) 23:40, 20 September 2023 (UTC)
@Thadh If you really split into 44 varieties and people start actually entering terms under those separate varieties, it will not be easy to merge them. Just look at the situation with North and South Levantine Arabic, which the Arabic editors want to merge into Levantine Arabic. Back in January or so there was a long discussion about this but it's a huge effort since there are separate and overlapping lemmas, separate sets of headword and inflection templates, etc. I offered to help by bot but I don't have either the background or the time to do it all by myself, and as a result it's gone nowhere. If on the other hand there are no terms entered, merging will not be hard but then what is the point of having separate L2 headers if there aren't any terms? In general it's far better to do the design work up front rather than having to merge after the fact. Looking at the template that User:AG202 linked above and randomly clicking on the Kichwa language (which Wikipedia says is a single language despite having 12 separate ISO 639-3 codes), and looking at the sample sentence given, the dialects look awfully similar. The first and second sentences are identical, the third and fourth differ by two phonemes, the seventh and eighth differ by two phonemes, etc. These appear significantly closer than e.g. the various Occitan and Romansh dialects, both of which are inflected languages where the fact that you have multiple dialects differing somewhat in phonology and morphology hasn't prevented their grouping under a single L2 header. Given the similarities, I would oppose a 44-way split. Benwing2 (talk) 02:37, 21 September 2023 (UTC)
I am also extremely sceptical of the idea that a 44-way split is sensible. (Indeed, given the evidence that the dialects are very similar, I oppose it.) There are benefits, yes, to having reference works that drill down and focus only on one dialect, so someone can work comprehensively on that dialect, whether that's a variety of Quechua or e.g. the speech of north Boston; a specialized reference work specifically about that dialect is likely to cover it in more detail than even a comprehensive general work about the language as a whole like the OED. But there are also drawbacks, especially when one is (as we are) writing a work about more than just one dialect ... which is one reason we don't have separate L2s for ==Boston==, ==Tyneside==, ==Connacht== Irish vs ==Munster== Irish, etc. I would prefer a more serious proposal, looking at what differences are actually extreme enough to warrant splitting. - -sche (discuss) 03:44, 21 September 2023 (UTC)
@-sche: You do understand that there is a fundamental difference between handling standardised varieties (English, Irish) and unstandardised ones? We already are trying something like this with Karelian (cf kežä), and while this works somewhat (but not very well, because of lexical differences), it would not be maintainable if you had four or five of these dialects. Thadh (talk) 09:06, 21 September 2023 (UTC)
Oh, I've long understood it, as the same situation exists for Low German. Perhaps we lemmatize Quechua on its standardized variety despite that standard's differences from the dialects, like is done with e.g. Irish; perhaps we lemmatize on select regional forms if we pursue a split into a couple languages, perhaps we lemmatize on the forms closest to the common ancestor form as has been proposed for Low German, or perhaps we take the English approach (lemmatize any given word on whichever US-vs-UK form was entered first), but I doubt "Norwegian on steroids" is the best approach. - -sche (discuss) 17:05, 21 September 2023 (UTC)
@-sche: But Quechua doesn't have a standardised variety, and Kichwa (Ecuadorean Quechua) has a standard which, as I have explained above, is horrible and mostly unused. Our current entries in Quechua are almost entirely in Cuzco Quechua, which is simply the dialect spoken in the historical capital of the Inca empire - It has twice as many phonemes as Ayacucho Quechua for instance, and these are still both languages of a single branch.
And you are making false comparisons: We are talking about a language family, not a five-hundred-year-old branch spoken in Europe and written for all of its history. Two thousand years is approximately the time depth of the entire Germanic branch - We currently have 76 Germanic languages; Why would 44 Quechuan languages seem too much? Thadh (talk) 06:41, 22 September 2023 (UTC)
@Thadh Umm, it is you making false comparisons between the Germanic languages; there are probably more lemmas in German alone than there will ever be for all Quechuan varieties, however many we end up with. 44 languages (which BTW appear to me far more similar to each other than many of the Germanic languages, and which is far more than Wikipedia says there are) will be a prohibitive load for the editing community. In general we treat each family on its own and do not make facile comparisons like this. Benwing2 (talk) 06:52, 22 September 2023 (UTC)
Saying that we don't have to split a language well just because we won't have enough editors to support all languages makes no sense - A single Quechuan is a far greater problem than a few languages with no lemmas.
Again, speaking from experience with Karelian: Even though we now have infrastructure and it is only two varieties that are standardised, we can't update all the 700 pages the language already has because of multiple editing sprees (not to mention the ones that actually are Livvi but have been called "Karelian" either by mistake or before we split the two), resulting in new editors not using them and thus either providing untrue information (that a word is part of both standards) or incomplete information (not saying to which standard a word belongs). Having two separate L2s, North Karelian and South Karelian would resolve this.
Thadh (talk) 07:01, 22 September 2023 (UTC)
@Thadh But there is a huge difference between 2 Karelian languages and 44 Quechuan languages. I am not insisting on a single Quechuan language but I strongly oppose 44, which will almost certainly be unmanageable. A split to 5 or 6, per Wikipedia, sounds reasonable; differences within individual groups can be handled the same way they are for Occitan and Romansh. Benwing2 (talk) 07:06, 22 September 2023 (UTC)
But Karelian has only had three hundred years of divergence. You're still handling Quechua from the perspective that it is one language that should be split, whereas you should be looking at it like it is a family that should be split. 44 languages in a language family is not a lot. Thadh (talk) 07:27, 22 September 2023 (UTC)
Comparing Quechua to Norwegian is a really bad comparison. It's much closer to Slavic or Mayan in terms of similarity/relatedness, and linguists across the board agree those shouldn't be classified as a single language. Vininn126 (talk) 06:50, 22 September 2023 (UTC)
@Vininn126 I don't think User:-sche is arguing for one language, simply not 44. Benwing2 (talk) 06:54, 22 September 2023 (UTC)
I really don't see why us neglecting this family for a decade should mean that we can't split it into as many languages as it needs to be split. I'm all for merging a few languages if you have any ideas on that and if it makes sense, but saying that we can't split them at all because "44 is too much" makes no sense. 44 is better than 1 - that's just the way it is. Thadh (talk) 07:09, 22 September 2023 (UTC)
@Thadh Blah, what you're essentially saying is "I don't like your reasons but I'm not giving any of my own; my proposal is just better." I would take your proposal more seriously if you gave reasons why the multiple arguments I've made aren't good. Otherwise we're just talking past each other. Benwing2 (talk) 07:18, 22 September 2023 (UTC)
No, your arguments so far have been: "I've seen one sentence translated into seven closely related varieties and two of them turned out to be identical, SO we need to reduce the number of splits" - Great, but the 44 I've given is the starting point for that, not the 1 single Quechua. Thadh (talk) 07:30, 22 September 2023 (UTC)
@Thadh I agree with User:-sche that the default is non-addition rather than addition. No matter how many times and how loudly you insist that 44 is "better than 1", "not a lot", what it "needs to be split" into etc., you've given no substantive arguments that we need this many L2's. You simply threw out a proposal and are defending it largely on the basis of IMO specious comparisons to unrelated language families. I can speak from experience that splitting languages is easier than merging (having successfully split Kurdish and having unsuccessfully tried to merge South and North Levantine Arabic), and I've seen the mess that results when we have several similar L2 lects, with splits that may be ill-considered (Arabic, with 20+ nearly identical headword modules; Norwegian; Romani; etc., and I bet there are similar issues with Nahuatl but I haven't looked closely). I find sche's proposal of 5-8 lects compelling and I'm willing to accept that, but I am completely unconvinced we need 44. Benwing2 (talk) 20:40, 23 September 2023 (UTC)
Any argument that a 44-way split is not "sensible" simply because the number 44 is so high is the classical fallacy of "this has worked until now, why can't it still work?" We should be asking ourselves: if the linguistic realities of Quechua were exactly the same but we were only now beginning to document it, which approach would we take? If we decide that each of the lects should get its own L2, then we should do that regardless of how not "sensible" a split may seem. — SURJECTION / T / C / L / 07:27, 22 September 2023 (UTC)
This, this, this. We should be looking at what varieties actually need their own L2s, as we would if they were new, closely related lects coming up for possible addition. We shouldn't be going "what's a good number? 44 sounds good. just add an L2 for every lect that has a distinct name, regardless of how distinct it is, and assume we'll evaluate how many we should actually have later". Let's evaluate how similar the lects are now. - -sche (discuss) 22:32, 22 September 2023 (UTC)
Then let's do it, but again, the starting point is 44 - if we can't figure this out, then keeping one language is simply not good enough. Thadh (talk) 09:33, 23 September 2023 (UTC)

The starting point when someone proposes "add an L2 for x" is non-addition; it must be demonstrated to be needed. Looking at the literature (Adelaar 2004; Adelaar 2012 citing Torero 2002 and earlier; Lyle Campbell's American Indian Languages 2000, citing Cerrón Palomino 1987 and Mannheim 1991; Landerman 1991), I see support for between five and eight languages:

  1. Central Quechua.
  2. Pacaraos Quechua.
  3. Yungay or Cajamarca-Lambayeque Quechua, sometimes split in two, with ...
    1. the mutually-intelligible, 94%-similar Cajamarca-Lambayeque lect to which the Cajamarca and Lambayeque dialects belong being separated from ...
    2. the "Central" Yungay lect to which the Lincha and Laraos dialects belong.
  4. Southern Quechua (for which Cerrón Palomino devised an orthography); sometimes split, with ...
    1. Ayacucho ...
    2. and Cuzco being separated due to phonological developments, with ...
    3. Collao and Bolivian Quechua either grouped with Cuzco or forming a third unit.
  5. Northern Quechua; from which Landerman separates out ...
    1. North Peruvian Quechua
    2. (vs the remainder of Northern Quechua)

If there is evidence that the other ~35 varieties are sufficiently divergent from these, it would be most helpful. - -sche (discuss) 16:43, 23 September 2023 (UTC)

Support, but I think we should try to figure out which varieties (if any) are mutually intelligible, and keep similar varieties together. But many varieties of Quechua seem to be completely unintelligible to each other, so splitting Quechua is a no-brainer. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 07:28, 22 September 2023 (UTC)
@-sche, Benwing2: I'm going to try to explain this one last time:
We have a language family that is now treated by one L2. No person in their right mind would handle Quechua as one language, so saying that "non-addition is the starting point" is just wrong.
Our handling of language codes is always based on ISO in the first place, with mergers and splits being discussed based on that. ISO treats this as 44 languages. I've just literally taken over the codes from ISO and given them different names, that's all I have done so far, and I have asked everyone to give feedback on possible mergers within.
Now, we can discuss this, and I am open to any change. But saying that I have to find evidence that supports a split into any two languages because we already have one Quechua code is absurd. As Surjection has said, we need to analyse this as if we're adding a new language, not as if we're splitting an already "working" language, because it isn't working: You cannot handle Russian and Czech as one language, and you cannot handle Quechuan languages as one. Just because at some point we made a terrible mistake by giving the entire family just one code instead of a bunch of them doesn't make these lects more closely related or easier to handle as one L2. So please, stop putting the burden of proof on me alone and instead help me out by discussing which of the 44 codes can and cannot be merged. Thadh (talk) 22:13, 23 September 2023 (UTC)
I agree with @Thadh on this. Theknightwho (talk) 22:18, 23 September 2023 (UTC)
I also agree. And reiterate again, I never want to repeat the Chinese system again with any other language family. AG202 (talk) 22:41, 23 September 2023 (UTC)
We definitely need a way of handling dialect continua, but until we have such a system it is better to split than to merge at this stage, I'd say. Theknightwho (talk) 23:23, 23 September 2023 (UTC)
@Theknightwho What do you agree with? Having 44 codes or not having 1? @Thadh I am sure you think I'm not reading what you've said, but it feels to me like it is you who has not read anything I've written. I have already said I am open to having more than one (I said 5-6 earlier, before sche's proposal), just not the maximalist approach of 44, which you seem to be insisting on. User:-sche just gave a specific proposal based on mutual intelligibility, with 5-8 languages depending on our choices, and I agreed with it. In my experience it is easier to split than to merge, as I have already said, so I would err on the side of fewer L2's (i.e. 5 rather than 8), but I am not going to object if the consensus is to have 8. I have given several examples of successful ways of treating multiple dialects under a single L2 header, and I can help with the technical aspects of implementing this, but I don't want to be in the situation of having to clean up 44 copied sets of templates and modules, similar to the mess of the current Arabic lects. Benwing2 (talk) 00:16, 24 September 2023 (UTC)
@Benwing2 I think the starting point should be the ISO. It is essentially arbitrary that we decided not to go with it when creating 8,000+ language codes back in 2011 or so, and unless anyone can comment here as an expert then we have to accept that we don’t have the knowledge or skills to decide differently.
I simply disagree that 44 is a maximalist approach - instead, it feels like someone arbitrarily decided that a whole language family should be grouped together because they had no idea what they were doing, and right now it’s in our interests to rectify that to what should have been done in the first place. If it later turns out to require mergers, then we should do that based on the expertise of any active Quechuan editors, and it shouldn’t pose too much difficulty since the worst-case scenario is sticking labels on every entry. Splitting is much more laborious. Theknightwho (talk) 00:25, 24 September 2023 (UTC)
@Theknightwho Well, I disagree here about splitting vs. merging based on my experience doing both. I also think that since ISO (or rather Ethnologue, which is where the ISO codes come from) has the concept of "macrolanguage", they are content to give separate codes to mutually intelligible lects: they can also assign a code to the macrolanguage. My very first comment in this thread was that we need to adopt a macrolanguage approach of some sort as well, but this was universally shot down. So if we're not gonna do that, we need to be more parsimonious about our splits. Benwing2 (talk) 01:20, 24 September 2023 (UTC)
@Benwing2 But Spanish and Portuguese are just as mutually intelligible, or more in some cases. Mutual intelligibility alone is not a good justification for grouping separate lects together unless we can show that they are actually treated as a single language by their users. Theknightwho (talk) 01:39, 24 September 2023 (UTC)
This is precisely why I gave the Spanish/Portuguese/Galician example earlier (not including Old Spanish, Old Galician-Portuguese as well). The concept of merging always seems to land on languages that have less coverage and fewer editors, and decisions are often made by a few folks, and then have to be corrected when editors eventually come by (if they ever do). It’s what leads us to three Norwegian L2s vs our Quechuan macrolanguage situation right now. “Duplication” of modules and templates and such can be done well if enough time and effort are given to them as well. Again, this critique does not fall on any single person, but these minority languages can also reach the coverage of others if they’re given the space and care to begin with. AG202 (talk) 01:55, 24 September 2023 (UTC)
I disagree there is relevance between the case of Quechua and Spanish/Portuguese, and I disagree that we need to respect how native speakers view the situation. If the latter were true, we would keep Kurdish and Arabic together, and split Serbo-Croatian. In the former case, there are separate literary standards and incompatible spelling conventions between Spanish and Portuguese. (If Galician used the reintegrationist spelling, I would argue for its merger with Portuguese, BTW.) There is a huge difference in editor resources between Spanish, Portuguese and Quechua, and we cannot pretend this difference away. Yes, in some ways this is deplorable, but it is the way it is. We need to do what is most workable for our editors given the resources we have, and fundamentally, having 44 L2 lects will greatly increase the editor burden compared with 5-8. @AG202 Yes, you can abstract out the differences between templates and modules if you know what you're doing, but most people don't, and I don't trust that this will happen with 44 Quechua lects given it didn't happen with Arabic, Nahuatl or other such situations. Benwing2 (talk) 02:12, 24 September 2023 (UTC)
I’m not arguing that we should have 44 L2s, but I also don’t think that just because there are fewer resources that we should continue the issue on our own end. In the spelling case with Galician & Portuguese, honestly, the different spelling/literary standards have not stopped others in the past (Serbo-Croatian for example) and one could easily use labels for it. Now, I’m not arguing that they should be merged, but if we were basing it on literary standards then there should be many more splits. And then for the templates/modules, yes I don’t expect everything to be fixed and okay immediately, but in the same vein, this is why I strongly believe that we should have stronger foundations for building up minority languages in general. It’s difficult regardless of the situation, and there could be a lot more support. AG202 (talk) 02:34, 24 September 2023 (UTC)
ISO has 44 different codes for the Quechuan langs. CitationsFreak (talk) 22:35, 23 September 2023 (UTC)
Rather than an a priori thought exercise in merging ISO language codes, I propose an experiment:
  • pick N Quechua lemmas at random-ish that are already on Wiktionary. I think N between 50 and 100 should do the trick.
    • another alternative - a Quechua Swadesh list
  • see which Quechua/Quichua varieties (per ISO) they correspond to. This is not perfect, since I assume some lemmas are part of the artificial unified standard, but unless someone knows variant spellings, it's what we've got.
It could turn out that only a handful of the 44 codes are needed in the immediate present, so we add those first, and add more if/when there are both 1) editors well-versed enough and 2) reference sources or quotable text.
Chernorizets (talk) 23:52, 23 September 2023 (UTC)
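The experiment proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the lemma list and the lemma-to-ISO-code mapping are hypothetical placeholders that someone would have to fill in by hand from reference sources, the function names are invented for this sketch, and no real Wiktionary data or API is used.

```python
import random
from collections import Counter


def sample_lemmas(lemmas, n, seed=None):
    """Pick up to n lemmas at random, without repetition."""
    rng = random.Random(seed)
    return rng.sample(lemmas, min(n, len(lemmas)))


def tally_varieties(sample, variety_of):
    """Count how many sampled lemmas were attributed to each variety code.

    variety_of maps a lemma to a list of ISO codes (hand-compiled, hypothetical).
    Lemmas missing from the mapping are counted under "unknown".
    """
    counts = Counter()
    for lemma in sample:
        for code in variety_of.get(lemma, ["unknown"]):
            counts[code] += 1
    return counts


def codes_needed_now(counts, min_lemmas=1):
    """Codes attested by at least min_lemmas sampled lemmas, sorted by code."""
    return sorted(
        code
        for code, count in counts.items()
        if code != "unknown" and count >= min_lemmas
    )


if __name__ == "__main__":
    # Placeholder data only: these lemma-to-code assignments are made up.
    lemmas = ["lemma_a", "lemma_b", "lemma_c", "lemma_d"]
    variety_of = {
        "lemma_a": ["qvc"],
        "lemma_b": ["qvc", "quf"],
        "lemma_c": ["quf"],
    }
    sample = sample_lemmas(lemmas, 50, seed=0)
    counts = tally_varieties(sample, variety_of)
    print(counts)
    print(codes_needed_now(counts, min_lemmas=2))
```

The "unknown" bucket matters here: a large share of unattributable lemmas would suggest that the sample mostly reflects the unified standard rather than any single ISO-coded variety, which is the caveat already noted in the proposal.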
Re "well Ethnologue gave it 44 codes", I just want(ed) to point out (although after E/Cs and IRL interruptions I see part of this has already been pointed out above) that Ethnologue also acknowledges that many of the Quechua dialects they assigned codes to are mutually intelligible with 94%+ similarity (e.g. qvc, quf). More to the point, Ethnologue grants things codes differently than we do because their goals are different than ours; Ethnologue's (SIL's) goal is publishing Bible translations in various individual dialects, to best convert people, while Wiktionary's goal is of course, as we all know, "to describe all words of all languages" all at once in one dictionary. We've known that our differing goals meant differing practices since Wiktionary's early days, the history of WT:LT is the history of cases where we note that they give e.g. several sub-dialects of the same dialect of the same language several codes, or (in multiple cases) give the same language two codes (e.g. et, ekk) whereas we don't. Re "unless we can show that they are actually treated as a single language by their users": oh, you should've said you wanted to go with speakers' view! The literature notes that speakers often consider there to be one language, which they consider themselves to speak (often substandard) dialects of — so do you want just one code? Personally, I'm more inclined to agree with scholars of Quechua that there are several major languages here which are not mutually intelligible, and should have more than one code... I'm just also inclined to agree with scholars on how many there are. A few people here seem to disagree with both speakers and scholars. - -sche (discuss) 02:32, 24 September 2023 (UTC)
@-sche the separation you've proposed seems in line with what's given in Wikipedia and other sources I have handy; I'd only be curious if Central needs to be broken down a bit.
This is how the 43 codes (the 44th is actually retired) break down per Glottolog: https://glottolog.org/resource/languoid/id/quec1387. Central corresponds to 16 ISO codes.
Chernorizets (talk) 03:28, 24 September 2023 (UTC)
@Sumiaz, Huhsunqu, Linguist97 if any of you are still checking in, what are your thoughts on how Quechua should be split? Chernorizets (talk) 02:46, 24 September 2023 (UTC)
This is what I lead with on my user page: "For the purposes of this page and the entries I am working on, "Quechua" primarily refers to Southern Quechua. There is substantial overlap between varieties, but pronunciation, meanings, and grammar might differ to the point of making them mutually unintelligible. In a way, "Quechua" (or "Runasimi") as described by Wiktionary and its corresponding Wikipedia at this time is probably closer to a literary/internet standard or dachsprache than an accurate representation of any single variety of Quechua."
The attempts to consolidate the Quechuan languages into a standard form probably date from the beginnings of the Inca Empire to the early Spanish colonial era, when the Tupac Amaru rebellion led to official suppression of the language. In the early colonial days the Spanish found the language convenient to use as a lingua franca for administration and proselytizing, and so there is some amount of Quechua literature from this era. In some cases the writer may note the origin of the sample, but in many cases it is simply described as "Quechua". This seems to still persist in Peru where many learning materials and dictionaries are for "Quechua", although there are now more resources for specific varieties such as judicial manuals that have Cuzco and Chanca editions. If the colonial authorities had not suppressed the language and continued to use it across the Viceroyalty, perhaps a written standard may have developed over time that could be the basis for entry creation today.
As a non-native-speaker and frankly beginning-level learner, my goal in adding Quechuan entries to Wiktionary was to improve visibility of the language in the online space for native and non-native speakers alike. Despite some recent improvements by governments in Peru, Bolivia, and Ecuador to promote education and use of Quechua, its use in the media (newspapers, television, radio, and online) remains limited. I would think that the ideal resolution for this issue is to get input from native speakers as to how their languages should be grouped/split. Unfortunately it would seem that Quechua has not yet penetrated the internet space enough for there to be a vibrant community of native or non-native regular speakers to have this discussion.
From what I could find, it would seem that Quechuan speakers identify as speaking "Quechua" with the variety acting as a qualifier. Because the acceptance of Quechua in the wider public sphere is still limited, you see for example the Peruvian public broadcaster TV Peru having a single Quechua-language news program. Likewise, language-learning resources are often written for "Quechua". The variety of ISO codes may accurately reflect the unique variety spoken by a population, but has the effect of fragmenting the potential userbase for the language in this space. The number of contributors towards Quechuan languages here is already fairly small. Where the umbrella group of "Quechua" may encourage collaboration between users, having too many specific varieties separated could result in a case where a variety may be managed by one or few users that don't communicate with the users contributing to another. Similarly, a user searching for a Quechuan term may not find as much value in a case where the available corpus for each individual variety is limited.
Because native speakers self-identify with the broader "Quechua" family first and variety second; the discussion to separate languages has not received significant input from its native speakers by virtue of its as-yet limited accessibility online; and the Wiktionary community contributing to adding and maintaining Quechua entries is already small and splitting the languages would fragment this community and reduce collaboration, at this time I would vote to Oppose. Sumiaz (talk) 04:37, 24 September 2023 (UTC)
Oppose because I’m in favor of a single header for a dialect continuum regardless of mutual intelligibility. Given that we use {{dialect labels}}, macrolanguages are a non-issue. ·~ dictátor·mundꟾ 15:18, 30 September 2023 (UTC)
They're universally not considered a dialect continuum across all Quechuan languages. It'd be as if saying all Slavic languages are a dialect continuum. AG202 (talk) 15:52, 30 September 2023 (UTC)

I've just got around to publishing the draft for Wiktionary:About Icelandic, which had been in the request pile. It'd be good to get the feedback of any regular contributors to Icelandic, so if there's anything missing or misrepresented feel free to add it or let me know - I haven't as yet contributed much to the language on here and don't have much familiarity with the language-specific editing norms or templates. In particular, I couldn't find anything at all in the discussion pages about the cut-off date we use for when Old Norse ends and modern Icelandic begins, or whether in practice it's not much of an issue. Helrasincke (talk) 17:28, 1 September 2023 (UTC)

@Helrasincke Hi, I'm not an Icelandic editor, but about the content of that page: there's this here sentence, "Following is a simplified entry for the German word orðabók (“dictionary”). It shows the fundamental elements of an Icelandic entry:", but should it say "Icelandic" instead of "German"? Did you perhaps adapt this from About German? Anyway, nice work! Kiril kovachev (talkcontribs) 00:38, 2 September 2023 (UTC)
@Kiril kovachev Whoops, well spotted! Helrasincke (talk) 15:54, 3 September 2023 (UTC)
@Helrasincke No problem! And also, I apologize for splitting hairs, but you may also want to check the "Spelling" section:
there looks to be a sentence starting, "Letters such as"... that goes unfinished. I guess there was meant to be a passage about symbols that changed in their usage in some way after that. Otherwise, looks good :) Kiril kovachev (talkcontribs) 18:43, 3 September 2023 (UTC)

Dingal language add request

Should the language be added? कालमैत्री (talk) 20:18, 1 September 2023 (UTC)

Or should it be treated under the Rajasthani language already on Wiktionary? कालमैत्री (talk) 20:27, 1 September 2023 (UTC)
I know nothing about Dingal, but Wikipedia suggests it's ancestral to both Rajasthani and Gujarati, which as far as I'm concerned is reason enough for it to be a separate language with a code of its own. —Mahāgaja · talk 20:45, 1 September 2023 (UTC)
How do I add the language to Wiktionary? कालमैत्री (talk) 13:06, 2 September 2023 (UTC)
The Wikipedia article shows signs of having been puffed up by editors who may or may not know what they're talking about, an issue many articles on Indian topics suffer from, so I would feel better if I could find information about the language in other sources. So far I haven't been able to find much. Glottolog doesn't seem to have it. Searching Google Books turns up little. There is a mention in an essay in Language Versus Dialect: Linguistic and Literary Essays on Hindi, Tamil, and Sarnami (ed. by Mariola Offredi, 1990), page 68, which says "The Caran were numerous above all in Marvar, whose regional language (Marvari) was later known as Dingal.6 The Dingal language entered the court thanks to the Caran and became the standard literary language in the vast Marvar region, ..." where the footnote 6 (on page 88) is "There is much discussion about the meaning of the word 'dingal'. This has been used since the nineteenth century, with reference to the literature in western Rājasthānī, known also as Marubhāṣā and Mārvārī. For the various interpretations, see MOTILAL MENARIYA 1949, 15-24. It is, however, futile to go into this discussion, since scholars have not yet come to any definite conclusions." Rajendra Kumar Dave, Society and Culture of Marwar (1992), page 103, says "Dingal—The literary form of Marwari was called Dingal. The word 'Dingal' was used for the first time by Kushallabh in his work Pingal Siromani composed in V.S. 1607-18. The word has been defined in various ways by scholars. Tessitori calls it a language of rustics and a language without grammar." (That seems harsh, considering other works call it the language of poetry.) I can't find anything on how intelligible or not it is with modern Marwari or other forms of Rajasthani. - -sche (discuss) 18:34, 3 September 2023 (UTC)
"Effect of Prakrit on Dingal literature" and "Dingal literature", both in Hindi, might be of importance. Other such works exist, but all in Hindi. There are also Dingal words in a Hindi dictionary here. कालमैत्री (talk) 11:02, 4 September 2023 (UTC)
McGregor says: "an archaising form of early Mārvāṛī language, as used in Rājasthānī bardic poetry". कालमैत्री (talk) 11:06, 4 September 2023 (UTC)

Inclusion policy regarding given names?

I noticed we have no concrete policies regarding names (not individuals but given, middle and surnames), such as how many people must bear a name in order for it to qualify for inclusion and such. I am asking because I discovered we have very few Afrikaans-language names and so I wanted to add some. However, some of the names I had in mind—those I know from my family tree—appear to be fairly rare, several yielding less than 10,000 or in extreme cases less than 1,000 or even 100 results on FamilySearch's vital records (including duplicate records for the same person). I am still a fairly new editor here and so I obviously do not wish to accidentally create numerous unnecessary or non-notable entries as it could be annoying to clean up. I recently created Sarel which has 2,555 results on said genealogical website for South African (1600–present) records, but others like the surname Heystek / Heijstek give me less than 900 search results and I wonder if there should be a limit to what can be added. I notice we have hundreds of entries like Odajyan that just say "According to data collected by Forebears in 2014, Odajyan is the 1988048th most common surname in the United States, belonging to 1 individuals", but IMO this is not good practice. I would love to hear your opinions. Kindest regards, LunaEatsTuna (talk) 22:39, 1 September 2023 (UTC)

@LunaEatsTuna This is a good discussion we need to have. Really rare names shouldn't be present, e.g. Odajyan should maybe be given as an Armenian last name but that's all. Otherwise we'll get a Cambrian explosion of useless entries. Benwing2 (talk) 22:49, 1 September 2023 (UTC)
My thoughts exactly. Wiktionary is not a database of surnames, after all. LunaEatsTuna (talk) 23:15, 1 September 2023 (UTC)
I see it has the boilerplate statistic line with "Odajyan is the 1988048th most common surname in the United States, belonging to 1 individuals" (sic). The ordinal seems practically meaningless if only one person has the surname. —Al-Muqanna المقنع (talk) 23:35, 1 September 2023 (UTC)
I also find that it puts an utterly disproportionate focus on a statistic that’s essentially random noise at that point, too. I really don’t know why we need it. Theknightwho (talk) 21:35, 2 September 2023 (UTC)
This has been discussed before; even the person who added them doesn't care for them much, but their reasoning can be summarized as "it's better than nothing". Chuck Entz (talk) 22:42, 2 September 2023 (UTC)
We've had a couple of RFVs for surnames lately, including Klingon (which surprised me by passing), Nazndah (which failed), and Mozela (which didn't really complete). The standard for both given names and surnames that we followed in those RFVs is to treat them like ordinary words, meaning that they need three cites. In this case, a document with a list of names can be a cite, but it must be in the language we're looking for.
If Odajyan fails RFV, could we continue to list it as a transcription of the Armenian, or do we only write Armenian words in the Armenian alphabet? Thanks, Soap 22:50, 1 September 2023 (UTC)
About this name specifically ... I see now that it's a spelling variant of Odadjian, which would be trivially easy to cite, as it is the surname of a musician. The -ian spelling of Armenian names in general is the traditional one, at least in the United States, having been overtaken in recent years by -yan perhaps because it's more true to the Armenian pronunciation. For this name, and perhaps others, we have the -yan spelling as the standard and -ian as a variant. If we can cite Odadjian, does that mean Odajyan also passes? I'm guessing not, because even though the Armenian name is the same, we're in some sense creating a new name by Romanizing it. Soap 23:01, 1 September 2023 (UTC)
I think surnames being treated as regular words is a good idea. Also, would I be correct in assuming that vital records like birth certificates would not count as ordinary citations towards the inclusion of names? I do not recall reading about whether such citations were even allowed on Wikt, and I would agree if they do not count for inclusion, but I am just asking for clarification. (If they were allowed, this would essentially make the de facto policy for inclusion that at minimum three people bear a name.) LunaEatsTuna (talk) 23:06, 1 September 2023 (UTC)
As far as Armenian surnames are concerned, this is what is going on. According to Armenian law, all passports record the owner's first name and surname in Armenian and in an English transcription. There is a strict scheme for automatically replacing each Armenian letter with an English letter or digraph, without regard to the actual pronunciation of the Armenian surname or the resultant English transcription. Օդաջյան (Ōdaǰyan) becomes Odajyan, Քարտաշյան (Kʻartašyan) becomes Kartashyan, Սարգսյան (Sargsyan) becomes Sargsyan, Պետրոսյան (Petrosyan) becomes Petrosyan, no variants are possible. If a person from the current iteration of Armenia (AD 1991–) emigrates to England or an English colony or becomes famous in English-language media, he will be recorded under the legally transcribed English name. This is what Forebears did for the one recent US citizen Odajyan.
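To illustrate what a fixed letter-for-letter scheme of this kind looks like, here is a minimal Python sketch. It is not the official passport table: the mapping is only a partial one, inferred from the four examples in this post, and the names `PARTIAL_TABLE` and `transcribe` are made up for illustration.

```python
# Partial Armenian-to-Latin table, inferred ONLY from the four examples above.
# This is an assumption for illustration, not the official passport scheme.
_RAW_TABLE = {
    "ա": "a", "գ": "g", "դ": "d", "ե": "e", "յ": "y",
    "ն": "n", "շ": "sh", "ո": "o", "ջ": "j", "ս": "s",
    "տ": "t", "ր": "r", "Օ": "o", "Ք": "k", "Պ": "p",
}
# Normalise the keys to lowercase so the table can list either case.
PARTIAL_TABLE = {key.lower(): value for key, value in _RAW_TABLE.items()}


def transcribe(name: str) -> str:
    """Replace each Armenian letter mechanically, ignoring pronunciation."""
    pieces = []
    for ch in name:
        # Raises KeyError for any letter outside the partial table.
        latin = PARTIAL_TABLE[ch.lower()]
        # An uppercase source letter gives a capitalised letter or digraph.
        pieces.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(pieces)


for surname in ("Օդաջյան", "Քարտաշյան", "Սարգսյան", "Պետրոսյան"):
    print(surname, "->", transcribe(surname))
```

Because every letter has exactly one output, the same Armenian spelling always yields the same Latin spelling, which is why no variants arise under this scheme.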
The situation with the older diaspora is different. They are not bound or influenced by the Republic's transcription rules. They usually adapt their Armenian name to the local language according to their taste and with more regard to pronunciation and euphony. I can sympathize as foreigners distort my Petrosyan to things like /petroʃan/, /petroʒan/, /petrozian/. It is better to adapt it as Petrossian in English and French lands to approximate the correct /petrosjan/. Odadjian and Odajian are adapted versions of Odajyan. Many adaptation variants are possible, look at the forms of Hakobyan. Sometimes adaptation goes so far that I can't even figure out the native form: compare Bilzerian.
Since the local passport transcription system is predictable and fixed, I had chosen its form as the main one (Odajyan) and listed the old diasporan adaptations as variant forms (Odadjian, Odajian). Admittedly, the old diasporan spellings are easier to attest in English because their bearers had a better chance to be recorded in English.
Because Wiktionary's policy on names is undeveloped, I do not create foreign entries for Armenian surnames anymore. Instead, I list the passport transcription and all diasporan spellings I can find in the Descendants section of the Armenian entry as in Համամչյան (Hamamčʻyan). Vahag (talk) 10:33, 3 September 2023 (UTC)

Ban cross-family comparisons from EDAL

Self-explanatory. Any cross-family comparison sourceable only to EDAL or affiliated sources should be banned or at the very least worded in such a way that makes it clear that the "Altaic" family is a pseudolinguistic fringe theory. — SURJECTION / T / C / L / 15:21, 2 September 2023 (UTC)

Support. We need to remove macro-level Altaic comparisons. I wouldn't be surprised if some lower-level connections are established but that's far outside the standard linguistic view of things and we don't need to be hosting fringe. Fringe is cringe bro. Vininn126 (talk) 15:25, 2 September 2023 (UTC)
I wouldn't be opposed to this, either. It often feels like many of our active Proto-Turkic editors are sneakily adding in Altaic comparisons and references (including lists of comparanda of supposed regular sound correspondences!) whenever they think they can get away with it. — SURJECTION / T / C / L / 15:45, 2 September 2023 (UTC)
Let me clarify what I mean: perhaps in the future smaller connections between languages in this area will become more established within the linguistic mainstream, but until that time we shouldn't host it. Vininn126 (talk) 15:49, 2 September 2023 (UTC)
Support Over at the Proto-Turkic page we've already been slowly phasing out Altaic reconstructions and comparisons are made with Mongolic if they cannot be explained through conventional borrowing. Yorınçga573 (talk) 15:41, 2 September 2023 (UTC)
Support. AG202 (talk) 05:46, 3 September 2023 (UTC)
Support BurakD53 (talk) 13:01, 3 September 2023 (UTC)
Support ~ Blansheflur 。・:*:・゚❀,。 20:53, 4 September 2023 (UTC)
Provisional Support -- @Surjection, could you expand on what "EDAL" is? From context and other threads, I think this is Starling, but I'm not sure. ‑‑ Eiríkr Útlendi │Tala við mig 04:23, 5 September 2023 (UTC)
The Etymological Dictionary of the Altaic Languages which Starling proudly and prominently features. — SURJECTION / T / C / L / 04:30, 5 September 2023 (UTC)
Gotcha, thank you! In that case, most definitely Support. I did a brief survey of Dolgopolsky's work, on which Starling's Japonic entries appear to be based, and found an alarmingly bad failure rate. Discussed some in this old thread: Thread:User_talk:Rua/Unexplained_deletions:_continuing_what_appears_to_be_a_common_theme/reply_(5).
Yes, please rip out EDAL by the roots. If and when something more rigorous replaces it, perhaps we can use that future work as reference, but for now, please deep six anything relying on EDAL. ‑‑ Eiríkr Útlendi │Tala við mig 05:12, 5 September 2023 (UTC)
Oppose 'Altaic' is apparently wrong rather than a pseudolinguistic fringe theory. --RichardW57m (talk) 14:15, 5 September 2023 (UTC)
How does being wrong vs being pseudolinguistic change anything about if we should include it? CitationsFreak (talk) 05:54, 6 September 2023 (UTC)
@CitationsFreak: The proposal is that any use of the EDAL for inter-family comparisons, or where it is the only traceable publication of an idea or suggestion, shall be accompanied by a denigration of Altaic as 'pseudolinguistic'. Not even a mere statement of disbelief in Altaic will suffice. --RichardW57 (talk) 13:35, 9 September 2023 (UTC)
The idea is that if we include such a disclaimer ("Has been compared to X, but falls under the Altaic theory which has been disregarded by most linguists" or something to that effect), we will thereby make sure no new editor will feel the need to add these comparisons. We often add theories that have been deemed unlikely; that makes our etymological coverage more complete, and it doesn't necessarily do any harm. Thadh (talk) 11:39, 22 September 2023 (UTC)

Bulgarian name dictionary reference template

Hello to all Bulgarian editors, I don't know if this resource has been used before, but today I found a dictionary that documents Bulgarian personal names, which can be viewed online, just like the etymological dictionary. I've written a reference template at {{R:bg:LIFUB}}; the syntax is {{R:bg:LIFUB|page_number|entry_name}}, where the entry name can be omitted if it's the same as the page title, which it is nine times out of ten. @SimonWikt @Chernorizets @Bezimenen. This is a good help in adding accents to names with unclear stress, as well as expanding small entries: check out Апостолов, for example. Hope this helps! Kiril kovachev (talkcontribs) 18:17, 2 September 2023 (UTC)

In general, you don't need a BP thread for this. Perhaps alerting other editors on a talk page for the template or something similar will suffice. Vininn126 (talk) 18:20, 2 September 2023 (UTC)
@Kiril kovachev: This is great. I think we need pages documenting these resources, although Category:Bulgarian reference templates is a good start. As for where to post this, maybe WT:About Bulgarian and pinging the relevant users? (Although I wouldn't have seen this as you didn't ping me.) Benwing2 (talk) 20:11, 2 September 2023 (UTC)
@Vininn126 Yes, that's fair enough; my apologies. (I partly posted it here because I don't know for sure who else might be a Bulgarian editor, or be interested in the template anyway.) @Benwing2 Sorry for not @ing you, I wasn't sure whether you still edit Bulgarian these days and I didn't want to spam you with it in that case. So, fortunate that you check this place often enough to see. :)
I may well update WT:About Bulgarian with some information about our templates for this.
Thanks, Kiril kovachev (talkcontribs) 20:23, 2 September 2023 (UTC)
Don't apologize! I'm just informing. Vininn126 (talk) 20:24, 2 September 2023 (UTC)
Thanks for the heads-up! Kiril kovachev (talkcontribs) 22:29, 2 September 2023 (UTC)
@Kiril kovachev very cool! Chernorizets (talk) 21:41, 2 September 2023 (UTC)

User:Dragonoid76 requested equivalents of {{PIE root see}} for Proto-Indo-Iranian, Proto-Indo-Aryan and Sanskrit. I realize that there isn't a proper template currently for this. It should be {{rootsee}} but (a) that template doesn't quite do it, (b) it is a total mess. I am going to redo {{rootsee}} to work similarly to {{root}}:

|1=
Destination language of category Category:Destination terms derived from the Source root *root-. If left out or set to the value +, you get the umbrella category Category:Terms derived from the Source root *root-.
|2=
Source language of category Category:Destination terms derived from the Source root *root-. If left out or set to the value +, or equal to the destination language, you instead get Category:Destination terms belonging to the root *root-. However, if both source and destination language are left out or set to +, and the current page is in the Reconstruction namespace, the source language is inferred from the pagename and you get Category:Terms derived from the Source root *root- (otherwise you get an error). If the destination language is a family code and not a valid language code, the family code is converted to the corresponding proto-language. This means you can write ine for Proto-Indo-European, iir for Proto-Indo-Iranian, inc for Proto-Indo-Aryan, etc.
|3=
Root. If left out or set to the value +, it is taken from the subpage name (i.e. after a slash in the case of Reconstruction namespace items). If the source language is reconstruction-only, you can leave out the initial *. In addition, a hyphen may be added according to the following algorithm:
  1. If there is a space or hyphen in the root already, no hyphen is added.
  2. If the root is in a non-Latin script, no hyphen is added.
  3. Otherwise if the source language is Navajo, a hyphen is added onto the beginning, otherwise onto the end.
|id=
Sense ID of the root; needed especially for Navajo.
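The hyphen rules listed under |3= can be sketched in Python as follows. This is only an illustration of the three steps as described, not the code of the actual module: the function names, the `"nv"` code used for Navajo, and the way Latin script is detected (via Unicode character names) are all assumptions made for the example.

```python
import unicodedata

NAVAJO = "nv"  # assumed language code for Navajo


def is_non_latin(root: str) -> bool:
    """Rough script check: any base letter whose Unicode name lacks 'LATIN'."""
    for ch in root:
        category = unicodedata.category(ch)
        # Only look at letters; skip the asterisk, combining marks, and
        # modifier letters such as the superscript w in *gʷem.
        if category.startswith("L") and category != "Lm":
            if "LATIN" not in unicodedata.name(ch, ""):
                return True
    return False


def add_hyphen(root: str, source_lang: str) -> str:
    # Step 1: a space or hyphen already present means nothing is added.
    if " " in root or "-" in root:
        return root
    # Step 2: roots in a non-Latin script are left alone.
    if is_non_latin(root):
        return root
    # Step 3: Navajo gets a leading hyphen, everything else a trailing one.
    if source_lang == NAVAJO:
        return "-" + root
    return root + "-"


print(add_hyphen("*gʷem", "ine-pro"))  # trailing hyphen added
print(add_hyphen("भृ", "sa"))           # Devanagari, left unchanged
```

The language codes passed in the usage lines are placeholders; the function only distinguishes Navajo from everything else.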

This means, for example, that you can write {{rootsee}} by itself on a reconstructed root page and get Category:Terms derived from the Source root *root- automatically. This should make {{PIE root see}} totally unnecessary. Current uses of {{rootsee}} that default to PIE will have to be changed to add ine as the second argument, so that e.g. {{rootsee|en|*gʷem}}, which currently gets you Category:English terms derived from the Proto-Indo-European root *gʷem-, will change to {{rootsee|en|ine|*gʷem}}. Benwing2 (talk) 23:16, 2 September 2023 (UTC)
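For anyone trying to follow the defaulting rules for |1= and |2=, here is one way to read them, written as a small Python sketch. It is an interpretation of the description above, not the module itself: the function name `rootsee_category` and the `page_source` parameter (standing in for "source language inferred from the pagename") are invented for the example, language names are passed directly instead of codes, and the family-code conversion is left out.

```python
from typing import Optional


def _given(value: Optional[str]) -> bool:
    """A parameter counts as omitted if it is missing, empty, or '+'."""
    return value not in (None, "", "+")


def rootsee_category(dest: Optional[str], source: Optional[str], root: str,
                     page_source: Optional[str] = None) -> str:
    """Pick the category name from destination, source, and root.

    page_source is the source language inferred from a Reconstruction
    page name, or None when the current page allows no such inference.
    """
    if not _given(dest) and not _given(source):
        # Both omitted: only valid when the source can be inferred.
        if page_source is None:
            raise ValueError("unable to infer source language from pagename")
        return f"Category:Terms derived from the {page_source} root {root}"
    if not _given(dest):
        # Destination omitted: umbrella category for the source root.
        return f"Category:Terms derived from the {source} root {root}"
    if not _given(source) or source == dest:
        # Source omitted or same as destination: language-internal root.
        return f"Category:{dest} terms belonging to the root {root}"
    return f"Category:{dest} terms derived from the {source} root {root}"


print(rootsee_category("English", "Proto-Indo-European", "*gʷem-"))
```

Under this reading, a bare call on a non-reconstruction page fails, which is the situation raised in the replies below.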

I have written the module underlying this, see Module:User:Benwing2/rootsee and User:Benwing2/test-rootsee, as well as the bot script to convert existing uses of {{rootsee}} and {{PIE root see}}. If no one objects, I will do the conversion in the next couple of days. Benwing2 (talk) 02:45, 3 September 2023 (UTC)
Thanks, this looks good. It was always a bit odd seeing a template with a generic name like rootsee being specifically bound to descendants of PIE. Soap 15:38, 4 September 2023 (UTC)
Looks good! Thanks! Dragonoid76 (talk) 21:56, 4 September 2023 (UTC)
Just for ease, could you make sure {{User:Benwing2/rootsee|pagename=भृ}} works like {{User:Benwing2/rootsee|+|sa|भृ}}. Right now, I'm getting "Unable to infer source from pagename 'भृ' as it isn't a Reconstruction or Appendix page", since it's not a reconstruction page. Dragonoid76 (talk) 22:13, 4 September 2023 (UTC)
We just got rid of the assumption that any root without a language code is Proto-Indo-European, now we're adding back special cases. How is the module supposed to know that the page in question is Sanskrit? What knowledge do we have to give it to be able to tell? It would have to be something that would hold true for the foreseeable future, no matter what happens to the type of page in question. Can we guarantee that no Sanskrit root will ever share a page with a root for any other language? I'm not trying to shoot down your idea; I just want to make sure someone thinks about this kind of thing. Chuck Entz (talk) 22:43, 4 September 2023 (UTC)
@Chuck Entz Good point. In the case of words written in the Devanagari script, I can't think of a case where the root wouldn't be Sanskrit—but it's probably just better design to use the template like {{rootsee|+|sa|भृ}}. Dragonoid76 (talk) 22:52, 4 September 2023 (UTC)
@Dragonoid76: Devanagari-script Mundari verb roots look like a distinct possibility. Mundari is too embryonic on Wiktionary to know that they will be used, but they certainly look like a potential feature, and Mundari can form derived nouns from verbs by infixing a nasal. (We currently have no Mundari verbs.) It's not impossible that editors new to Pali might try to use a Pali root in Devanagari - we already have a few Pali root names in Devanagari, such as गह (gaha, gah), which are legitimate alternative script forms for the Pali name of the root. --RichardW57m (talk) 09:18, 18 September 2023 (UTC)

lemmas

Hi, how can I find which languages in Wiktionary have the most lemmas? 31.7.113.40 16:57, 3 September 2023 (UTC)

There is a sortable list at Wiktionary:Statistics. Einstein2 (talk) 23:22, 3 September 2023 (UTC)
Remember the phrase 'lies, damned lies and statistics'. In languages whose users command deep, morphologically marked derivations, synchronically derived terms may be marked as lemmas. --RichardW57m (talk) 08:26, 4 September 2023 (UTC)

Listing taxonomical names in Derived terms sections

I'd like to find out if there is a policy about how much detail should go into listing taxonomical names in Derived terms sections. See rigó. Currently the Hungarian name is followed by the English translation and the taxonomical name. I wonder if this is all considered useful by other editors. Should I just list the Hungarian names? Panda10 (talk) 17:38, 3 September 2023 (UTC)

I don't think there is a policy. I think they are useful to indicate which taxon is indicated by the vernacular name and whether multiple taxa are indicated. Taking English as an example, many really common one-word English vernacular names cover multiple species and multiple higher-level taxa, sometimes even kingdoms. It doesn't do a user much good to have to go on a merry chase through other references to disambiguate the term. OTOH, it can be time-consuming for a contributor to do so. If we save two users from such a merry chase at the cost to one of us doing it once, there is a net social gain. DCDuring (talk) 19:03, 3 September 2023 (UTC)
I agree. --RichardW57m (talk) 11:52, 5 September 2023 (UTC)
What DCDuring says. I think an overview like the one at جَوْز (jawz) is more useful than the entry without the derived terms plus separate pages for the derived terms; the derived terms are also necessary to show that the word has a broader meaning in its derivatives, corresponding to “nut” in English rather than just meaning “walnut”. It can of course become ridiculous if you have such lists at tree or اوت (ot); for such general terms I believe nobody wants to read lists of taxonomic names. Fay Freak (talk) 11:58, 5 September 2023 (UTC)
Probably not. A few examples, as at tree now, might be OK. I don't know what is best at terms like common, vulgar, or bastard or the terms for colors and body parts: Complete or illustrative listings of derived terms? DCDuring (talk) 15:05, 5 September 2023 (UTC)

On the house entry, we have an extremely large collapsible listing all of the derived terms in alphabetical order. There are two phrases, on the house and the house always wins, that are specifically bound to sense 6, subsense 2, and make no sense in any other context. If there existed an inline version of the derived terms template, I would want to use that underneath s6:2 so that readers would know that it is specifically tied to this narrow definition. But so far as I know there is no such template, and if there were one, we would probably discourage widespread use, since it would take up space and merely duplicate terms that already appear below. We have collocations, but my understanding is that they are typically used for phrases which do not have their own entries and therefore cannot be links either. So perhaps the best way to help the reader is to use wikilinks within the use-examples we currently give. This is currently forbidden by our Manual of Style page, which specifically says that use-examples must

not contain wikilinks (the words should be easy enough to understand without additional lookup).

However, as I read it, this is intended to discourage introducing difficult, unrelated words into use-examples which would need wikilinks in order to be understood. That is, if my word were house, I would not do well to add a use-example such as

Next to the galamander was a small grey house.

where the unrelated word galamander both distracts the reader and tells them nothing about houses. By contrast, linking to the expressions on the house and the house always wins underneath the one specific sense of house that they are bound to seems like the best solution for this rare situation.

Ideally, if we can agree this exception to the policy is valid, I would like to see a small change to the Manual of Style to reflect this, rather than just tolerating a few exceptions here and there, so that this won't be a source of conflict in the future.

Best regards, Soap 12:54, 4 September 2023 (UTC)

Good luck with the wording. DCDuring (talk) 16:47, 4 September 2023 (UTC)
Where does it say this? I can't find the word 'wikilink' in WT:STYLE.
Does it actually ban such links in quotations, or does it just ban them from usage examples? Banning them from usage examples makes sense. --RichardW57m (talk) 09:05, 5 September 2023 (UTC)
It isn't in WT:STYLE, but in Wiktionary:Example_sentences#Official_policy. I understand why wikilinks would be banned from examples of English usage, as the English-language Wiktionary addresses English speakers, who may be expected to have familiarity with all the words in the example other than the target/feature word. But I have been adding wikilinks quite liberally to non-English usage examples (in complete ignorance of the above prohibition, which I have only just learned of, and with nobody raising any objection), because it strikes me as helpful to language learners and not detrimentally distracting. The example's accompanying translation tells the reader what the sentence means as a whole, but may not be sufficient to clarify how the sentence means what it means, even when it is quite short and rudimentary in its composition. The reader may also want to go on a voyage of discovery in an unfamiliar language via wikilinks. This is where I believe it to be helpful to provide links to some of the constituent words, taking advantage of one of the key benefits of a wiki, namely the links. Voltaigne (talk) 11:35, 5 September 2023 (UTC)
@Soap: Ah, you're quoting WT:EL#Example sentences or its expansion page, which don't apply to quotations.
I think what you should do is to elaborate the headword line from {{en-PP}} to
{{en-PP|head=on the {{l|en|house|id=s6v2}}}}
on the house
where s6v2 is the {{senseid}} for the sense you want to link to. Obviously you should choose a better name for the senseid.
@Theknightwho: Can you please advise on how to remove the use of {{l}}? Or is the weak requirement in {{head}} not to use {{link}} to be ignored in this case? --RichardW57m (talk) 11:43, 5 September 2023 (UTC)
{{en-PP|head=on the ]}}
on the house
Same result without {{l}}.
Voltaigne (talk) 12:46, 5 September 2023 (UTC)
Thank you both, but I don't think that links to subsenses are meant to be put in the header, as they would appear the same as normal links, and I see no reason a user would think to click the link in just this one particular case since they are presumably already familiar with the generic sense of the word house. We could add etymology sections and mention the derivation there instead. But that doesn't address what I came here asking about.
I want to put links on the house page, in the use-examples, where a link containing the entry word would stand out and alert the reader to the existence of the common set phrases on the house and the house always wins. If we cannot do this, the only indication of these phrases is in the derived terms section, which is (by prior consensus) sorted alphabetically rather than by sense, therefore mixing the syntactically bound terms in with hundreds of others.
Again we could consider this a proposal for an exception to the rule about not putting links in use-examples, but the way I see it, as above, the rule exists to prevent users from writing sentences with irrelevant and unhelpful words, distracting the reader, because a good use-example will focus on the entry word. I think a good way to do this would be to work the common expressions on the house and the house always wins into the use-examples under sense 6, subsense 2 of house, and that this is the most likely place the user will be looking for them. Soap 12:58, 5 September 2023 (UTC)
The value of a sense link in the inflection line is that a user thinking that the link will require a search of the entire English house L2 will be delighted to be taken to a specific, relevant sense. If we do this often enough users will become hopeful that they can be led to the correct sense and therefore may click through more often. DCDuring (talk) 13:38, 5 September 2023 (UTC)
I don't have a problem with modifying the inflection line to point to a particular sense of a constituent word, but I still say, as above, that few users are likely to click on it, as there is no visual cue suggesting that anything unusual is there, except perhaps the lack of links for the other words in the header. We rejected tooltips for the same reason.
Whether we link headers or not still has nothing to do with my original proposal. So far, nobody has agreed with me, so I'll just point out that this rule we have on the Manual of Style seems at least not to be enforced all too strictly, as we have a link to badass in a use-example on the marshmallow page, and on the gay page we have a linked collocation, gay marriage. I think the link to gay marriage is a good thing because it's specifically bound to this one sense. Perhaps we could reword the use-example on marshmallow to use a more familiar word than badass (again keeping in mind that at least some of our readers are just learning the language), but it at least has the benefit that it's not totally irrelevant, like my galamander example.
Again, the wording of the rule in the Manual of Style suggests to me that its purpose is to keep the use-examples close to the meaning of the word being defined. If we should decide to interpret that rule strictly but allow linking of collocations, as per the gay marriage example, I would be okay with that, but I'd also say that that effectively transforms collocations into an inline version of the derived terms template. If we go that route, why not make it official and create an actual inline derived terms template? Best regards, Soap 10:02, 7 September 2023 (UTC)

Do we need an inflection line when we have a conjugation box?

In a Beer Parlour discussion, it was pointed out that English doesn't use conjugation boxes that much, instead using the inflection line. However, we do use conjugation boxes for certain verbs, mostly those with archaic endings, like run. In these cases, do we really need the inflection line? The conjugation box provides so much more information in this case, and the same type as the inflection line. Why repeat ourselves? CitationsFreak (talk) 17:34, 4 September 2023 (UTC)

I think we should keep the inflection line both for consistency's sake (people will be looking there) and because it's much more convenient. The run page is a particularly long one, with the conjugation box only at the very end, which on smaller devices might be ten screens from the top of the page. But even if we were to move the conjugation box up top, I still think the inflections should stay in the header, because it's where people are more likely to look for them based on the patterns set by other entries. Soap 18:02, 4 September 2023 (UTC)
I was thinking of having the inflection line read "see conjugation box" with a link to it, for convenience's sake. CitationsFreak (talk) 18:09, 4 September 2023 (UTC)
@Soap Made a little mockup of how I think it should look at User:CitationsFreak/conjugate. Lemme know what you think. — This unsigned comment was added by CitationsFreak (talkcontribs) at 18:32, 4 September 2023.
It's standard in languages that have both principal parts (or equivalent) and extended conjugations still to list the principal parts in the headword line: compare the formats for Latin amo, Spanish amar, Korean 없다 (eopda) and so forth. I believe that's the most user-friendly procedure in general and for English as well, so I would oppose removing all inflections from the head line. —Al-Muqanna المقنع (talk) 19:15, 4 September 2023 (UTC)
@CitationsFreak I agree with User:Al-Muqanna; we should keep the headword information. In general I really think we don't need conjugation tables for most English verbs; they just aren't that complex. Also the Wikicode of {{en-conj}} is an absolute disaster. Benwing2 (talk) 20:52, 4 September 2023 (UTC)
  • I also agree with Al-Muqanna, in general. But I think conjugation tables can be useful, for showing all the forms which are too archaic or dialectal to list on the headword line. For example, if the headword line lists not just the one usual past participle, but several rare obsolete ones, those I would be inclined to move out of the prominent headword line (and a conjugation table, with appropriate qualifiers, is a logical place to put them). - -sche (discuss) 20:57, 4 September 2023 (UTC)
    Yeah that makes sense. Benwing2 (talk) 00:10, 5 September 2023 (UTC)
    I agree. As has been pointed out, it's common to have both. The inflection line should give the most useful conjugations and the conjugation box should give a complete conjugation. It's useful to have both, even if there's a certain level of redundancy, especially since this is already our standard practice in several languages (see Portuguese or Spanish amar, for instance). Andrew Sheedy (talk) 14:37, 5 September 2023 (UTC)

Etymology and descendants of letters/scripts?

Currently most letter pages don't have an etymology or descendants section; there are a few pages with them, like most of Latin, Brahmi 𑀅, Aramaic 𐡀 etc. Shouldn't it be standardized and added to all letter pages? AleksiB 1945 (talk) 13:16, 5 September 2023 (UTC)

It appeals, but there are issues with the depths of some of these scripts, and it verges on the encyclopaedic. For example, for Tai Tham and Lao I want to reference the Fakkham script, which is currently unencoded. There are also issues with the development of the Thai script - how many stages are there between the Khmer script (if that truly be the ancestor) and the current script? There's nothing encoded yet, but we have concepts like the Sukhothai script and the King Lithai script. --RichardW57m (talk) 14:48, 5 September 2023 (UTC)
What should we do for the notion of inheritance? If a script is borrowed and ultimately transformed, I would say the characters thereby transferred were inherited, but we hit the technical problem that {{inh}} is set up for languages rather than sets of characters. On the other hand, for the Vietnamese letter 'a', we currently have
Borrowed from {{bor-lite|vi|fr|a}}
Borrowed from French a
which I suppose is only mostly wrong. (Portuguese, Italian and Latin would be the starting points for the system; I am assuming that 'French' just reflects ignorance.) --RichardW57m (talk) 08:18, 6 September 2023 (UTC)
Aramaic and Brahmic pages use photos of letters which aren't encoded in their descendants sections. That's too much, but either way we could show only the major ancestral scripts instead of all of them; a major script which isn't encoded is Pallava Grantha, though. AleksiB 1945 (talk) 09:40, 6 September 2023 (UTC)

US Census statistics as a template

Hello, partly in reference to the above discussion on the inclusion of names, would anyone support converting the surname Statistics (e.g. on Johnson) into a template? I would use a syntax like {{S:US Census|1=rank|2=number of bearers|race=|percentage=|race2=|percentage2=|...|alt=(alternative name other than the page title, unlikely to require use)}}. Whilst at it, we can also make a note of those that are highly underused, like Odajyan above, which has 1 holder in the US and no mentions on Google books. Indeed, it might just be better overall to remove statistics in cases where the name is literally in last place. What does everyone think? Kiril kovachev (talkcontribs) 20:18, 5 September 2023 (UTC)

@Kiril kovachev This sounds good to me. Benwing2 (talk) 20:41, 5 September 2023 (UTC)
+1. I agree the statistics serve little function for very small numbers, I would exclude them from names pertaining to single-digit numbers of people at least (and maybe just replace them with the (rare) label). —Al-Muqanna المقنع (talk) 11:09, 6 September 2023 (UTC)
Agreed. I use something similar for {{pl-freq 1990}} but for words, and I know Surjection has made a similar template for Finnish surnames. Furthermore, I think we might want to set "Statistics" as an official header at some point. Vininn126 (talk) 11:10, 6 September 2023 (UTC)
  • @Kiril kovachev Wait, I think we already have just such a template: {{surnames-us-census}}. It's unused, but I don't know if that's just because editors subst'ed it. — excarnateSojourner (talk · contrib) 18:42, 15 September 2023 (UTC)
    @ExcarnateSojourner Oh that's interesting, it might be good to build off of that then. I think this is slightly different to the current text we have lying around, e.g. the template uses "US", whereas the text on Johnson uses "United States"; the template doesn't specify 2010 as the census year, but the entries currently do, etc., so I'd think this template may have been innovated later but not come round to being used, but Idk.
    I think it could maybe use some refinements, such as converting the "rank" into an ordinal by default, rather than requiring the caller to input "3<sup>rd</sup>" every time, and also changes to bring it in line with the currently-widespread text (bridging the differences stated above basically). But nice find!
    Fortunately I haven't done anything towards this conversion yet, so thanks for catching this before I did. Kiril kovachev (talkcontribs) 18:50, 15 September 2023 (UTC)
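
    As a rough illustration of the ordinal refinement suggested above, here is a minimal Python sketch. It is only an assumption about how the conversion could work: a real Wiktionary template would be written in wikitext or Lua, and the function names `ordinal_suffix` and `rank_html` are hypothetical.

```python
def ordinal_suffix(n: int) -> str:
    """Return the English ordinal suffix for a positive integer."""
    if n < 1:
        raise ValueError("rank must be a positive integer")
    # 11, 12 and 13 are irregular (likewise 111, 212, ...).
    if n % 100 in (11, 12, 13):
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def rank_html(n: int) -> str:
    """Format a rank as e.g. 3<sup>rd</sup>, so the caller only passes 3."""
    return f"{n}<sup>{ordinal_suffix(n)}</sup>"
```

    With something like this, the caller would pass the bare number as the rank and the suffix markup would be generated for them.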

{{pa-Arab-translit}} does not comply with UR TR

I don't know what transliteration module Urdu is using now, but it's not compliant with Wiktionary:Urdu transliteration nor the way Urdu entries have been transliterated up until now. This means that all the transliterations that have been manually entered into entry headers up till now are different than the automatic transliterations.

I have changed Module:ur-translit (which btw is not the module Urdu is using) to match the traditional transliteration of Urdu entries and I think we should switch Urdu to that module. Especially because there has never been a discussion on changing Urdu transliteration. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 05:32, 6 September 2023 (UTC)

@Sameerhameedy There was no Urdu translit module set until a week ago (Aug 29), when I changed it to use the Panjabi translit module based on a request from User:نعم البدل. I have no issues with switching it to use Module:ur-translit, but maybe we should wait for that user to comment as to why they think it should use the Panjabi module. Benwing2 (talk) 18:49, 6 September 2023 (UTC)
@Benwing2 If نعم البدل starts a discussion about changing Urdu's transliteration policy and adopting Panjabi's transliteration policy, then I would have no issue. However, currently Urdu and Punjabi have very different transliteration policies. Urdu policy treats hamza as a zero-consonant, Punjabi doesn't. Urdu policy exclusively uses dots under letters for retroflexes, Panjabi has no such restriction. That is not to mention the different character mappings because of the fact that Urdu's policy only transliterates letters that have (official) pronunciations, whereas Punjabi transliterates all of them. @نعم البدل I don't care if Urdu adopts Punjabi's transliteration policy but please start a discussion with other editors about changing the policy first. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:05, 6 September 2023 (UTC)
@Sameerhameedy, Benwing2, Module:ur-translit is garbage and not based on any standard. Module:pa-Arab-translit is based on the ALA-LC Transliteration standard for Urdu, and subsequently for Punjabi Shahmukhi. And it's better to use one module for the multiple languages of Pakistan, since there's practically no difference in the transliteration standards for those languages. Module:pa-Arab-translit currently suffices for Urdu, Punjabi (Shahmukhi), Saraiki, Pothwari etc. I'm all ears for opinions, but there aren't even enough users to comment on the matter, or at least not many people seem to have enough of an opinion on the matter. نعم البدل (talk) 19:08, 6 September 2023 (UTC)
Yes but every Urdu entry up till now has used the previous standard. I don't really care what transliteration policy Urdu uses but please start a discussion about changing it if you don't like the current one. Changing the transliteration policy would mean fixing thousands of Urdu entries, which is a lot of work to put on other Urdu editors without asking them. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:10, 6 September 2023 (UTC)
@Sameerhameedy My mistake, you've been updating Module:ur-translit. Perhaps that calls for a discussion, apologies. نعم البدل (talk) 19:11, 6 September 2023 (UTC)
@Sameerhameedy Also WT:UR TR is based on the ALA-LC standard anyways, no? نعم البدل (talk) 19:12, 6 September 2023 (UTC)
No it's not, I'm getting confused with the Transliteration module. نعم البدل (talk) 19:13, 6 September 2023 (UTC)
@نعم البدل I don't know why you are starting an argument. I changed the module to match the Urdu transliteration policy, as Urdu has not had automatic transliteration until this week. Again, I don't care what transliteration Urdu uses, just please start a discussion about changing the policy if you don't like the current one. That's all I'm asking. If you start a discussion you can use whatever transliteration the other Urdu editors agree to, and I will fully support whatever transliteration that is. Just discuss it first. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:16, 6 September 2023 (UTC)
@Sameerhameedy I'm not starting an argument. If I'm coming across as passive aggressive, then apologies, but that wasn't my intention. I can continue this at Module talk:ur-translit and explain why I requested Module:pa-Arab-translit be used over Module:ur-translit. نعم البدل (talk) 19:24, 6 September 2023 (UTC)
I support transliteration of Urdu in the main space but I think some main ground rules still need to be established - what can and more importantly, what CANNOT be transliterated. Otherwise, transliterations will look like a bunch of consonants. Since vocalisation, especially full vocalisation is not a common practice yet, https://rekhtadictionary.com/ dictionary is inconsistent and is full of errors, we need to work out the rules ourselves.
  1. Common digraphs, which are represented by one letter in Hindi and the ways to spell them out. E.g. in بھائی (bhāī) بھ (bh) is an acceptable cluster, no vocalisation is required BUT it needs to be followed by a long or a short vowel, otherwise, the transliteration should fail. E.g. at گَھر (ghar) and چِڑْیا (ciṛyā) are good, گھر is also good because the module sees the ambiguity but what about چڑیا (cṛyā)? (incorrect vocalisation)
  2. Use of sukun to kill off vowels to avoid ambiguity in the middle of words.
Anatoli T. (обсудить/вклад) 02:10, 8 September 2023 (UTC)

Standard for Urdu romanization

In light of recent discussions, I invite the following users (as well as any other users who may have an opinion on this):

Hi all, apologies for pinging you all. The topic of the transliteration policy for Urdu has come up, and your opinions on this are much appreciated.

Recently, I requested Benwing2 to set Module:pa-Arab-translit as the transliteration module for the Urdu language (previously only Punjabi, Saraiki, Pothwari). It's based on the ALA-LC Romanisation standard, which represents, specifically, Urdu letters. The module is not perfect and needs to be fixed, but in my opinion it serves the Urdu language (and other Pakistani languages) the best. User:Sameerhameedy recently fixed, or sorted out, Module:ur-translit, which is based on the old Urdu/Hindi transliteration policy, which makes it easier to understand both Hindi and Urdu transliterations.

How should we go about this :) نعم البدل (talk) 19:50, 6 September 2023 (UTC)

Just to add, if we change the transliteration policy I will change Module:ur-translit to the new policy, since currently it is less buggy than the Punjabi module (which I will also look into fixing in the future). سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:53, 6 September 2023 (UTC)
And are we in accord with the opinion that it's better to have one module for all of the languages, or would we want separate transliteration modules/policies? نعم البدل (talk) 19:58, 6 September 2023 (UTC)
@نعم البدل: I know I wasn't pinged since I'm not active around here much, but IMO we should maintain consistency in transliteration firstly between Urdu and Hindi, and secondarily all of the Indo-Aryan languages. We have a lot of locations where both Hindi and Urdu equivalents get linked to and inconsistent transliteration will look ugly. I also don't particularly like the ALA-LC treatment of nasalisation and in general its overuse of underlines and digraphs when a single diacriticless letter works fine (e.g. kh instead of x). However, it seems like pa-Arab-translit doesn't exactly follow ALA-LC since I see x? Regardless, we should try to keep consistency with Hindi.
One point on which divergence from Hindi might make sense is the various Arabic script letters that get merged into one sound in speech (e.g. all the z's). But current practice is to not distinguish these in the transliteration. —AryamanA (मुझसे बात करेंयोगदान) 20:03, 6 September 2023 (UTC)
@AryamanA To add, I personally like how Urdu policy reserves underdots for retroflexes (e.g. ḍ ṭ ṛ). Even if we change the transliteration, I think the dot should still be reserved for retroflexes only. The fact that ص and ض transliterate as "ṣ" and "ẓ" is very confusing IMO. Also I prefer that words like کَئی transliterate as "kaī" not "ka'ī" since the apostrophe is also used for a glottal stop, and Hamza here is not acting as a glottal stop but as a Zero consonant. And I think مَیں transliterating as ma͠i is more understandable than maiṉ, since the tilde is such a universal way of showing nasal vowels. But if Urdu editors prefer that... then whatever. Besides that stuff, I don't feel strongly about any other changes. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 20:27, 6 September 2023 (UTC)
I have no issues with reserving the underdots for retroflexes. How would we go about transliterating the Hamza, or does it need to be transliterated, and if so the difference between the Ain and Hamza? نعم البدل (talk) 20:21, 6 September 2023 (UTC)
Well, current practice is to not include the hamza since it's only written to ensure vowels are paired to a consonant. Presumably, an Urdu reader would know that kaī is کئی since the ī needs to be paired to a consonant according to the rules of the Urdu alphabet. But perhaps there are benefits to including the hamza in transliterations that i'm not aware of. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 20:31, 6 September 2023 (UTC)
The only benefit I can think of is the fact that sometimes the Hamza is left out between diphthongs, and would technically become a misspelling. The transliteration may remind the user that the Hamza is necessary? نعم البدل (talk) 20:36, 6 September 2023 (UTC)
Hi @AryamanA: Thanks for being part of the discussion, feel free to invite others to this discussion as well. I don't agree with the ALA-LC fully either. I don't like the fact that it represents خ as kh, nor how it represents ش / ژ / غ. Since the ALA-LC standard is technically a romanisation standard, not a transliteration, I wouldn't be opposed to creating a policy that diverges from it. That said, I'm not too opposed to how it represents the nasal vowel, among other things, and I do think it's important to set a transliteration policy which represents how words are perceived and written by Urdu speakers, because technically, for instance, nasalisation in Urdu isn't the same as how it's perceived in Hindi, despite the pronunciation being no different. نعم البدل (talk) 20:19, 6 September 2023 (UTC)
@Sameerhameedy @نعم البدل I think we should make a decision based on what makes the most sense for Urdu, not either (a) how difficult it is to convert existing transliterations (which can be converted by bot) or (b) what the current state is. I also think transliterating Urdu and Hindi similarly is more important than transliterating Urdu and Punjabi similarly, since Urdu and Hindi are essentially the same language. Benwing2 (talk) 03:45, 7 September 2023 (UTC)
@Benwing2: It depends on what you mean by similar. Module:ur-translit at the moment doesn't produce the same results as Module:hi-translit, when it comes to nasalisation, for instance. The transliteration policy that we pick for Urdu would technically become the one for Shahmukhi Punjabi and other Pakistani languages anyway, and it would be pointless to have different TR policies for Urdu and Punjabi, when they are essentially the same when it comes to the spelling, grammar, alphabet etc. It's why I think it's better to make a policy that can work for basically all of the Pakistani languages, rather than create separate transliteration policies, like how Module:hi-translit isn't merely limited to Hindi. نعم البدل (talk) 15:23, 7 September 2023 (UTC)
@نعم البدل I don't understand why you think the translit policy we pick for Urdu would have to be the one for Panjabi in either spelling. I understand that Urdu and Panjabi use similar spelling principles but their phonologies differ, e.g. Panjabi has tone whereas Urdu does not. OTOH Urdu and Hindi are essentially the same language so clearly the translits should be as similar as possible. Benwing2 (talk) 06:01, 8 September 2023 (UTC)
@Benwing2: Because the Punjabi and the Urdu alphabets are exactly the same, and just so we're clear, you do understand I'm talking about the Shahmukhi script for Punjabi, not Gurmukhi, right? Yes, Hindi and Urdu are essentially the same language, but we're talking about transliteration, right? And transliteration is supposed to represent the spelling, not necessarily the pronunciation; Hindi and Urdu spelling/characters can't be mapped one to one, like they can be with Punjabi and Urdu (because the alphabet is exactly the same), so it would make sense that Punjabi Shahmukhi script would adopt the same TR policy. If we continue to use different TR policies then we'd have the word افطاری, for instance, transliterated as اِفْطَاری (iftārī) in Urdu and اِفْطَاری (īft̤ārī) in Punjabi. Btw, tones aren't marked in Punjabi. نعم البدل (talk) 08:37, 8 September 2023 (UTC)
@نعم البدل Yes, transliteration in the linguistic sense usually represents spelling but here at Wiktionary when we say "transliteration" we really refer to more like transcription, or a mix of traditional transliteration and transcription. "Romanization" would probably be a better term since the choice of how closely to hew to spelling or pronunciation depends on the language. Yes I'm aware that we're talking about Arabic-script writing of Panjabi (aka "Shahmukhi") not writing in the Gurmukhi script. Benwing2 (talk) 08:51, 8 September 2023 (UTC)
@نعم البدل As another data point, see the discussion just below on Modern Greek translit. Benwing2 (talk) 08:53, 8 September 2023 (UTC)
@Benwing2: or a mix of traditional transliteration and transcription – which is exactly what I'm hoping to achieve. The similarities between Hindi and Urdu should be maintained (which is why I think characters like ش should be transliterated as ش (ś) (Urdu) and not ش (š)) (Arabic/Persian), no quarrels with that, but to generalise the differences, or things which can't be translated directly like the various 'z' and 's' letters seems a bit inconsiderate, and it's not even to say that the Hindi and Urdu pronunciations will always be exactly the same – they can differ. نعم البدل (talk) 09:14, 8 September 2023 (UTC)
@Benwing2: To add to that, (ṣa) is transliterated as "ṣ", even though it becomes ش (ś) ś in Urdu. Same with (ṇa) and ن (n), not to mention the 'r' diacritics and other Devanagari letters. Surely these would also be just generalised to ś, n, r etc like we're choosing to do with Urdu letters, since native Urdu speakers would have no understanding of these letters/diacritics? نعم البدل (talk) 09:27, 8 September 2023 (UTC)
@نعم البدل I understand your concern about ś vs. ṣ and n vs. ṇ, which are pronounced the same in Urdu (and usually in Hindi as well). Urdu and Hindi don't have to be transcribed identically but should be as similar as possible considering the fact that they are a shared language, and Urdu-Hindi harmonization comes first in priority over Urdu-Panjabi harmonization IMO (ideally of course all three are harmonized). It isn't consistent cross-linguistically in Wiktionary whether to transliterate two letters with the same pronunciation the same or differently (see again the discussion below for Modern Greek, which for example has 5 or 6 distinct ways of writing the sound /i/, some of which are transliterated /i/ but others in other ways). As for the 'r' vs. 'ṛ', AFAIK both Hindi and Urdu speakers make this distinction; at least, this is what Hindustani phonology says. Even if Urdu spelling is not capable of making this distinction, the fact that the distinction is made in speech means it should be in the translit. Urdu spelling has a large number of extra Arabic-derived letters that don't correspond to differences in Urdu pronunciation, and it could be argued both ways in terms of whether we should distinguish them in the translit. For reference, Persian translit does not distinguish them: س ص ث are all transliterated 's'. Hebrew translit is the same way for modern Hebrew: כּ and ק are both transliterated 'k' while ת and ט are both transliterated 't', א and ע are both transliterated as apostrophe or left out depending on position, etc. (Essentially, modern Hebrew lost the emphatic and pharyngeal sounds but preserves them in spelling.) Biblical Hebrew is sometimes transliterated differently in a way that represents Late Biblical (specifically Tiberian) pronunciation. Benwing2 (talk) 10:02, 8 September 2023 (UTC)
@Benwing2:
  • As for the 'r' vs. 'ṛ' – Should clarify, I mean (ŕ) and (ŕ), not ڑ () / ड़ (ṛa).
  • Arabic-derived letters that don't correspond to differences in Urdu pronunciation – Generally, that is the case, but what about words like جَماعَت (jamā‘at) (जमात (jamāt)) and حَضْرَت (ḥaẓrat) (हज़रत (hazrat)), where Urdu speakers (attempt to) retain the original pronunciation and as a result have two pronunciations – a 'standard' or 'formal' pronunciation and the common/informal pronunciation (and by no means the only examples)?
  • Urdu spelling has a large number of extra Arabic-derived letters that don't correspond to differences in Urdu pronunciation, and it could be argued both ways in terms of whether we should distinguish them in the translit. The argument at hand.
I do understand that Persian and Hebrew follow a similar policy (and especially for the latter, I disagree there as well, but since I'm not a speaker of Hebrew, I have never really considered myself qualified enough to voice an opinion on the Hebrew TR policy; not to change the discussion, but is the Hebrew TR policy formalised? Last I checked, Module:he-translit isn't actually being put to use).
Also, it's not just about the letters per se. Urdu, for instance, uses a nasal vowel – a letter, to represent nasalisation, and Urdu speakers, when it came to Roman Urdu, either used a variant of n, or m (where applicable) to illustrate it. As I've said that nasalisation works differently in Urdu than in Hindi. A tilde, a diacritic makes sense for a Hindi TR, but not really for Urdu. Words like بُلَن٘د (bulãd) should just be transliterated as "buland", not "bulãd".
However, since it's clear that there's no easy solution to this, I've another suggestion which is that the source formatting should be represented in the translit and the headword/vocalised Urdu already automatically generates a Roman TR and a Devanagari TR, while a separate TR policy for the pronunciation should be utilised, which is common to both Hindi and Urdu (under the pronunciation section). That way, a reader can understand both settings. I've been trying something similar with Punjabi tonal words, and of course a module would make it much easier to handle all this, گَھر (ghar vs kàră) is a decent example. نعم البدل (talk) 10:41, 8 September 2023 (UTC)
Something similar to the new Template:fa-IPA. نعم البدل (talk) 10:46, 8 September 2023 (UTC)
Hi, none of the sources on the page حضرت indicate the pronunciation "hadrat". Is that really a thing? Current policy simply ignores letters that in official pronunciation are synonymous with another character. If you can prove that ض has a distinguished official pronunciation it could be changed. However, on the page a dictionary managed by Pakistan's Ministry of Education indicates the official/standard pronunciation is "hazrat". So it seems that, at least officially, ض and ز have the same pronunciation. (Of course this is about current policy; your proposed changes would be different.) This is not unique to Urdu: if you check the romanizations for ط, most Arabic script languages ignore characters without official pronunciations.

About the tilde, I can change that. But shouldn't a sukoon be used here?? Noon is already a dental, I don't see why noon would need a ghunna diacritic before another dental since it's not assimilating. And بُلَنْد (buland) (which is how the corresponding page vocalizes it) does not create a tilde. noon + sukoon is just an ordinary consonant, only noon + ghunna causes special nasalization.

Lastly, if you are aware of any consistent patterns for Punjabi tones I would gladly help you make an IPA module for Punjabi in the future :) (I'm working on a lot right now so I will not be able to get to it anytime soon). While including tones in Punjabi transliteration would be very cool, I can't implement something like that unless Punjabi changed their transliteration policy, as it currently ignores tones. I can't really change Punjabi transliteration without consensus, but if Punjabi editors wanted tones in translit I could try to do something like that in the future. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:33, 8 September 2023 (UTC)
@Sameerhameedy:
  • Hi, none of the sources on the page حضرت indicate the pronunciation "hadrat" – Yes it is a formal/religious thing, and I wouldn't be surprised that dictionaries or references don't mention it, since the rule of thumb is that ض is pronounced the same as ز, but in certain words it is pronounced as /d/, which is why I oppose the 'z letters' all being transliterated as just 'z', since, as I said, it generalises all pronunciations, merely to match it with Hindi. I'd say it's akin to Hindi ष and ण, yet as I mentioned earlier, we transliterate Hindi/Devanagari letters definitely.
  • if you check the romanizations for ط most Arabic script languages ignore characters without official pronunciations – Apart from Persian and not including Pakistani languages, which language that uses the Arabic script has had a detailed discussion on how Arabic loan-letters should be transliterated?
  • About the tilde, I can change that. But shouldn't a sukoon be used here?? – It doesn't matter whether a sukoon should be used or not. My point was it does not make sense to transliterate the noon ghunna diacritic, which goes on top of the noon, and is implied, be represented with a tilde and not just as the letter 'n' unlike Hindi's anusvara.
What I don't get is that if we do want to keep Hindi and Urdu TR policies alike, well then previously, Urdu transliterations were generated by Module:hi-translit, and it produced the exact same Hindi transliteration, then what's the need to formalise a TR policy for Urdu, or why bother with Module:ur-translit, since Module:hi-translit did the job previously?
  • Lastly if you are aware of any consistent patterns for Punjabi tones – I've noted a couple down that we can hopefully discuss, it may be difficult to create a module that can correctly generate the IPA for Punjabi lemmas, but it's worth a try, and if not the transliteration could always serve as a backup. @عُثمان could likely also point out some patterns in another discussion.
  • Punjabi changed their transliteration policy, as it currently ignores tones. – because tones aren't marked in Punjabi, and currently there is no defined way of marking tones in Punjabi in either script, hence not needed to mention in the TR policy? نعم البدل (talk) 21:47, 8 September 2023 (UTC)
Also, just a note, tones were never included in Punjabi's TR policy as far as I know? نعم البدل (talk) 22:08, 8 September 2023 (UTC)
@نعم البدل "Alike" != "Similar". I want them to be as similar as possible, not necessarily exactly the same. E.g. things like different representations of nasalization seem gratuitously different when the phonology is identical. Benwing2 (talk) 21:51, 8 September 2023 (UTC)
@Benwing2:
  • I want them to be as similar as possible, not necessarily exactly the same. – Even though we're talking about completely different scripts and letters can't be directly mapped to each other...
You mentioned that Transliteration is more like romanisation and not strictly a transliteration, right? So let's talk about Roman Urdu. Roman Urdu is like a pseudo transliteration/Roman script for Urdu. I don't think I've ever seen nasalisation in Roman Urdu being represented with a tilde. In fact the only times I've seen a tilde being used in Roman Urdu was to romanise the nasal vowel in Urdu, and it was done with the letter ñ – similar to ALA-LC's standard, except a different diacritic was used – but the point being that the letter 'n' is always included.
A tilde, and that too alone, to represent nasalisation in Urdu is unconventional. نعم البدل (talk) 22:01, 8 September 2023 (UTC)
@نعم البدل @Sameerhameedy
Yes, I could share some pointers on IPA transcription for Punjabi, preferably in a new thread if there is interest since it can get quite complicated. I agree that transliterations of Punjabi do not need to be concerned with tone, because Punjabi phonotactics allow the tone to be inferred from the spelling, and are governed by rules as if the original aspirate/breathy voiced consonants are still there. (For example, Punjabi words cannot have two different aspirated stops in them. A word like Bengali “abhidhān” is still not possible even though both bh and dh would lose their aspiration in most Punjabi dialects.) It is actually relatively easy to determine the tonal value for a given word; the challenging factor is syllabification and identifying the stressed syllable. In the word ਭੰਡਾਰ بھنڈار the tone is on the second syllable but in the word ਭਾਰੀ بھاری it is on the first. Meanwhile the word ਭਰਾ بھرا is monosyllabic.
Re: Urdu transliteration, I recommend following the transcription conventions used in Perso-Arabic Loanwords in Hindustani exactly (there are PDFs of this on various sites online). It makes a well informed balance between representing the words in a way which is both true to the phonology of the language and makes clear how these words are spelled in the Perso-Arabic script. This dictionary does not treat Hindi and Urdu separately and relies on multiple sources which predate the Hindi-Urdu controversy. With this in mind, the widespread use of Devanagari is recent and many Devanagari spellings are unetymological. There is no need to transliterate Devanagari spellings at all and Hindi entries should simply use transliterations generated from their Urdu spellings.
The distinction between where to indicate a nasalized vowel as opposed to a cluster with a nasal consonant should be based on the value of the preceding vowel and whether or not it occurs in the final syllable. After a long vowel in the final syllable is the only position in which tilde should be used. That is, ہوند is hõd and likewise ہُوں is hū̃ but ہوندا should be hondā. ہِند and ہِندی would both use n as in hind and hindī respectively. There is one difference between Hindi/Urdu and Punjabi in this regard: in syllables with ā in Punjabi ending in a consonant other than “v,” the nasal is still realized as a consonant. Hence Hindi/Urdu ā̃kh vs. Punjabi bhāng. Turner’s comparative dictionary follows these conventions in transcriptions. This may seem pedantic, but this is what is actually occurring in pronunciation. عُثمان (talk) 23:15, 8 September 2023 (UTC)
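
For anyone who wants to experiment with the positional rule described above, here is a minimal Python sketch. It rests on several assumptions that are not part of the proposal itself: the input is an already-romanized string in which an undecided noon is written as the placeholder `N`, long vowels are limited to ā ī ū e o (no ai/au handling), "final syllable" is approximated as "no vowel follows the noon", and the Punjabi exception for ā is not modelled. The name `resolve_nasal` is hypothetical.

```python
import unicodedata

# Assumed vowel inventory for the sketch (precomposed code points).
LONG_VOWELS = set("\u0101\u012b\u016beo")   # ā ī ū e o
VOWELS = LONG_VOWELS | set("aiu")
COMBINING_TILDE = "\u0303"


def resolve_nasal(word: str) -> str:
    """Replace each placeholder 'N' with a tilde or a plain 'n'.

    A tilde is used only after a long vowel when no vowel follows,
    i.e. when the noon sits in the final syllable.
    """
    word = unicodedata.normalize("NFC", word)
    out = []
    for i, ch in enumerate(word):
        if ch != "N":
            out.append(ch)
            continue
        prev = word[i - 1] if i > 0 else ""
        vowel_follows = any(c in VOWELS for c in word[i + 1:])
        if prev in LONG_VOWELS and not vowel_follows:
            out.append(COMBINING_TILDE)  # nasalized vowel
        else:
            out.append("n")              # nasal consonant cluster
    return unicodedata.normalize("NFC", "".join(out))
```

Under these assumptions the four examples in the comment (hõd, hū̃, hondā, hind) come out as stated there; whether the approximation of "final syllable" holds for real vocabulary would need checking against actual entries.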
@عُثمان Okay, I look forward to discussing Punjabi with you once I'm available. I do have a question about Urdu, though, from what you mentioned.
noon + ghunna + gaaf should always equal "ṅg", correct? And noon + sukoon + gaaf should be "ng", correct? It seems those are the only two types of noon before gaaf. Or is noon always "ṅ" before gaaf, regardless?
But it seems that, unlike gaaf, Urdu allows 3 (4 including meem) nasals before kaaf: a nasal vowel, a velar nasal, and/or a dental nasal! And I'm super confused about how to distinguish them from each other in the Urdu alphabet.
should it be:
n + ghunna + kaaf = ā̃k
n + sukoon + kaaf = ank
n + kaaf = aṅk
??? If that's the case I have to give the module an exception for an "n + k" combination, as that's currently not allowed without a diacritic between them. n+k is the only diacritic-less exception needed, correct?
Thank you! سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 23:52, 8 September 2023 (UTC)
@Sameerhameedy To be completely honest with you here--and others may feel free to disagree with this--I think sukoon/jazm and the gunna marker are misleading in Urdu and there is no reason to use them. Urdu writers often have a preference for a "zer-o-zabar" style of writing as Pakistanis might call it that uses diacritics solely for decorative purposes. Modern Punjabi and Sindhi dictionaries published in Pakistan never use sukoon or the gunna marker.
The distinction between a nasal vowel and consonant depends on the vowel it comes after. So آنک = ā̃kh while انک = ank (aṅk). However, if there is another syllable after as in آنکا then that may be written as ānkā (āṅkā). You are correct to say that before k and g the nasal consonant is specifically the velar ṅ rather than alveolar n, but because this is always true distinguishing this detail is optional. If we say the word "sink" in English, the "n" is not actually in the same place as in "bend." There is only one possible pronunciation of "sink" though, we cannot use a different "n" sound, so no different letter is needed. Sanskrit used to represent ṅ separately in writing but this practice was ended in the modern languages which use Devanagari because there was no need to represent nasal clusters differently from one another. Whether you think ṅ is useful to show is your choice.
vowel + meem + k/g almost never occurs in Urdu in the middle of a word, but meem is always simply m. m is never allowed in clusters at the end of words in Urdu, so a vowel must be inserted if the meem + k/g is the end of the word. عُثمان (talk) 00:30, 9 September 2023 (UTC)
@عُثمان Since we are not a monolingual dictionary intended for native speakers who already know the vowels, we will use sukuun and gunna. Also I don't understand your statement "There is no need to transliterate Devanagari spellings at all and Hindi entries should simply use transliterations generated from their Urdu spellings"; this is definitely not going to happen. Benwing2 (talk) 00:35, 9 September 2023 (UTC)
> Since we are not a monolingual dictionary intended for native speakers who already know the vowels, we will use sukuun and gunna.
The transliteration should suffice to aid non-native speakers, and the rules are consistent enough that sukuun and gunna are not necessary to produce accurate transliterations.
> "There is no need to transliterate Devanagari spellings at all and Hindi entries should simply use transliterations generated from their Urdu spellings"; this is definitely not going to happen.
This is like insisting Chinese entries should not use pinyin. Modern Hindi orthography is partly ideographical in a way that Urdu orthography just is not. The word ऋषि is pronounced the exact same as رِشِی while the word आदि is pronounced آد despite the fact that both of these words are written as if they end in the same vowel. Trying to transliterate these exactly is ignoring the fact that Hindi is intentionally written in a way that does not follow consistent patterns. عُثمان (talk) 00:46, 9 September 2023 (UTC)
If short vowels were never nasalized and long vowels never occurred before a regular noon, then it would be possible to transliterate Urdu without those diacritics. However, since neither of those is true, the module cannot tell whether a noon is noon ghunna or a regular noon without diacritics. Also, to prevent transliterations of blank words, the module will go blank if there aren't enough vowels. So a sukoon is needed for consonant clusters, regardless. Based on Hindi translit, "aṅg" and "ang" are possible but "ãg" is not possible, and is removed. So noon ghunna will represent "ṅ" before gaaf and a nasal vowel elsewhere. Since Hindi translit allows "aṅk", "ank" and "ãk", I will have to count n+k as one consonant, since there's no other way I can think of to show that three-way distinction. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 00:46, 9 September 2023 (UTC)
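
One possible reading of the noon rules sketched in this comment, written as a small Python decision function. This is not the code of Module:ur-translit (which is a Lua module); the names `NoonMark` and `classify_noon` are hypothetical, and the function only models a noon that is immediately followed by a consonant.

```python
from enum import Enum

GAAF = "\u06af"  # گ
KAAF = "\u06a9"  # ک


class NoonMark(Enum):
    GHUNNA = "ghunna"  # noon carrying the ghunna diacritic
    SUKOON = "sukoon"  # noon carrying sukoon
    NONE = "none"      # bare noon, no diacritic


def classify_noon(mark: NoonMark, next_letter: str) -> str:
    """Return 'n', 'ṅ' (U+1E45), or '~' for a nasalized vowel."""
    if mark is NoonMark.SUKOON:
        return "n"                      # ordinary consonant: ang, ank
    if mark is NoonMark.GHUNNA:
        # ghunna before gaaf is the velar nasal; elsewhere a nasal vowel
        return "\u1e45" if next_letter == GAAF else "~"
    if next_letter == KAAF:
        return "\u1e45"                 # bare n+k treated as one unit: aṅk
    # Assumed behaviour: any other bare noon before a consonant is rejected.
    raise ValueError("bare noon before a consonant needs a diacritic")
```

The single diacritic-less exception (bare noon before kaaf) is what gives the three-way aṅk / ank / ãk distinction in this reading; every other bare noon before a consonant is rejected, mirroring the statement that the module goes blank when vocalisation is insufficient.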
  • think sukoon/jazm and the gunna marker are misleading in Urdu and there is no reason to use them. – The issue is that without the use of the sukoon or the gunna marker, things can get pretty ambiguous. It would be difficult to differentiate Urdu words like حَیرَانْگی (hairāngī) (alveolar n) and آن٘کھ (āṅkh) (velar n). نعم البدل (talk) 00:48, 9 September 2023 (UTC)
    @نعم البدل I would put zabar instead on حیرانَگی since a vocalic release (IPA /ᵊ/ micro-schwa) is necessary to intervene in the cluster as /ng/ does not form true clusters. Even if that were not the case though, words ending in نگی are the only exception I am aware of and we can simply say all words with this ending use alveolar /n/. It is specifically inherited from Persian words following this exact pattern. عُثمان (talk) 00:58, 9 September 2023 (UTC)
    I've never really felt a micro-schwa here, and I couldn't hear it in UDB's recording either. It may only be in Persian-derived words ending in نگی, but we would still need to be able to differentiate between words such as حَیرَانْگی (hairāngī) and جَان٘گی (jāṅgī), and I don't see how a program can differentiate them based on the lemma alone, and without the use of sukoon/ghunna markers. نعم البدل (talk) 01:08, 9 September 2023 (UTC)
    Hmm when I compare those to نارنگی I see what you are saying. There is an affected mute pause between the sounds in the UDB recording which I would expect to hear a voiced vowel during, but this may be a genuine difference between Muhajir Urdu and Urdu as used by Punjabis. If we were to mark sukoon on حیرانگی and words like it though, I think it would be safe to assume نگ is velar in its absence. (Part of my concern here is simply that many Urdu fonts are no longer legible when these diacritics are added. The diacritics are both covered up by گ on my device in this case.) عُثمان (talk) 01:21, 9 September 2023 (UTC)
    @عُثمان:
    • but this may be a genuine difference between Muhajir Urdu and Urdu as used by Punjabis – That was my first thought as well.
    • think it would be safe to assume نگ is velar in its absence. Yeah seems fair, but we should also include the ghunna, just in case some user does add it in, so that it doesn't just produce a nil error. نعم البدل (talk) 01:28, 9 September 2023 (UTC)
    @نعم البدل Hi just letting you know, while it's still a work in progress and could change. I am probably gonna follow the recommendation of Pakistan's Ministry of Education (who regularly the dictionary "Urdu Lughat"), which solely uses ghunna for nasal vowels. However, since nasal vowels cannot appear before gaaf, gaaf will probably be an exception.
    But for the sequences /ŋk/ and /ŋɡ/ Urdu Lughat uses no diacritics. So the module will probably treat نک and نگ (no diacritics) as a single letter. Unless @Benwing2 can think of a better solution. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 01:16, 9 September 2023 (UTC)
    @Sameerhameedy: I was so confused initially, because I could've sworn UDB did use ghunna markers, and they do use them, for both /ŋk/ and /ŋɡ/. It's when you actually click on the specific word that the ghunna marker comes up, not when retrieving results. See آن٘کھ and جن٘گ. In any case, at the minute I'm leaving Module:ur-translit in your hands and only passing my feedback. نعم البدل (talk) 01:22, 9 September 2023 (UTC)
    @Sameerhameedy Ideally there should always be a diacritic between consonant clusters so we can fail the translit if the diacritic is missing. It sounds like per User:نعم البدل the diacritic in /ŋk/ and /ŋɡ/ is ghunna, which seems a good solution. Benwing2 (talk) 01:28, 9 September 2023 (UTC)
    @Benwing2 Well we can use the ghunna mark before gaaf, it seems like the MOE does that as well. But my main concern was that, unlike before gaaf where noon only has two pronunciations, for kaaf there is a three-way distinction between ãk, aṅk, and ank. I know for sure that nasal vowel + gaaf is impossible in both Hindi and Urdu. However in Hindi, nasal vowel + ka is possible and is distinguished from velar + ka. And unlike other phonemes, the nasal vowel + kaaf distinction in Hindi was not borrowed from Sanskrit, and is present in many Hindustani words. Additionally @عُثمان seemed to indicate that the sequence existed in Urdu as well. Based on what @عُثمان told me though, nasal vowel + kaaf only happens with long vowels and short vowels are always(?) velar + kaaf. If that's the case I can probably use ghunna for both the nasal vowel and velar nasal consonant. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 02:10, 9 September 2023 (UTC)
    @Sameerhameedy:
    • where noon only has two pronunciations, for kaaf there is a three way distinction between ãk, aṅk, and ank. – Once again putting forward my suggestion that noon + ghunna mark being transliterated as simply "ṉ", while alveolar n (noon + sukoon) as simply "n" could solve all this, while Template:ur-IPA should actually be used to clear up any ambiguity in the pronunciation. نعم البدل (talk) 02:26, 9 September 2023 (UTC)
    That would be no different than only transcribing the sequence as "aṅk" or "ãk" سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 03:09, 9 September 2023 (UTC)
    @Sameerhameedy Can you give some examples of the three-way contrast? Maybe that will help resolve whether we need another (ad-hoc?) symbol. Benwing2 (talk) 03:21, 9 September 2023 (UTC)
    ────────────────────────────────────────────────────────────────────────────────────────────────────
  • That would be no different than only transcribing the sequence as "aṅk" or "ãk – Yeah but, instead of having 3 ways of transliterating the noon, you'd have two, 1. when it's nasal, 2. when it's not, and it would be closer in representing the actual script. نعم البدل (talk) 01:12, 10 September 2023 (UTC)

@Benwing2 (not indenting because it's getting too hard to read.) So I've been reading various papers and found nothing; however, I saw this on Wikipedia:
The palatal and velar nasals occur only in consonant clusters, where each nasal is followed by a homorganic stop, as an allophone of a nasal vowel followed by a stop, and in Sanskrit loanwords. However /n/ + velar clusters also occur, eg. /ʊn.kaː/ making /ŋ/ phonemic. Could not find any other information on this unfortunately. It seems as though the nasal consonant "n" is pretty stable in front of velar consonants, but the nasal vowel is not. (Which would explain why hi-translit does not allow ãg). Though in Hindi entries आँख (ā̃kh) and बैंक (ba͠ik) are transliterated differently, I have no idea why that is. In the absence of any information confirming that stable nasal vowels exist before velars, we can go with the assumption that nasal vowels are always unstable before velars and assume that the difference in Hindi is merely orthographic and not distinctive. The only question is then, should n + ghunna + kaaf = aṅk or ãk? سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 03:43, 9 September 2023 (UTC)

> If short vowels were never nasalized
Short vowels are in fact never nasalized in Urdu/Hindi to my knowledge. Consider the following vocalizations, with velar ṅ just to demonstrate:
  • کَنک = kaṅk
  • کنَک = kanak (wheat)
Zabar should be used to indicate when نک does not form a cluster since that is the less common situation. Hindi/Urdu pronunciation does not allow consonant clusters word initially, so the first consonant should always be followed by a vowel if unmarked. عُثمان (talk) 00:51, 9 September 2023 (UTC)
@Sameerhameedy Er, I should say short vowels are never nasalized in Urdu/Hindi unless followed by /h/ as in مُنہ. I forgot about that as Punjabi always lengthens the vowel in such cases. (Hence مونہہ or مینہہ etc). عُثمان (talk) 01:00, 9 September 2023 (UTC)
I believe the current Urdu transliteration works quite well, and I think an overhaul would take such an extensive effort, checking every Urdu page, that it just isn't necessary, as the current system doesn't have a significant number of ambiguities, if any. SAA2002 (talk) 01:41, 8 September 2023 (UTC)
@SAA2002: - Hi, sorry. Could you clarify, do you mean the one that differentiates س from ص and ث etc. (i.e. Module:pa-Arab-translit), or the one that's similar to Hindi and just transliterates them all as 's' (Module:ur-translit)? نعم البدل (talk) 08:39, 8 September 2023 (UTC)
In their most recent edit (before this discussion) they transliterated اُطفاً as "lutfan", so presumably the current transliteration, as that's what they have been using. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:44, 8 September 2023 (UTC)
@Sameerhameedy @SAA2002 By "current" do you mean the one documented in WT:UR TR and used in most manual entries, or the one since a week ago based on the Panjabi translit module? I'm pretty sure the former but we need to clarify. Benwing2 (talk) 20:49, 8 September 2023 (UTC)
The one that is documented in UR TR سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 20:51, 8 September 2023 (UTC)
The one that translates them all as "s" as Urdu pronunciation is closer to Persian (particularly to Dari Persian) rather than Arabic so therefore it makes sense since both languages don't make a distinction between the sounds in speech SAA2002 (talk) 15:11, 12 September 2023 (UTC)
Thanks for clarifying.
  • so therefore it makes sense since both languages don't make a distinction between the sounds in speech – but we have a pronunciation section for that, to explain the phonology of the lemma. TR should be used as a learning tool, you know, for a user to understand that it's not a native س (s) being employed here. Like I say, we make the distinctions for (ṇa), (ṣa), (ŕ) already. نعم البدل (talk) 21:31, 12 September 2023 (UTC)
@Sameerhameedy You can use {{outdent}} followed by the appropriate number of colons to draw a line indicating the outdenting. I seem to remember going back and forth on whether to transliterate the chandrabindu differently from the anusvara; this has probably been changed at least once. Wikipedia's article on Anusvara has a detailed section on the pronunciation of these symbols and whether they represent a nasal vowel or a homorganic nasal consonant; it seems it depends on the phonological environment, with some lexical exceptions, but it's implied that chandrabindu = nasal vowel and anusvara = homorganic nasal consonant is more theoretical than reality, with a lot of spelling alternations. Note that the existence of lexical exceptions suggests that there is in fact a phonemic difference between nasal vowel and velar nasal consonant before /k/ and /kh/, but there may not be any minimal pairs. Benwing2 (talk) 04:39, 9 September 2023 (UTC)
@Benwing2 Okay so for right now I'll make ghunna always be a nasal vowel except before gaaf and qaaf. Urdu Lughat uses noon + ghunna for words that are transliterated as ā̃k and aṅk in Hindi, so I'm under the assumption Urdu has no diacritics to distinguish those. Since ghunna + gaaf and ghunna + qaaf are (seemingly) always ŋg or ɴq, but ghunna + kaaf does not seem to always be ŋk, ghunna + kaaf will not collapse into a consonant (at least right now). سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 08:48, 9 September 2023 (UTC)
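
For readers following the thread, the interim rule stated above (ghunna is a nasal vowel everywhere except before gaaf and qaaf) can be sketched roughly as follows. This is only an illustrative Python sketch, not the code of Module:ur-translit; the function name `classify_ghunna` and the two category labels are made up for the example, and it assumes that the letter after the ghunna mark follows it directly with no intervening diacritic.

```python
# Code points are written as escapes so the example does not depend on fonts.
NOON = "\u0646"         # ن
GHUNNA_MARK = "\u0658"  # noon ghunna mark written over the noon
GAAF = "\u06AF"         # گ
QAAF = "\u0642"         # ق


def classify_ghunna(word: str) -> list[str]:
    """Classify every noon + ghunna mark in `word` (hypothetical helper).

    Assumption: the next letter immediately follows the ghunna mark.
    """
    readings = []
    for i in range(len(word) - 1):
        if word[i] == NOON and word[i + 1] == GHUNNA_MARK:
            following = word[i + 2] if i + 2 < len(word) else ""
            if following in (GAAF, QAAF):
                # Before gaaf or qaaf the sequence collapses into a nasal consonant.
                readings.append("nasal consonant")
            else:
                # Everywhere else (including before kaaf, for now) it marks a nasal vowel.
                readings.append("nasal vowel")
    return readings


jang = "\u062C\u0646\u0658\u06AF"         # جن٘گ
aankh = "\u0622\u0646\u0658\u06A9\u06BE"  # آن٘کھ
print(classify_ghunna(jang))
print(classify_ghunna(aankh))
```

Under these assumptions جن٘گ is classified as containing a nasal consonant and آن٘کھ as containing a nasal vowel, which matches the "ghunna + kaaf will not collapse" decision above.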

About the transliteration of Greek initial ντ-

Module:el-translit currently transliterates word-initial ντ-, pronounced /d/, as d-. This clashes with word-initial δ-, pronounced /ð/, also transliterated as d-. This leads to δε (de) and ντε (nte) being transliterated the same even though they're pronounced differently, which means the flaw of the transliteration system is not its irreversibility, but rather the fact that the transliteration is making a worse job at dictating pronunciation than the original orthography, while what should generally happen is the opposite.

The solution that comes to mind is simply to transliterate the two as nt- and d- respectively. Checking w:Romanization of Greek#Modern Greek I see most systems seem indeed to take this approach. The two systems that don't either spell δ- as dh- or ντ- as ḏ-. I personally don't like either of these two last approaches, though it's important to note we're the only ones merging the two sounds under the same letter, together maybe with things like passport transliteration, which isn't exactly a good starting point for a dictionary.
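
To make the clash concrete, here is a small Python sketch. It is not the code of Module:el-translit; the four-letter table and the `translit` function are hypothetical and only cover the letters needed for the two example words. The `initial_nt` parameter selects how word-initial ντ- is rendered.

```python
# Toy letter table covering only the letters used in the two examples.
LETTERS = {"δ": "d", "ε": "e", "ν": "n", "τ": "t"}


def translit(word: str, initial_nt: str) -> str:
    """Transliterate `word`, rendering word-initial ντ- as `initial_nt`."""
    prefix = ""
    rest = word
    if word.startswith("ντ"):
        prefix = initial_nt
        rest = word[2:]
    return prefix + "".join(LETTERS[ch] for ch in rest)


# Behaviour described above: initial ντ- becomes d-, colliding with δ-.
print(translit("δε", "d"), translit("ντε", "d"))
# Proposed fix: initial ντ- becomes nt-, so the two words stay distinct.
print(translit("δε", "nt"), translit("ντε", "nt"))
```

With `initial_nt="d"` both words come out as "de"; with `initial_nt="nt"` they come out as "de" and "nte".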

The analogous μπ- and γκ- don't seem to cause problems as they stand. γκ- is already transliterated as gk-, while μπ- being b- isn't an issue since β- is v-. I personally would transliterate μπ- as mp- for the sake of consistency with nt-, but the systems listed at Wikipedia don't seem to worry about the inconsistency, so neither must we.

I see this had surfaced on the module's talk already, but no solution was reached. @Erutuon, Saltmarsh. Catonif (talk) 20:26, 7 September 2023 (UTC)

@Catonif Also pinging @Sarri.greek. The issue seems to be whether we translit more like the spelling or more like the pronunciation. IMO we should choose one principle and use it consistently. Benwing2 (talk) 06:19, 8 September 2023 (UTC)
I like the BGN/PCGN system in this respect, more phonetic but it uses "dh" for most of "δ" readings. Perhaps it should be "ð" or "dh"? Anatoli T. (обсудить/вклад) 06:32, 8 September 2023 (UTC)
@Atitarev I am inclined to agree with you, we should prefer a more phonetic rather than spelling-based representation. I think either ð or dh would work, maybe ð is better since it is a single character. Benwing2 (talk) 06:49, 8 September 2023 (UTC)
@Benwing2, @Sarri.greek: We do use "th" for "θ", which could use "θ"
δ (d) and θ (th) are a voiced/voiceless pair. They could be both either "ð" and "θ" or both "dh" and "th". Anatoli T. (обсудить/вклад) 06:58, 8 September 2023 (UTC)
I am not a regular Greek editor, but my preference would be to shift the romanizations of modern Greek δ and γ to dh and gh respectively, of word-initial γκ to g, and of medial γκ to nk (the romanization as gk in words like άγκυρα is inconsistent with the romanization of γγ as ng in words like αγγλικός).--Urszag (talk) 08:29, 8 September 2023 (UTC)
@Urszag I am broadly in agreement with you. My only potential difference would be to use IPA symbols for the fricatives, hence ɣ ð in place of gh dh, but I won't insist on this. Benwing2 (talk) 08:32, 8 September 2023 (UTC)
  • Speaking of inconsistencies between letter-based and sound-based transliterations, ει, η, ι, οι, υ are all spelled differently and pronounced the same, but our transliteration is a mixed bag, as η and ι are transliterated the same (i), while ει, οι, υ are each transliterated differently (ei, oi, y). αι, ε are also pronounced the same but transliterated differently (ai, e); but ο, ω are transliterated the same (o). If we go for a letter-by-letter transliteration, then for consistency η and ι should be transliterated differently, as should ο and ω. If we go for a sound-by-sound transliteration, then everything pronounced /i/ should be transliterated the same, as should everything pronounced /e/ and everything pronounced /o/. —Mahāgaja · talk 09:02, 8 September 2023 (UTC)
    @Mahagaja Agreed. Benwing2 (talk) 09:01, 8 September 2023 (UTC)
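
The mixed treatment of vowels described above can be checked mechanically. The Python sketch below is purely illustrative: the two tables simply restate the mappings and sound groupings given in the comment, and the function name `consistency` and its three labels are invented for this example.

```python
# Current transliterations as listed in the comment above.
CURRENT = {
    "η": "i", "ι": "i", "ει": "ei", "οι": "oi", "υ": "y",
    "αι": "ai", "ε": "e",
    "ο": "o", "ω": "o",
}

# Graphemes that are pronounced the same, keyed by the shared sound.
SAME_SOUND = {
    "i": ["ει", "η", "ι", "οι", "υ"],
    "e": ["αι", "ε"],
    "o": ["ο", "ω"],
}


def consistency(sound: str) -> str:
    """Say which principle the current table follows for one sound."""
    graphemes = SAME_SOUND[sound]
    distinct = {CURRENT[g] for g in graphemes}
    if len(distinct) == 1:
        return "sound-based"   # every spelling merged into one output
    if len(distinct) == len(graphemes):
        return "letter-based"  # every spelling kept apart
    return "mixed"             # some merged, some kept apart


for sound in SAME_SOUND:
    print(sound, consistency(sound))
```

On these tables /i/ comes out as mixed, /e/ as letter-based and /o/ as sound-based, which is the inconsistency being pointed out.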
Transliterations-policy. From latin to arabic, cyrillic, greek, han>pinyin, ... scripts. From arabic to cyrillic, greek, han>pinyin, latin... From cyrillic to arabic, greek, han>pinyin, latin,... and so on. I wish there were ISO translit-lists and standalone modules at Commons for all wiktionaries to use, as it is difficult to find all ISOs for all from.to combinations, plus monitor possible official changes.
At the moment, for greek-to-latin:
  • 1) compulsory for Modern Greek (el): according to ISO (the corresponding greek institution ELOT), as in passports, products etc. Not a matter of preference.
  • 2) a second transliteration: the popular one, which usually is more phonetic (note, that in greek passports we are allowed to use both 1 +of.our.own.choice, e.g. I am E.Sarri (Ekaterini Sarri), K.Sarri (Katerina Sarri) and A.Sarri (Aikaterini Sarri). Cashiers believe I am a crook using 3 identities.)
If your policy is 1, at el.wiktionary it is wikt:el:Βικιλεξικό:Μεταγραφές/ελληνικά#Πίνακας column ΕΛΟΤ = w:en:ISO_843. For earlier Greek from ancient up to 1982, the columns 'grc'. At en.wikt, there are interventions to ISO as in the column 'άλλα' (other), but this is in fact the following:
If your policy allows also 2 it is the last column 'άλλα (other)' which gives the 'popular' ones, as initial Nt=D, initial Mp=B (b latin), Δ=D becomes Dh. (Tricky: also middle mp can be 'b', nt=d, etc if the word is non-greek. e.g. μπαμπάς = babás, not the ridiculous bampás or mpampás. These can be done only manually.)
For transliterations of greek surnames (example at el.wikt) we are allowed to give ISO+popular+earlier (manually anyway, not having the luxury of modules).
It seems, that at the moment en.wikt uses a combination of 1&2, which unfortunately may vary and change at any moment. Since all wiktionaries copypaste from en.wikt, it would be nice if you decide on fixed policies. Would be nice to allow a second manual translit. Thank you. ‑‑Sarri.greek  I 09:52, 8 September 2023 (UTC)
I think there is a larger discussion to be had on what to do in general in cases where the spelling makes extra distinctions not found in speech. See the parallel discussion above about Urdu. Current Persian translit transliterates multiple Arabic letters the same, reflecting Persian pronunciation. The same is done for Modern Hebrew translit, Ottoman Turkish translit and for Urdu translit, although User:نعم البدل is trying to change this for Urdu. This suggests we should do the same for Modern Greek, meaning all symbols pronounced as /i/ should be reflected as i, same as e and o. OTOH there is some benefit to maintaining the source distinctions in the translit, so maybe we need a system that reflects pronunciation whenever possible but without merging differently-spelled letters (to the extent possible). But maybe it's better just to require people who want to know the source spelling to just look at the source spelling directly. Benwing2 (talk) 10:10, 8 September 2023 (UTC)

────────────────────────────────────────────────────────────────────────────────────────────────────Created before reading @Sarri.greek

The transliteration of Greek letters into Roman characters is not intended to provide a phonetic representation of a word. The correct place for that is under the Pronunciation heading.
And it is impossible for transliteration of Modern Greek to work both ways — so we need to confirm that, although it does give a simple guide to pronunciation that is not what we are about. That being the case, and if ELOT 743's guidance is to be followed the letters ν & τ can be transliterated separately as n & t wherever they occur.
Mea culpa: the error crept in in March 2013 when Module:el-translit was being created. I was contacted, it was a few years since I had created the table. I referred to the Library of Congress guidance which indeed does recommend that initial ντ transliterates as d. I think we should ignore this and follow ELOT 743 and the Greek govt's advice, and transliterate them separately wherever they occur.
And Benwing's comment re η & ι >> i is correct.   — Saltmarsh🢃 10:16, 8 September 2023 (UTC)
I agree that closeness to IPA is not the scope of a transliteration. Using IPA characters in Greek transliteration is at least for me completely out of the question. I also agree on fully endorsing the transliteration (not the transcription) of ELOT 743 = ISO 843, i.e. the third column of the table at en.wiki. Sarri makes a good point mentioning μπαμπάς, if the module can't recognise the second μπ as b, to keep things fair we should keep the first digraph as mp- as well, since having the same digraph pronounced the same but transliterated differently in the same word is indeed wild, making a transliteration like bampás outright misleading. The only divergence I'd instinctively suggest is γ as n before all velars, even in initial γκ-, for consistency with mp- and nt-, but given absolutely no transliteration system seems to do this I guess I'm the weird one, and overall respecting to the letter a strong standard from ELOT/ISO is definitely a step forward from following our own preferences. Catonif (talk) 12:11, 8 September 2023 (UTC)
I hadn't realized that many of the inconsistencies (as it appears to me) in Wiktionary's romanization system are found in existing standards. It does seem simplest to follow them, although I'm now somewhat questioning the value of including this kind of romanization in the first place--we don't have space limitations, but it's not like the Greek alphabet is difficult to learn, and if the romanization isn't meant to provide an accurate guide to either pronunciation or spelling, what use is it? We don't need to create passports with an English language field for our words, after all. (For comparison, I just checked our Ukrainian entries and see that they do not seem to use Ukrainian-passport-style romanization--e.g. Зеленський is romanized on Wiktionary as Zelénsʹkyj.) Can we compare what other Greek-English dictionaries do? It looks like WordReference doesn't bother with giving a romanization.--Urszag (talk) 12:48, 8 September 2023 (UTC)
You bring up a reasonable point, most linguistic resources don't seem to transliterate Greek. Many etymological dictionaries in Latin-script languages leave Greek as it is, while they give, e.g., Cyrillic languages only as transliterated. But it must be noted though that these are all linguistical contexts where such knowledge is rightly taken for granted, while we on the other hand stand in the middle of the Internet, and I'm sure there are many people that find the transliteration very useful. It doesn't hurt anyways, when done well, aside from taking some more space. Catonif (talk) 15:29, 8 September 2023 (UTC)
I agree that Wiktionary is not the appropriate place for leaving Greek untransliterated. I think using ISO 843 is a good idea. —Mahāgaja · talk 16:00, 8 September 2023 (UTC)
I think it's good for us to have a house style. We don't need to follow anyone else's standards. I think our current system is the best one except for the issue raised up above where there are two d's ... if this evolves into a formal vote the only change I would recommend is rewriting δ as dh. Soap 16:09, 8 September 2023 (UTC)
@Saltmarsh Most transliteration systems at Wiktionary are phonetic when possible, e.g. in Russian we transliterate г as v when it's pronounced as such, and we transliterate е as ɛ to indicate that the preceding consonant is unpalatalized; but at the same time we don't apply akanje, i.e. we don't merge unstressed vowels in translit even though they are merged in pronunciation. I think it's extremely misleading to transliterate ντ as nt when that is not at all how it's pronounced. I agree with User:Soap we don't need to follow someone else's standards; other languages don't insist on following some particular standard, for example. If we follow the Russian example, we should maintain all vowel distinctions but transliterate ντ as d word-initially and nd word-medially. If there are words where ντ word-medially is pronounced as d, those require manual translit. Benwing2 (talk) 21:15, 8 September 2023 (UTC)
@Benwing2 How is nt "misleading when that is not at all how it's pronounced"? It's an orthographical rule, all languages have orthographical rules. I find it misleading Irish baothchaifeach is spelled like that even though it's "not at all how it's pronounced", yet it's not like we add |tr= to Irish, we just leave that matter to the pronunciation section. Saying that something "is not written the way it's pronounced" makes no sense. I, as an Italian, could equally say that English is not written the way it's pronounced because Anglophones don't pronounce night as */ˈniɡt/, and they could say the same of my language because we don't say */pɪsˈtætʃiəʊ/, even though both spellings make perfect sense when in the context of their respective languages. Nothing is "not the way it's pronounced", that just means it's following a different system, neither better nor worse than the system you may be used to. Who tells us that nt should stand for /nt/? Because of IPA? Because of English? Greek transliteration is Greek transliteration, and if Greek transliteration says that nt- is /d-/, there is nothing that can contradict it.
It also seems like we need to clarify, transliteration is not transcription, as Saltmarsh noted. I would like people to embrace the distinction: transliteration is orthographic, transcription is phonetic. Adding manual "transliterations" to differentiate between medial /b/ and /mb/ while still calling them "transliterations" would be an outright lie, as it would imply the two are somehow differentiated by the orthography. As emerged from this talk's branch with Urszag and Mahagaja, transliteration is meant to aid people who don't know the script, not people who don't know Greek orthography rules. Again, that's a job for the pronunciation section, as with Irish, Italian and English, as with any other language.
I'm not knowledgeable about Russian, so I'm not sure how to best address the examples you provided, but I'll try, correct me for any of my inaccuracies. If ⟨г⟩ as v can be more or less equated to how Greek ⟨γ⟩ stands for n before velars, and may be transliterated as such, then I can see why it would make sense, but as you mentioned we don't transliterate akanje, nor do we transliterate voicing assimilation (e.g. votka) and with good reasons; I can only imagine how well the Russian editing community would take these suggestions. I can equate ⟨г⟩ as v to ⟨γ⟩ as n, but I believe keeping ⟨ντ⟩ as nt is no different from keeping ⟨д⟩ as d when it stands for /t/ or ⟨о⟩ as o when it stands for /ɐ/. (I'm addressing this only for completeness, I hope the discussion doesn't fully branch out into Russian. The important parts of this reply are the first two paragraphs.) Catonif (talk) 22:20, 11 September 2023 (UTC)
Oops, I'd missed that the standards you and Saltmarsh wanted to follow were the strict transliterations (and so, you want to change the current behavior of the module). I was confused because ELOT 743 refers to (at least?) two standards, and the transliteration standard is not the one used on passports (which had been mentioned by Sarri.greek as a context where ELOT 743 was "compulsory for Modern Greek"). I definitely find it more sensible to include the ELOT 743 transliteration rather than the ELOT 743 transcription. That would mean getting rid of b.--Urszag (talk) 22:52, 11 September 2023 (UTC)
@Catonif: I support a more phonetic but not entirely phonetic transliteration. We transliterate г (g) as "v" and ч (č) as "š" when they are (unexpectedly) pronounced as such, even though there is a grammatical reason for that.
With letters e.g. д (d) or о (o), we don't transliterate them differently, since consonant devoicing/voicing, vowel reduction is standard and predictable and can be even applied to the transliterated reading when stress is provided (and it is).
I'd like to handle μπ and ντ, etc. the same way. I treat them as digraphs.
The Greek alphabet is not hard to learn, having it transliterated more phonetically to the user is more beneficial than trying to render each letter verbatim.
Our current handling of Korean, Arabic, Persian, Thai and Khmer is much closer to phonetic pronunciation than to spellings.
@Urszag: Regarding Зеле́нський (Zelénsʹkyj), passport offices in Ukraine, Russia, many ex-USSR countries (sorry for grouping them, it's not a political statement) nowadays Anglicise surnames, so "Zelenskyy" is the spelling in Zelenskyy's passport, although it could also have been "Zelensky" (with one y). Anatoli T. (обсудить/вклад) 06:04, 12 September 2023 (UTC)
@Catonif I agree very much with User:Atitarev here. I think transliterating the Greek alphabet letter-for-letter is more or less useless as Greek isn't hard to learn to read and just rendering the letters doesn't contribute so much information, esp. compared with a semi-phonetic non-lossy rendering (which is what I advocate). BTW if we can't agree on anything I would suggest as a first pass that we just fix the issue of having d mean two different things by rendering dhelta (δ) as 'dh'. Benwing2 (talk) 07:08, 12 September 2023 (UTC)
I used to like the idea of having Ancient Greek romanizations match the pronunciation, but from what I've read, a module can't determine the correct pronunciation of all nasal-stop or nasal-fricative letter clusters based solely on the spelling, so I'm reluctantly in favor of a more strict transliteration that doesn't try to match the pronunciation. I made a weird script that modifies Modern Greek transliterations to be even more phonemic, but it fails because of δε (de) and ντε (nte) having the same transcription, which is fixable, and because μπ and ντ and γκ each could have two (voiced stop or nasal and voiced stop) or maybe even three pronunciations (nasal and voiceless stop in a loanword??), which can't be determined automatically. Another ambiguity is that the second γ in γγ can be a stop or fricative: fricative in συγγραφέας (syngraféas), stop in συγγενής (syngenís). Not sure if the same ambiguity applies to μβ and νδ. To solve these ambiguities, someone would have to find the words that the module fails to transcribe phonemically and correct them. In theory that could be done, but it's work that doesn't really help readers of the dictionary that much. — Eru·tuon 01:13, 12 September 2023 (UTC)
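
A rough way to see why the module cannot decide automatically is to count how many phonemic readings a spelling allows. The Python sketch below is not the script mentioned above; the `READINGS` table, the labels in it and the function `count_candidates` are hypothetical, and it makes the simplifying assumption that every μπ, ντ or γκ is two-way ambiguous regardless of position (the possible third, loanword reading is left out).

```python
# Hypothetical reading table: each digraph may be a plain voiced stop
# or a nasal followed by a voiced stop.
READINGS = {
    "μπ": ["b", "mb"],
    "ντ": ["d", "nd"],
    "γκ": ["g", "ng"],
}


def count_candidates(word: str) -> int:
    """Count the phonemic readings that spelling alone leaves open."""
    count = 1
    i = 0
    while i < len(word):
        digraph = word[i:i + 2]
        if digraph in READINGS:
            # Each ambiguous digraph multiplies the number of candidates.
            count *= len(READINGS[digraph])
            i += 2
        else:
            i += 1
    return count


print(count_candidates("μπαμπάς"))
print(count_candidates("δε"))
```

Under these assumptions μπαμπάς has four candidate readings and δε has one, so every entry with such a digraph would need a human to pick the right reading.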
I am sorry to have to write TL:DR, well not all of it anyway. I must leave it to others to decide.   — Saltmarsh🢃 11:19, 12 September 2023 (UTC)
  • (Building on Catonif's point) In the past, I've been sceptical of the idea of having two transliteration + transcription parameters (or at least outputting two values for tr=), but I've increasingly come around to the idea, because it's clear that both "transliterate the letters into Latin letters" and "provide an enPR-esque pronunciation as the transliteration" are things that people want. And I can see why each 'side' thinks the other idea is silly ("if you want to reproduce the distinctions present in the original letters, why not just learn the original script? why have all the distinct letters in two places, once in the original script and then again in the transliteration?" / "well, if you want to provide a pronunciation, why not just provide a pronunciation, in IPA or another system? why have the pronunciation in two places, once in IPA / in the pronunciation section, and once in whatever idiosyncratic respelling we come up with?"), but in fact maybe we should just have both...!
    (Not strictly on topic, but I have occasionally had cause to mention a word's pronunciation in e.g. some other word's etymology section, when the precise pronunciation is relevant and isn't intelligibly reflected in the transliteration/transcription. So I do see the utility to providing IPA pronunciation information in places other than just the word's own pronunciation section!) - -sche (discuss) 03:27, 13 September 2023 (UTC)
    @-sche Yes, this might successfully "split the difference". We are running into similar issues with Urdu (and Panjabi, etc.) in terms of whether to transliterate all the different Arabic letters differently even though several of them have the same pronunciations. This issue has also come up in Hebrew (Biblical Hebrew vs. Modern Hebrew, which use the same entries; current practice is to use the Modern Hebrew pronunciation but this is unsatisfactory for terms that are either primarily Biblical or equally Biblical and Modern), Persian (Iranian Persian vs. Dari/Classical), Japanese (do we present both hiragana and Latin transliterations?), etc. Given all the tweaking currently being done to core language modules it seems like we could definitely implement this. We'd need a syntax to allow for inputting multiple manual translits and specifying the system of each, but this should not be too hard to create. It could be argued that we should use the ts= field for transcriptions and the tr= field for transliterations but this doesn't seem a general solution esp. since some of the cases of multiple translit cannot reasonably be subdivided into transliteration vs. transcription (like the different Persian translits). Benwing2 (talk) 03:42, 13 September 2023 (UTC)
    @-sche, @Benwing2: The idea to transliterate different letters with the same sounds differently keeps coming up but the support for that idea is never strong. E.g. Persian, Urdu letter ط (t) has the same reading as ت (t), even though they have different values in Arabic. There are too many such cases. It's more problematic when for both ντ (nt) and δ (d) we use "d" (different sounds) but using "nt" for ντ (nt) when it's pronounced /d/ seems silly.
    Look, hiragana character は (ha) ("ha", not "wa") is transliterated as "wa" when it is a grammatical particle to match the pronunciation (this is standard in most Japanese transliterations): アテネはギリシャの首都(しゅと)である (Atene wa Girisha no shuto dearu, Athens is the capital of Greece)
    (I use it not necessarily to convince you but others who still doubt that a phonetic transliteration is common and good.) Anatoli T. (обсудить/вклад) 04:55, 13 September 2023 (UTC)
    Following -sche, to further schematise the two views I now try to summarise the arguments in favour of a transcription that have been proposed, followed by their counterarguments.
    1. The Greek alphabet is easy to learn, so a transliteration proper (that is, not a transcription) is close to useless.
      1. If we think so, we should get rid of Greek's |tr= altogether. This is not a point in favour of having a transcription. Has nothing to do with having a transcription, unless we're saying "well, we have a |tr= parameter, might as well not leave it empty".
      2. It is certainly harder to learn the entire alphabet rather than just learning that ⟨nt⟩ can stand for /d/. The #1 way to learn a script is indeed much exposure to the script accompanied by a loyal transliteration next to it. If we alter the transliteration too much from the original characters' disposition we would actually likely be confusing script-learners.
    2. Having ⟨nt⟩ stand for /d/ is not at all how it's pronounced and silly.
      1. This is making it seem like having ⟨nt⟩ stand for /d/ is some misleading and obscure dark magic. It's a straightforward orthographical rule. Finding out ⟨nt⟩ can stand for /d/ is not any harder than ⟨ai⟩ standing for /e/, ⟨ou⟩ for /u/, ⟨tz⟩ for /dz/ or ⟨ll⟩ for /l/. It is also not any harder to learn than ⟨vodka⟩ standing for /-tk-/, or ⟨yu⟩ for /y/.
    3. But compare Russian, Persian, Urdu, Japanese, etc.
      1. For Russian, my opinion is of course to be taken with a grain of salt, as I still don't know how regularly and how irregularly these ⟨g/v⟩, ⟨č/š⟩ and ⟨e/ɛ⟩ ambiguities can occur, nor do I know how common this is in widespread transliteration (or in this case, transcription) systems, but from what I've been able to understand, I find the choice of transcribing them according to their pronunciation disagreeable, and not some ipse dixit we should hold as an example.
      2. For Persian and Urdu, I'm not claiming transliterations cannot be lossy in favour of phoneticity, but rather it shouldn't be "gain-y" in favour of phoneticity.
      3. For Japanese, note how in the very example that was given, there was no manual respelling required, as the transliteration module can automatically recognise は alone as the topic particle. It's also how it is transliterated in most transliteration systems, as Anatoli noted. The practice is also standardised in w:ISO 3602, the Kunrei-shiki romanisation, and although we chose to overall adopt Hepburn romanisation, absent from ISO, due to the latter being exceedingly more used, it must be noted this practice taken singularly has indeed been ISOed. We must also note how this is something that affects only one word, rather than one phoneme being used in countless words.
    (To reduce confusion I clarify that I've called "transcription" what has been called by others more or less a "more phonetic transliteration" since medial μπ, ντ and γκ would need to be transcribed manually.) Catonif (talk) 21:32, 17 September 2023 (UTC)

About the Tupi-Guarani family

The family tree used in Wiktionary puts Nheengatu directly under Proto-Tupi-Guarani, thus making it a sister language of Old Tupi, which is not the case. Old Tupi evolved into the Amazonian General Language and this gave rise to modern Nheengatu. Neither the Amazonian nor the Southern General Language has an ISO code, so we would have to make some up if we want to add them. These languages are also needed for the Etymology section of some Portuguese words and for the inh+ template to work in Nheengatu.

Also, recent studies have added an intermediary proto-language between Proto-Tupi and Proto-Tupi-Guarani, the chain now being Proto-Tupi > Proto-Mawé-Guaraní > Proto-Tupi-Guarani; Sateré-Mawé and Awetí are considered cognates with Proto-Tupi-Guarani under this. If it were added, I suggest its code be mav-gua-pro to follow the standard.

Trooper57 (talk) 06:50, 8 September 2023 (UTC)

@-sche Any ideas here? I don't have the background to comment. Benwing2 (talk) 21:51, 8 September 2023 (UTC)
Reclassifying Nheengatu as a descendant of Old Tupi is easy enough to do, independent of changing or adding any other codes (I have changed it now). Regarding the General Languages, how different are those from each other and from "tpw" Old Tupi and from "tpn"? This affects whether they need to be completely separate languages with their own ==Headers==, or whether they just need "etymology-only" codes for 'stages' of Old Tupi and/or tpn, so etymologies can specify that they were borrowed from that stage. With Proto-languages, are they so different that we need to be reconstructing them all; are there works reconstructing lots of words in each stage? Or can we get by with saying various Tupi words derive from Proto-Tupi? Our codes are based on ISO codes, which "mav" doesn't seem to be, so the codes should probably start with "tup", the Tupian family code. - -sche (discuss) 23:02, 8 September 2023 (UTC)
"tpn" is labelled as "Tupinambá", so I guess it refers to the dialect of Old Tupi spoken by the Tupinambá people, rather than General Language. If so, "tpw" would be Old Tupi in general, like the difference between "pt" and "pt-BR".
The General Languages originated from two different dialects of Old Tupi: Amazonian (or Northern) from Tupinambá and Southern from the dialect of the São Vicente captaincy (present-day São Paulo). As for the Old Tupi dialects, the syntax was the same; the main differences were the pronunciation and some different nouns (like sea urchin, which was "pindá" in tpw and "pinda'yba" in tpn), which gave rise to slightly different lexicons in the two GL. Compared to Old Tupi, the syntax of the GL was reshaped and became closer to Portuguese, with verbs coming right after the pronoun rather than at the end of the sentence, and sounds were simplified, with loss of ɨ and ʔ and addition of vowels to consonant-final words. Not to mention the borrowings from Portuguese. Maybe we could put both under a "Língua Geral" or "Brazilian General Language" header and label words from the Southern dialect with {lb}, since it became extinct fairly quickly and is much less documented than Northern even nowadays; it also didn't evolve into any new language, unlike the other.
As of now, I've only found Nikulin's work regarding Proto-Mawé-Guaraní reconstructions, with about 107 words. It's not a big deal really, we could stay with PT and PTG or "ultimately from PT".
Summarizing:
  • Old Tupi (tpw): language spoken by Brazilian indians in 16th c. and before; had many dialects, with two known for certain; extinct;
  • Tupinambá (tpn): a dialect of Old Tupi spoken in most of Brazil; extinct > evolved into Northern GL in the 17th c.;
  • São Vicente's Tupi (no code): a dialect of Old Tupi spoken in present-day São Paulo; extinct > evolved into Southern GL around the 17th c.;
  • Amazonian/Northern General Language (no code): evolution of Tupinambá with Portuguese influence; extinct > evolved into Nheengatu in the 19th c.;
  • Paulista/Southern General Language (no code): evolution of São Vicente's Tupi with Portuguese influence; extinct in the 20th c. with no descendants.
(in Portuguese, it's somewhat difficult to find references about this in other languages)
(in English)
Also, there's a recent effort to expand Língua tupi and related pages in pt.wikipedia, namely by @Bageense. There are also others like @NoKiAthami in English Wikimedia interested in the topic.
Trooper57 (talk) 04:33, 9 September 2023 (UTC)
@Trooper57 @-sche I would prefer if we create a language code for Língua Geral that it be named using the Portuguese term rather than English "General Language", which seems not to be used except as a gloss of the Portuguese term. Benwing2 (talk) 04:47, 9 September 2023 (UTC)
Yeah I wasn't sure of the name in English. Then "Língua Geral Setentrional" for Northern and "Língua Geral Meridional" for Southern if we want to be more specific. Trooper57 (talk) 04:53, 9 September 2023 (UTC)
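For readers trying to follow the proposed reclassification, here is a minimal sketch of the lineage as described in this thread, written as a plain child-to-parent mapping. It is only an illustration in Python: the dictionary and the `ancestors` helper are hypothetical and do not correspond to Wiktionary's actual language data modules, and language names are used as keys because several of these stages have no code yet.

```python
# Hypothetical child -> parent mapping, following the summary given above in
# this thread. This is NOT how Wiktionary's language data is actually stored.
PARENT = {
    "Proto-Mawé-Guaraní": "Proto-Tupi",
    "Proto-Tupi-Guarani": "Proto-Mawé-Guaraní",
    "Old Tupi": "Proto-Tupi-Guarani",
    "Tupinambá": "Old Tupi",
    "São Vicente's Tupi": "Old Tupi",
    "Língua Geral Setentrional": "Tupinambá",
    "Língua Geral Meridional": "São Vicente's Tupi",
    "Nheengatu": "Língua Geral Setentrional",
}


def ancestors(lang):
    """Return the chain of ancestors of `lang`, nearest first."""
    chain = []
    while lang in PARENT:
        lang = PARENT[lang]
        chain.append(lang)
    return chain


if __name__ == "__main__":
    # Nheengatu descends from Old Tupi rather than being its sister.
    print(" < ".join(["Nheengatu"] + ancestors("Nheengatu")))
```

Under this model an inheritance check for Nheengatu would accept Old Tupi as an ancestor, which is the behaviour the original post asks for; whether the intermediate stages become full languages or etymology-only codes is left open, as in the discussion.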

鸭绿 Template interaction between zh-see and place leads to unsightly result

鸭绿 Template interaction between zh-see and place leads to unsightly result- "(“

You posted about this already at Wiktionary:Grease_pit#Problem_in_zh-see. It'd be nice to see more response, yes, but it'd also be nice to keep the discussion in one place. I replied to you there to hopefully get more discussion going. Soap 14:20, 8 September 2023 (UTC)

"rare" and "uncommon" once again

These are not currently defined in a particularly helpful way ("rare" at the glossary just says not used commonly, and "uncommon" says not common but more common than "rare"). There doesn't seem to be much rhyme or reason to how they're applied in practice: on a few occasions I've even seen them apparently being used to smuggle in prescriptive opinions, i.e. slapping "rare" labels on senses someone doesn't like or isn't personally familiar with.

I doubt a specific quantitative threshold will be useful cross-linguistically, but it would be good to work out a definition that can be applied more consistently by editors than the current ones. My personal rule of thumb has been that a word is "rare" if it takes more than minimal effort to find adequate attestation, and "uncommon" if it's not that difficult to find but a reader (even a specialist) is very unlikely to encounter it organically. This issue has been discussed before, most recently I think here in Jan. 2022, but it hasn't come to a conclusion. —Al-Muqanna المقنع (talk) 18:46, 8 September 2023 (UTC)

@Al-Muqanna Thanks for bringing this up. Last time this came up I think I advocated for merging the two on the basis that they weren't and couldn't be distinguished consistently. I think your distinction makes sense although it definitely requires human judgment, and I have my doubts about whether people will actually follow it. Benwing2 (talk) 21:42, 8 September 2023 (UTC)
I thought some linguists have ordinal scales for how humans interpret such words and phrases. I'm pretty sure that we can help users more by maintaining some kind of distinction. "Rare" certainly means less common than "uncommon". Neither is normally applied to words that are not considered principally to be used in some particular register or usage context. I expect that most normal users would use our definition to decode whatever passage they found the term in and forget it whether the label read "rare" or "uncommon". Some users (writers, mostly, in my vision, like Thomas Pynchon) might differentiate and decide some uncommon words they happened to like were worth deploying. Or are these tags just useful for us? Maybe they are just our way of expressing displeasure at a word that required more effort on our part than seemed worthwhile? I certainly need some similar motivation to bother with that kind of label. DCDuring (talk) 00:11, 9 September 2023 (UTC)
@DCDuring For foreign languages, rare and uncommon are very useful in indicating senses that don't occur much; otherwise, the more common senses get overwhelmed by the multitude of uncommon ones. For English specifically, with the target being native English speakers, the target users already have some sense of whether a definition is rare/uncommon because in that case they won't know it; but even then I think the tag can help people (both native English speakers and L2 speakers) trying to write good English, to know whether they have to be careful with using a particular sense. Think of ludicrous machine-translated Chinglish where the machine used obsolete or rare senses of English words, or Kim Jong-un trying to insult Trump and coming up with "dotard"; in both cases the dictionaries clearly didn't do a good job of identifying the terms or senses as rare/uncommon. (For that matter, we don't label "dotard" at all; the only hint that it is dated or rare is that all cites, except the one from Kim Jong-un, are <= 1867.) Benwing2 (talk) 00:31, 9 September 2023 (UTC)
Yes, I am mostly interested in English and don't expect to impose what we should IMHO do for English on any other languages.
I'm certainly not objecting to having them both, but I doubt that there is any point in quantifying. We usually stop at three to five cites per definition. The tag "rare" sometimes results from "It took a lot of searches and corpora for me to get even three good cites". "Uncommon" often has an element of 'low frequency relative to synonyms'. If no learner dictionaries have it and some 'unabridged' dictionaries don't either, that's also support for some kind of tag. In the absence of something better, that can be 'uncommon' or, if only the OED has it (or not even the OED), 'rare'. DCDuring (talk) 01:33, 9 September 2023 (UTC)
@DCDuring: I think your description here is basically comparable to my personal understanding of the terms, i.e. that "rare" means hard to find even when you set out to find it and "uncommon" just means notably infrequent but not necessarily hard to find. —Al-Muqanna المقنع (talk) 12:27, 9 September 2023 (UTC)
  • How about we decide on an arbitrary small number, like 7, and say that anything with less than 7 hits is classed as rare. It would be an awesome vote to decide what the threshold would be, the 7-brigaders battling against the 6-brigaders, I can't wait! Jewle V (talk) 00:19, 9 September 2023 (UTC)
    Lol, this is what I mean about trying to set fixed thresholds probably not being very useful. —Al-Muqanna المقنع (talk) 11:22, 9 September 2023 (UTC)
  • These labels stack with others. If a term is slang and rare, it will be extra-hard to find and unlikely to be “organically” encountered, whatever that means – guess it means doomscrolling arguments on the internet again which you previously decided not to rebrowse. Then the rule of thumb is applied in a hypothetical, extrapolated fashion—typical Belizean English is what I would not find with a certain likelihood if living in Belize or consuming Belizean sources which aren’t necessarily preferred by search engines to open us the windows into other worlds. There is reason but no rhyme. Fay Freak (talk) 01:06, 9 September 2023 (UTC)
    "Organically" means encountered in the ordinary course of one's linguistic experience rather than searching for the term so you can find attestations for Wiktionary. My qualification on "even a specialist" (perhaps better put as even a specialist or member of the relevant community) was intended to encompass slang and jargon, i.e. an Internet slang term will certainly be uncommon for people who don't use the Internet much but should only be marked "uncommon" if someone who does is notably unlikely to ever encounter it. —Al-Muqanna المقنع (talk)
I would urge studying the theory of how other dictionaries use "rare/uncommon" and then comparing words considered rare/uncommon elsewhere to those Wiktionary has or has not labeled rare/uncommon. --Geographyinitiative (talk) 11:45, 9 September 2023 (UTC)
English dictionaries don't generally use an "uncommon" label, afaik, they just have "rare" if they bother at all (Merriam-Webster doesn't as far as I can tell). The OED does have technical frequency bands based on occurrences per million words (see here) but they don't appear to apply the label "rare" based on these bands, e.g. they list gurhofite as an example of the least common band 1 but it is not marked rare in the entry, perhaps because it's mineralogy jargon and so expected to be low-frequency; grithbreach, another example they give, is not marked rare either, just obsolete or historical. abaxile, on the other hand, is indeed marked "now rare", despite the relevant sense being botany jargon. —Al-Muqanna المقنع (talk) 12:16, 9 September 2023 (UTC)
So what I take from this discussion is that we do it right, better than anyone else, and you just want to redefine the Appendix:Glossary definitions, to match the actual usage, and a definition at all … fine. Also fine if you weren’t sure at first what you want and everyone had to make a point on it. Fay Freak (talk) 12:30, 9 September 2023 (UTC)
@Fay Freak: The point is to settle on more helpful glossary definitions which can be applied consistently, yeah. I don't think we necessarily do it right since it's applied inconsistently in practice, but if there's already an understanding shared by most established editors then we should update the glossary and it'll be easier for readers to understand and other contributors to follow it consistently. —Al-Muqanna المقنع (talk) 12:35, 9 September 2023 (UTC)
I would perhaps suggest that the use of "rare/uncommon" may be a kind of defect if the other dictionaries do not use "rare/uncommon", and that some better method of quantifying relative usage may be in order, perhaps piggy backing on OED or Google NGrams. I come from the perspective of seeing Wiktionary as in a "primitive" stage of development. --Geographyinitiative (talk) 13:17, 9 September 2023 (UTC)
@Geographyinitiative I disagree, rather I honestly think Wiktionary often does a much better job in deploying specific judgements about the nature of the words it covers. If other dictionaries don't mark a word as "rare", when in fact it really is rare, then I think we are legitimately being more precise about it. I don't know how this translates into editors' practice, but principally distinguishing rare and uncommon entries from the general vocabulary is helpful in my opinion, and stops you from sounding weird if you try to use a word somewhere in the 100,000s place in the frequency list. As long as our internal definitions make sense, and the glossary helps readers to understand them, then surely all we need to do is realize that in practice. Kiril kovachev (talkcontribs) 13:32, 9 September 2023 (UTC)
I would like to know: what does Wiktionary's "rare/uncommon" terms match to in OED's technical frequency bands? If there's a correlation between Wiktionary's existing designations and OED's technical frequency bands, that could be something worthwhile to add to an updated definition of "rare/uncommon". Or, perhaps "rare/uncommon" is not connected to actual frequency and is merely a judgment compared to similar words. If the designation is subjective rather than "corpus frequency", that might be something to mention in a revised definition. The current definitions of "rare/uncommon" could mean literal "corpus frequency"; a revised definition might try to say "relative to related terms" or something like that. But there has got to be a reason why the other dictionaries may not delve into using these words- what is that reason? Wiktionary has the capability to be superior, but is it actually superior? -Geographyinitiative (talk) 13:39, 9 September 2023 (UTC) (modified)
We don't define them quantitatively so they don't correspond to anything. Similarly the OED's "rare" label also doesn't in practice correspond to any of its frequency bands. To be useful, they need to be somewhat subjective: any kind of jargon is going to be uncommon in absolute terms compared to non-jargon, but if you add a label like "(sociology, uncommon)" then you're implying it's uncommon even within sociology. Adding "uncommon" simply because its absolute frequency in the entire language is low is probably not helpful to readers, and we don't in general do that anyway. —Al-Muqanna المقنع (talk) 14:12, 9 September 2023 (UTC)
This is beginning to sound both coherent and realistic and approaching consensus! DCDuring (talk) 14:47, 9 September 2023 (UTC)
I don't create a lot of entries, but I also label words "rare" if it is difficult to find cites (e.g. clattawa, which includes all the citations I was able to find online, with a certain amount of effort, over a period of several years). I haven't used "uncommon" much, if at all, but I have understood it in the way described by other editors here: unlikely to be encountered, but not hard to cite. FWIW, I have also used "very rare" for words or spellings that are just barely citeable. I'm not sure if that's something we want to be doing or not. Andrew Sheedy (talk) 16:27, 9 September 2023 (UTC)
That's more or less my take. I've only used "very rare" myself at abstringe, where it seems possible that the 3 citations (if accepted) are literally the only instances it's ever been used independently. —Al-Muqanna المقنع (talk) 17:27, 9 September 2023 (UTC)
I probably would label as rare a French word that may look like 'the' translation of a common English word, but isn't actually used nearly as much as its English counterpart. Don't have any example at hand right now though.
Another case would perhaps be a word that is formed according to regular derivational processes but isn't used as much as might be expected. For example, déconnade and déconnement both seem to me to be perfectly regular and "reasonable" derivations from déconner (a very common verb), but for some reason only the first one sounds normal to me. I've absolutely never heard or read déconnement before: I was looking in my head for an example of what I meant and I thought of this one. It happens to be attested, but it still feels like a weird/rare word.
In any case, I agree this is all very fuzzy and the label is almost meaningless at the moment. If we want it to be useful, we should get into the habit of providing meaningful comparanda (i.e. "rare compared to what?"). PUC 16:38, 9 September 2023 (UTC)
Example at hand: predator, predatory, sexual predator and so on, which don’t exist as set terms in other languages. But the translations get specific glosses. Fay Freak (talk) 17:18, 9 September 2023 (UTC)

Granularity of reading types in {{ja-kanjitab}}

How precise do we want to indicate the readings used in {{ja-kanjitab}}? For example, in this here edit on 女体 (にょたい), I indicated the reading of the kanji as being go-on, which I would say is an improvement in this case, because the whole compound is made of the go-on readings. However, what about the other kanji tab on this page for じょたい? Do we want to specifically notate this as kan'on + go-on? At this time, it's being shown just as "on'yomi", which is indeed helpful, but I'd like to know whether we should do our best to convert these to be as precise as possible. In my opinion, this would be constructive for readers as it would show the specific types of kanji readings that are used throughout the Japanese vocabulary, but that's just my opinion. Kiril kovachev (talkcontribs) 13:24, 9 September 2023 (UTC)

In general, I try to be as specific as possible in {{ja-kanjitab}} reading types.
I confess that it has bothered me for some time that we have no way of clearly specifying when a kanji reading is both kan'on and goon; we are left with just on for generic on'yomi, which is also what gets used when an editor hasn't taken the time to look up reading types (more common for older entries), or for when the resources we have to hand don't specify. I dislike the ambiguity.
Back to your specifics, yes, for the じょたい (jotai) reading on the 女体 page, I would specify {{ja-kanjitab|じょ|たい|yomi=kanon,goon}}. ‑‑ Eiríkr Útlendi │Tala við mig 17:07, 11 September 2023 (UTC)
@Eirikr I see, then I think our ideas are aligned in this regard. I would also like to accurately reflect the reading types where possible, but like you say I don't know what to put when the kan'on and go-on readings are the same. At the end of the day if it just says "on", then this could most likely only refer to kanon or goon anyway, since any other on reading type is very rare; but this isn't very explicit or clear to readers of the kanji tab at a glance, and is of less educational value too. I wonder if there isn't a solution to this, such as a "compound" reading type specifying kanon/goon? But this would require changes to the kanjitab template.
Thanks for your feedback, anyway – I now updated that particular page, and I shall keep on doing this for any entries I create in the future. Kiril kovachev (talkcontribs) 19:17, 11 September 2023 (UTC)

Too much concealment of quotations and synonyms in short English L2s for multiword expressions

At an entry like where it's at, we have hardly any content visible to a user. (There was less before the See also was added.) I think this makes our entry look pathetic for no good reason, as there is content. Do others agree?

If so, can our talented technology mavens work their magic to make sure that collapsible quotations and synonyms don't make such an L2 shorter than, say, 20 or 25 lines when there is concealed content? DCDuring (talk) 23:27, 9 September 2023 (UTC)

really long single-kanji readings

@Kiril kovachev Your bot changes to remove redundant translits have resulted in thousands of new categories getting created, often with long readings for single kanji. For example, 顣 lists a kun reading まゆをひそめる = mayuwohisomeru, which duly results in Category:Japanese kanji with kun reading まゆをひそめる and its parent Category:Japanese kanji read as まゆをひそめる. Is this a real reading? The Hiragana entry for まゆをひそめる soft-redirects to 眉を顰める; this entire phrase is pronounced mayu o hisomeru, and the Kanji that is claimed to have the reading mayuwohisomeru is listed here with reading hiso, which seems to make more sense, although I don't know a lot about Japanese, so maybe this extremely long reading is kosher. Can someone who speaks and reads Japanese tell me whether this reading and the others listed in Special:WantedCategories from position 752 through 1829 or so (which include katakana readings towards the end) are real or spurious? Benwing2 (talk) 08:12, 10 September 2023 (UTC)

@Benwing2 Hello, thanks for bringing this up, and in truth I was going to post an extended post about this change and how to clean up kanji entries in general, but let's address this first. I apologize very much if this has lacked some degree of foresight, as personally I didn't know this would flood WantedCategories with all this drivel; but ultimately we already had these readings sitting there latent, and the only reason these categories haven't surfaced till now is the syntax previously suppressing the category generation. May I ask, is the creation of categories an automatic process? I was aware that many kanji had had a category generated as a result of this, but I didn't know category pages are also being created in such amounts because of that.
As for the readings, I am extremely dubious that most of these kun readings are real whatsoever; in fact they strike me as simply an explanation of the character's meaning as opposed to a "reading": no one is likely going to see this kanji 顣 and read it as an entire clause. There are loads of these instances in our coverage, which I would suggest are mostly mistaken. The general limit on common kanji readings' length is 5 kana, with the most notable readings of this kind being 承(うけたまわ)る (uketamawaru) and 志(こころざし) (kokorozashi). As a rule of thumb, therefore, there should be exceedingly few valid kun readings longer than 5 kana. Those are probably just phrases.
Additionally about the kanji in that list:
  1. I believe all the ones that end in "さま" are just meanings, not readings, so those are not legit. Same with most ending in "する".
  2. The ones that contain a を are entire phrases, so also probably not valid. Alternatively, they're spellings in the old orthography, so still wrong, but maybe fixable.
  3. This was a consideration for kanji reading cleanup as a whole: a great number of kanji readings in that list are for verbs (so they end in one of the kanas ending in "u", うくすむぬつふる), but they have no okurigana suffix to show which part of the verb the kanji corresponds to (the rest being written in kana), e.g. "盈" supposedly has reading (みたす) (mitasu), but the usual spelling of this verb is 満たす, suggesting that 盈's reading should actually be み-たす. Similarly, many of the readings end in い, of which a lot are i-adjectives. For those that are, they should never contain the い as part of the whole reading, and the end of the reading should fall somewhere before the い, e.g. (もろい) (moroi) should be もろ-い.
    Because some dictionaries don't show the okurigana placement, this can make it difficult to tell exactly which parts the kanji are meant to represent. Sometimes, it's not the same portion of the underlying word as the primary spelling, either. This means to me that we may have to manually decide (a) whether some of these alternative readings (for the valid, short-ish words) are well-attested to begin with, and (b) what parts are used for the kanji reading and what parts are written in kana.
  4. The katakana ones are dubious to me as well, but this ventures into the territory of archaic and esoteric, which I'm not familiar with: it's possible that some of these were used in the past, e.g. 銥 may possibly have been in use for イリジウム (irijiumu), but I've never seen it myself. However, 粁(キロメートル) (kiromētoru) might be okay, along with related terms, because 米(メートル) (mētoru) is occasionally used.
  5. There are some old-orthography readings that have slipped through the cracks as well, e.g. (はうむる) (haumuru), which should be ほうむる<はうむる (with a - somewhere, requiring some research); some patterns can be filtered, e.g. を, ゑ, ゐ, はう, かう, まう, etc., but again in need of investigation.
If this is too much to handle at once, perhaps I can undo the syntax removal? That would temporarily remove these categories and allow us time to fix the readings in general, if you would prefer that. Otherwise, I do feel this will still require lots of manual checking, even after we're able to prune the obviously dubious readings. I would of course be happy to take responsibility and begin to check those in the list in order.
Hope this was of some help, Kiril kovachev (talkcontribs) 11:09, 10 September 2023 (UTC)
Also, I suggest a good litmus test for whether something is really a (common) kun'yomi: check on https://kanji.jitenon.jp/. It doesn't always have all the readings we do, but it does specifically delineate between 意味 (meaning) and 訓読み (kun'yomi), so if some supposed kun'yomi is listed on that site as a "meaning", as opposed to a kun'yomi, then it's probably invalid. E.g. 人 is given the kun'yomi ひと, whereas 塴 simply 'means' ほうむる. Kiril kovachev (talkcontribs) 12:27, 10 September 2023 (UTC)
@Kiril kovachev The creation of categories in Special:WantedCategories is something I manually run a script to do, usually every 3 days, since that's the frequency at which Special:WantedCategories is refreshed. I usually check the categories when I run the script, which is how I caught these cases. If I catch the issue before the script runs, I can tell it to filter out the bad categories but in this case I didn't notice until the script had almost finished the Japanese categories, so I let it run. Once we've fixed everything up, the bad categories will end up in Category:Empty categories and we can delete them. I understand that the problem was already there before you ran your script; I wouldn't recommend undoing what you've done because the categories won't disappear (only become empty), and the problem will still be there. I don't know that much about Japanese so you'll have to help delete the bad readings, but I can help as much as possible, e.g. I can make a list of the potentially bad readings and you can manually filter out the good ones, leaving the bad ones to be removed by bot. Benwing2 (talk) 19:18, 10 September 2023 (UTC)
@Benwing2 Okay, that's fine. I'll engage myself with fixing the readings as much as possible until the categories can be cleared. The problems with some of the readings, such as the okurigana placement I mentioned above, are not so clear-cut, but we can get rid of all the obviously shoddy ones. Could I enlist your help to filter out some of those? Could you please create a listing of all the ones that:
  1. Contain を, ゑ, ゐ, or any of followed by う (first 3 indicate archaic characters, or inappropriate reading in the case of を final is one of a few common old spellings); or
  2. Are longer than 5 characters. There are some 6-long out there, but the majority should be false readings; or
  3. End in さま?
The katakana ones might well be valid so I'll check all of those manually, but I believe these three points make for unlikely readings. Also, if you're quite busy, please let me know and I'll generate them myself, so you don't need to waste your time. Thanks for your help, Kiril kovachev (talkcontribs) 20:38, 10 September 2023 (UTC)
@Kiril kovachev Sure, will do. Benwing2 (talk) 20:47, 10 September 2023 (UTC)
@Kiril kovachev: I notice a bunch of categories like Category:Japanese kanji with kun reading あお-ぐ that have a hyphen in them (this category has 5 chars in it). Should they exist or should we be putting these characters in Category:Japanese kanji with kun reading あお, chopping things off at the hyphen? Benwing2 (talk) 22:29, 10 September 2023 (UTC)
@Benwing2 I'm pretty sure these are correct. This is the standard practice as far as I can tell. If anything, the ones with no - where there should be one should be changed, which I'll be looking to do as part of my sweep. I think you should ignore the - as part of the character count anyway. Kiril kovachev (talkcontribs) 22:35, 10 September 2023 (UTC)
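The filter requested above could be sketched roughly as follows. This is only an illustration, not the script that was actually run: the function name `is_suspect_reading` and the `max_len` parameter are hypothetical, the hyphen is dropped before counting as suggested, and the part of criterion 1 about kana "followed by う" is left out because the list as posted does not say which kana are meant.

```python
# Hypothetical helper: flags a kun reading as "unlikely" under the criteria
# discussed above. The names here are illustrative, not from any real script.
ARCHAIC_KANA = set("をゑゐ")  # criterion 1 (the う-related part is omitted)


def is_suspect_reading(reading, max_len=5):
    """Return True if the reading matches one of the three criteria."""
    kana = reading.replace("-", "")  # the okurigana hyphen is not counted
    if any(ch in ARCHAIC_KANA for ch in kana):
        return True  # criterion 1: contains を, ゑ or ゐ
    if len(kana) > max_len:
        return True  # criterion 2: longer than 5 characters
    return kana.endswith("さま")  # criterion 3: ends in さま


if __name__ == "__main__":
    for r in ["あお-ぐ", "ひと", "まゆをひそめる"]:
        print(r, is_suspect_reading(r))
```

A reading flagged this way is only a candidate for manual checking, since some longer readings are valid.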
@Kiril kovachev: I *think* this is the full list meeting the above criteria (it's 382 categories): User:Benwing2/bad-japanese-reading-cats Benwing2 (talk) 00:48, 11 September 2023 (UTC)
@Benwing2 Thanks! I'm now going to start checking, I'll let you know my findings. Kiril kovachev (talkcontribs) 10:13, 11 September 2023 (UTC)
@Kiril kovachev I also ran it while checking for categories with 5+ chars. The result is here: User:Benwing2/bad-japanese-reading-cats-5-or-more. There are 833 lines here. Benwing2 (talk) 10:31, 11 September 2023 (UTC)
@Benwing2 Thanks for this again. Kiril kovachev (talkcontribs) 12:05, 11 September 2023 (UTC)
Thanks for generating this. Since we had issues last month(s) with the monthly subpages not showing up on the main WT:BP page because they were too large to transclude, might I suggest moving this list to a sandbox/userspace page and just linking to it, lest we (now at 229k bytes with two-thirds of the month left to go) get too large for WT:BP again?😅 - -sche (discuss) 03:18, 11 September 2023 (UTC)
@-sche Done. Benwing2 (talk) 03:53, 11 September 2023 (UTC)
Pinging the most-recently-active native Japanese speakers who also list English in their Babel boxes, User:MathXplore and Lugria. The question is whether (e.g.) 顣 by itself has a reading "まゆをひそめる", or whether it only has a reading of "hiso". (You might or might not also have an opinion about where to mention vs not mention ruby, discussed at Wiktionary:Beer_parlour/2023/September#Automatic_transliteration_of_katakana_and_hiragana.) Additional ping to User:Eirikr. - -sche (discuss) 16:14, 10 September 2023 (UTC)
@-sche I very highly doubt that まゆをひそめる is even possible as a reading, because I've never before seen a reading that doesn't spell out を as its own particle. As a whole を is spelled in kana 100% of the time from what I've personally seen, so seeing it hidden behind a kanji doesn't seem right at all to me. But I'll let those who know better weigh in as well. Kiril kovachev (talkcontribs) 20:40, 10 September 2023 (UTC)
I have the same doubt. MathXplore (talk) 07:45, 11 September 2023 (UTC)
According to this online kanji dictionary, the only given reading is "shuku". I also confirmed this at page 1564 of 新漢語林 第二版 from w:ja:大修館書店. No other readings were found. On the other hand, the online kanji dictionary that I used above explains the definition as "ひそめる", so this is likely related, but I cannot find "hiso" as its reading. I hope this can help you. MathXplore (talk) 07:54, 11 September 2023 (UTC)
@Benwing2, @MathXplore, @-sche:
In relation to this very partial investigation that I began just a few hours ago, something has already become clear to me, which is that many of our kanji readings pages were generated automatically a long time ago (2003) by User:NanshuBot, and then later reformatted to their current form. Anyway, the important thing to note is, they are all heaved over from KANJIDIC. What's more, these same reading sets that were present in the KANJIDIC data set back in 2003 have by now proliferated all over the internet, making for countless sites that appear to corroborate the original, apocryphal reading. These characters are so obscure, they don't fit into the usual kanji dictionary that I search (学研漢和大字典), and finding a legitimate, in-the-wild usage is virtually impossible. The pathetic 7 or so readings I was able to scrawl through were the product of over an hour of trawling over data and deciding whether it's a KANJIDIC derivative, and then in turn trying to figure out where that original KANJIDIC reading even came from, and whether it should be kept in the first place. Indeed, KANJIDIC for these rare readings also doesn't bother to place the okurigana location, which for our purposes means that it doesn't help in whittling down which categories should be kept or not.
We could reject readings that appear to be present only in KANJIDIC as a rule, but it could also be possible for some of them to be valid after all, so I don't know. I think I would like to email the KANJIDIC maintainers to ask about the sources of readings, if they are gathering them in some systematic way. Maybe I need to also consult Dai Kanwa Jiten for some of the most obscure readings as well. But checking through these all in this way will still be very cumbersome.
What do you all think we should do? Kiril kovachev (talkcontribs) 12:26, 11 September 2023 (UTC)
For starters, I think any kanji whose kun'yomi seems at all dubious and which is sourced to KANJIDIC should be added to some kind of maintenance category, flagged for further review, as it were. For instance, the character mentioned above appears in the Weblio aggregator site here, and it seems to indicate that this only appears in KANJIDIC -- no other resources or entries are listed.
I too have noticed this over time -- many obscure kanji characters are included in KANJIDIC and the Unihan database with Japanese readings, but digging further reveals that:
  1. The kun'yomi given in both resources is often a gloss, and not a valid Japonic reading.
  2. The character in question isn't even used in Japanese text at all, or if at all, only exceedingly rarely, and often in cases where the Japanese text is worded like "this is the character used in Chinese to spell the word iridium..."
I am not sure if KANJIDIC cribbed from Unihan, or the other way around. At any rate, both resources strike me as extremely dubious for rare Japanese kanji. Any of our entries sourced to these will need vetting. ‑‑ Eiríkr Útlendi │Tala við mig 17:19, 11 September 2023 (UTC)
@Kiril kovachev Thanks for the investigation! I would agree with User:Eirikr and add that anything dubious like this should be placed in some sort of "check" template that essentially quarantines the dubious reading, similar to {{t-check}}. Such template should not add the reading to any categories. Probably this can be done in an automated or semi-automated fashion. Benwing2 (talk) 18:51, 11 September 2023 (UTC)
@Eirikr Thanks very much for this input, I'm glad to know these aren't really used in Japanese, and we can take a slightly less difficult approach and flag things directly that we doubt.
@Benwing2 About the check template, do we want it to continue to display the reading to users, or just effectively comment it out, whilst still allowing it to be tracked? If we want to display it, I just wonder how we can do this while preserving the format of the readings template, since any text we try to input appears to mess with the reading itself.
If we don't want to show the reading, I guess just an empty template with no output would do? Just wrapping the reading in e.g. {{ja-reading-check}} would make it vanish but still track it.
Additionally, about the automation, I believe a good bit of work can be saved by scraping the various reading aggregator sites, ideally independent ones such as kanji.jitenon.jp (that aren't based on kanjidic nor on Unihan) to check whether the relevant readings are provided or not. Out of the many entries in the pile to check, it's quite likely that many can be flagged like this. Kiril kovachev (talkcontribs) 19:09, 11 September 2023 (UTC)
@Kiril kovachev If you look at {{t-check}}, it continues to display the translation but with a displayed indication that it needs checking, and it also adds the translation to a request-for-check category. This would be ideal, if it's not too much work to implement. As for the automation, if you can give me more detail on how to do the scraping and how to check the relevant readings, I can probably implement it. Benwing2 (talk) 20:03, 11 September 2023 (UTC)
BTW maybe instead of wrapping it in a separate template, we can add some flag to the readings template (whichever one that is) to indicate that a reading needs checking. Benwing2 (talk) 20:05, 11 September 2023 (UTC)
This could be as simple as an asterisk or other symbol before the disputed reading or whatever. Benwing2 (talk) 20:05, 11 September 2023 (UTC)
@Benwing2 That's a good idea, because I'm not sure how we would do it inline: each reading generates a link to itself, right? Therefore if you just insert a template in there, it will interact with the link; e.g. if you try to input <sup> tags, this triggers the < syntax of the template, which specifies older forms of the reading. If you input ordinary text, that in turn is considered part of the reading, and gets linked to, polluting the desired link. Maybe there's a more advanced solution, as I'm not all that experienced with templates, but actually your suggestion is great and I'd rather just put a little indicator to suggest some of its readings need checking, if that's possible. I guess it'd just need a small check on behalf of the readings template: I can try to add the asterisk option if you agree with that.
Re:scraping, I can try making that script myself, but I can cover the details if you'd like too. We need to get the data, probably from HTML since there aren't any good APIs*, from a few sites: (*that I know of, maybe there are?)
  1. kanji.jitenon.jp: get the content page via the URL https://kanji.jitenon.jp/cat/search.php?getdata=4eba&search=match&how=%E6%BC%A2%E5%AD%97 where in this case 4eba is the hexadecimal value of the Unicode code point value (for 人). Then we need to access the kun'yomi (we should focus our attention on checking kun'yomi, as these are the vast majority of the dubious readings) from the readings table on that page, e.g. ひと. This looks slightly complicated, so I'm not sure on the exact procedure yet, since the table doesn't have distinct IDs or classes for its elements, so some amount of decoding the thing might be required.
  2. weblio.jp: this site is one of the ones that is sometimes corrupted with the dubious readings. Its URLs are just https://www.weblio.jp/content/<kanji here>, so getting the content is quite easy. We can check whether these readings are present by scanning the page for text like [訓]ひと (the ひと could be any hiragana or katakana), a pattern which should hopefully work broadly over all possible kanji. But bear in mind that the readings are indicated in <b> tags, which in turn are nested inside <p> tags that contain the [訓] label in them. If this kun reading matches what we have in our coverage, but the same content is not reflected on kanji.jitenon.jp, it is likely that we can flag the value as dubious.
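As a rough sketch of the two lookups just described, the URLs can be built from the kanji, and the Weblio pattern can be matched once the tags are stripped. The function names are hypothetical, the HTML in the usage example is a made-up fragment rather than real Weblio markup, and the Jitenon table parsing is left out because its structure is not described here.

```python
import re
from urllib.parse import quote, urlencode


def jitenon_url(kanji):
    """Search URL for a single kanji, keyed by its hex Unicode code point."""
    params = {
        "getdata": format(ord(kanji), "x"),  # e.g. 人 -> 4eba
        "search": "match",
        "how": "漢字",  # percent-encoded by urlencode
    }
    return "https://kanji.jitenon.jp/cat/search.php?" + urlencode(params)


def weblio_url(kanji):
    """Content URL for a kanji on Weblio (percent-encoded form)."""
    return "https://www.weblio.jp/content/" + quote(kanji)


TAG = re.compile(r"<[^>]+>")
# [訓] followed by a run of hiragana/katakana (assumed label format)
KUN = re.compile(r"\[訓\]\s*([ぁ-ゖァ-ヺー]+)")


def weblio_kun_readings(html):
    """Strip tags, then collect the kana that directly follow a [訓] label."""
    return KUN.findall(TAG.sub("", html))


if __name__ == "__main__":
    print(jitenon_url("人"))
    # Made-up fragment mimicking the <p>[訓]<b>...</b></p> nesting described above
    print(weblio_kun_readings("<p>[訓]<b>ひと</b></p>"))
```

Stripping the tags first sidesteps the `<b>`-inside-`<p>` nesting, at the cost of ignoring where on the page a match occurs.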
We could check more sites than this, for instance the KANJIDIC database itself (via an API at http://nihongo.monash.edu/cgi-bin/wwwjdic?1B): if you do
r = requests.post("http://nihongo.monash.edu/cgi-bin/wwwjdic?1D", data={'kanjsel': 'X', 'ksrchkey': '<kanji goes here>', 'strcnt': ''}), that gets the HTML contents of the lookup page, and then the kun readings in <b> tags, immediately following the (in <font>) tags) would need to be read. But perhaps just one of this or Weblio would do.
The check at KANJIDIC would check that our obscure reading is probably from there, and the check at independent Jitenon would confirm whether that reading is actually widespread or not.
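The two-sided check could be expressed along these lines. This is a sketch under assumptions: the function names and the three labels are invented for illustration, the reading lists would come from whatever scraping step is used, and stripping "-" and "." so that okurigana markers do not block a match is my own normalisation choice.

```python
def normalize(reading):
    """Drop okurigana separators so あお-ぐ and あお.ぐ compare equal."""
    return reading.replace("-", "").replace(".", "")


def classify_reading(reading, kanjidic_readings, jitenon_readings):
    """Label one of our kun readings by where else it is found.

    'attested'   : listed as a kun'yomi on the independent site
    'dubious'    : found only in the KANJIDIC-derived data
    'unverified' : found in neither source
    """
    key = normalize(reading)
    if key in {normalize(r) for r in jitenon_readings}:
        return "attested"
    if key in {normalize(r) for r in kanjidic_readings}:
        return "dubious"
    return "unverified"


if __name__ == "__main__":
    # Made-up sample inputs, not real lookup results
    print(classify_reading("ひと", ["ひと"], ["ひと"]))
    print(classify_reading("ほうむる", ["ほうむる"], []))
```

Only the "dubious" bucket would be flagged automatically; "unverified" readings would still need a manual look.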
Maybe there's an easier way to access the content from Jitenon and Weblio, but idk.
As you can see it is quite a lengthy process, so if you want I can try to handle it. But anyway let me know if you want to do it yourself. Thanks for offering to help! Kiril kovachev (talkcontribs) 20:54, 11 September 2023 (UTC)
@Kiril kovachev Yes, maybe you should see if you can implement it since I'm not very experienced in scraping web sites and you probably have a better idea of what you're looking for. I can implement the changes needed for {{ja-readings}}. Benwing2 (talk) 21:36, 11 September 2023 (UTC)
@Benwing2 Alrighty, thanks for handling it. I'll figure out that business tomorrow if I can. Kiril kovachev (talkcontribs) 21:39, 11 September 2023 (UTC)
(Explanatory note for those who don't know Weblio) I would like to note that Weblio itself is not a dictionary but a search engine that finds things from many dictionaries and encyclopedias, including (but not limited to) Japanese Wikipedia/Wiktionary etc. In other words, this can be understood as an online dictionary mirroring website. If there is something wrong on Weblio, then we may need to check where such errors come from. MathXplore (talk) 06:46, 12 September 2023 (UTC)
I noticed that kanji.jitenon.jp is based on existing paper-based kanji dictionaries from major Japanese publishers (such as KADOKAWA etc.). Their references that they used are listed at . I wasn't able to find the list of editors from the company's website, but I think kanji.jitenon.jp is reasonable to be used for our upcoming checks. MathXplore (talk) 06:57, 12 September 2023 (UTC)
@MathXplore Right, I see, thanks for the correction, I think I'll just stick to directly tapping into KANJIDIC2 for the comparison, then. Also good to know about kanji.jitenon.jp — I noticed they were clearly different from the other net-based aggregators, so thanks for checking out the source. Kiril kovachev (talkcontribs) 09:22, 12 September 2023 (UTC)
@Kiril kovachev @Eirikr On a somewhat related note, the Unicode Consortium are currently undertaking a massive (read: years long) review of the Japanese data in the Unihan database, which is probably worth keeping an eye on. They’re well-aware that there are major inaccuracies in a lot of the data, but they also do want to clear them up - and they have some pretty knowledgeable people contributing to it. Theknightwho (talk) 01:15, 12 September 2023 (UTC)
@Theknightwho Wow, thanks for letting us know. That's certainly interesting — when did they start, do you know? We could benefit a lot, if we aren't able to handle everything before then ourselves, if we check out what changes they make when everything's finished. Kiril kovachev (talkcontribs) 09:19, 12 September 2023 (UTC)
I don't know much about KANJIDIC and the related bot mentioned above, but I think it's a good idea to ask about their sources. Then we can easily declare if they are valid or not. MathXplore (talk) 06:33, 12 September 2023 (UTC)
Sounds good. I'll send them a mail later today ^^ Kiril kovachev (talkcontribs) 09:22, 12 September 2023 (UTC)
@MathXplore, @Eirikr As you were interested, I asked Mr. Breen about where the kanjidic readings originally come from, + highlighting what anomalies we've found, and he responded:

Hi Kiril,

Thanks for making contact on this issue. This is just a brief initial response; I don't have a lot of time to address the matters you raised, and I'm heading off on some travel until late October.

When the kanjidic data was first put together over 30 years ago it was a matter of scraping together whatever was available. No attempt was made to look at sources other than basic references such as Nelson. That applied particularly to the rare kanji in the JIS X 0212 "supplementary" standard. For most of these the details were simply copied from the Unihan data. I see that most of the kanji you mentioned in the email are from JIS X 0212 and that the doubtful readings are from Unihan.

When I get a chance I'll go through them and check against some kanwa sources. I suspect that many of those readings can be dropped. As they are not common kanji this sort of checking has never been a high priority.

I hope this helps, and I'll get back to you.

Cheers

What I believe we should draw from this is, we have good license to question the odd readings, and should use some established kanwa sources to verify or reject specific readings. Kiril kovachev (talkcontribs) 16:14, 13 September 2023 (UTC)
@Kiril kovachev Thank you for doing this! This answers the question of where the info came from (Unihan) and I agree that we should drop or quarantine the doubtful readings. Benwing2 (talk) 19:32, 13 September 2023 (UTC)
@Benwing2 Agreed! And status update on the automated processing, I've been ever so slightly busy these two days, and so I've yet to finish it, but one of these coming days I hope I can get it done. Either way, this is good news imo for our goals :) Kiril kovachev (talkcontribs) 21:47, 13 September 2023 (UTC)
Thank you for the contact. Since we learned KANJIDIC is "scraping together whatever was available", I'm afraid that it may have collected exceptional readings (such as readings used only for human names, readings used when converting ancient Chinese texts to Japanese etc.). I agree that doubtful readings must be checked, but I also thought that we may need to give labels to rarely used readings to distinguish them from frequently used readings. MathXplore (talk) 12:35, 14 September 2023 (UTC)
Indeed, this is almost certainly the case. It's unfortunate that this makes it hard to discriminate the simply rare readings from the outright nonexistent or incorrect... Kiril kovachev (talkcontribs) 19:01, 20 September 2023 (UTC)
@Benwing2, @Eirikr, @MathXplore: as part of these fixes, would you all support making a template to show the Japanese name of radicals and/or kanji components? For example, 阝 (kozatohen). I think it would make more sense to say that this "kanji"'s "name" is kozatohen, rather than it being read as kozatohen; after all, it's not really a standalone kanji. Instead of putting this as a reading, therefore, we could put in the definition line something to the tune of "The kanji radical 阝, called こざとへん (kozatohen)" or the like, as we do for letters of the Latin alphabet etc.
Also, in other news, I believe I've finished the scraping business, and so I've got to set it going. Specifically Ben I hope to compile all the readings your script flagged up, and for each kanji provide a list of readings that can be removed.
For now I'm only processing kun, but in principle I can have this extended to on as well. There's also the issue that on readings are also further sub-categorized, so I think this is a good sweep for now.
Kiril kovachev (talkcontribs) 19:08, 20 September 2023 (UTC)
@Kiril kovachev Thank you! Yes I agree with your suggestion, and the comparison to letters makes sense. Benwing2 (talk) 19:13, 20 September 2023 (UTC)
I support your suggestion; helping readers understand the components can help their learning. MathXplore (talk) 07:11, 21 September 2023 (UTC)
@Kiril kovachev, I'll add a "me three" in support, alongside @Benwing2 and @MathXplore. 😄 ‑‑ Eiríkr Útlendi │Tala við mig 22:20, 29 September 2023 (UTC)

Labels in inflections of labeled terms

Hi. Currently I'm working on Fala entries, and I've just added WT:ACCEL support to the verb-conjugation template. This language has three varieties, one for each town where it is spoken. In order to not give preference to one variety over the other two, both the bibliography and the Wiktionary entries have labels (see enxagual).

My question is whether we should label non-lemmas if the parent entry is already labeled, either with {{lb}} or {{tlb}}. This doesn't only apply to Fala or verbs, but all Wiktionary languages, so I feel I should ask first for consensus. Cheers, sware🗣🏲 16:30, 10 September 2023 (UTC)

No, that kind of information should be centralised in a single lemma, unless it relates specifically to the inflected form. Alternative forms, where they're technically lemmas, are more of a grey area though it's usually still better to centralise the information at one entry, again unless it's specific to the form itself. —Al-Muqanna المقنع (talk) 16:47, 10 September 2023 (UTC)
+1, exactly. Like newmade is not a particularly "obsolete" past tense of newmake, it's just the past tense of newmake, and newmake as a whole is obsolete. Whereas, low#Etymology_2 is a {{lb|en|obsolete|nocat=1}} {{inflection of|en|laugh||sim|past}} (...actually the "nocat" is questionable, although I understand why it was added...) because it really is a specifically obsolete past tense form of laugh, which has the non-obsolete past tense laughed. - -sche (discuss) 17:18, 10 September 2023 (UTC)
Understood, thanks. The low case does appear in Fala, take the example of enxagual: the second-person plural imperfect indicative form enxaguabis only is found in Mañegu. No (simple) way I could encode that into the acceleration script, right? Anyways, that'd be more of a GP question. Thank you both. sware🗣🏲 17:34, 10 September 2023 (UTC)
It's been a while since I tinkered with any ACCEL stuff, but if the Mañegu-specific forms are predictable, e.g. if every verb generates two second-person plural imperfect indicative forms and the first one is always Mañegu-specific (or they both are, etc), then it should be possible to have the template/module code wrap that form's link in a different class so that ACCEL clicking on it can generate the label. It might make even more sense to just have a bot generate the forms. Presumably the table on enxagual should also have some indicator visible on that entry that that form is Mañegu-specific, e.g. a (qualifier) or a footnote! - -sche (discuss) 17:46, 10 September 2023 (UTC)
@Swaare User:-sche is right, you can encode the fact that the form enxaguabis is Mañegu-specific in the accelerator form (= class), and have the accelerator code generate the appropriate label. Benwing2 (talk) 19:22, 10 September 2023 (UTC)

Etymology-only Sardinian

Hi, everybody. A little while ago, I added the Sassarese entry abbaiddà, which is a borrowing from Sardinian abbaidare. There are some variants of the latter, but this specific form only appears in Logudorese Sardinian.
Now, my question is whether the Etymology subsection for abbaiddà should look like this

Borrowed from Sardinian abbaidare

or like this

Borrowed from Logudorese abbaidare

I mean, I personally can't see a reason to not use the more specific nomenclature, but I thought I'd ask. —— GianWiki (talk) 11:20, 11 September 2023 (UTC)

Logudorese is already a recognized etymology-only language, so it can be the second. Just write "Borrowed from {{bor|sdc|sc-src|abbaidare}}." —Mahāgaja · talk 11:29, 11 September 2023 (UTC)
Thank you very much for your answer. —— GianWiki (talk) 11:36, 11 September 2023 (UTC)

spelling of medi(a)eval in etymology sections

Should we stabilize the spelling of this word? e.g. on φασκιώνω we spell it Mediaeval and on παλαμάρι without the a. The shorter spelling is the one we use as the main entry (Medieval Greek), as does Wikipedia, but I didnt check for other languages such as Medieval French. This isnt a huge problem, but stabilizing it would make searches somewhat easier. Thanks, Soap 10:26, 12 September 2023 (UTC)

There aren't any languages whose canonical Wiktionary names include the word medi(a)eval, and only four etymology-only languages (CAT:Medieval Hebrew, CAT:Medieval Latin, CAT:Early Medieval Latin, and CAT:Medieval Sinhalese), all of which use the spelling medieval. As for Medieval Greek, etymology sections ought to be using the code gkm and call it Byzantine Greek. —Mahāgaja · talk 10:50, 12 September 2023 (UTC)
There is however a canonical label "mediaeval folklore" and "mediaeval" also appears as a qualifier in translations sections, derived terms etc. Search turns up 747 results. "Mediaeval" is a rather dated spelling even in the UK so I'd recommend getting rid of it and standardising to "medieval". It seems the etymology sections where it appears for Greek already are using gkm, they just tautologically specify "mediaeval Byzantine Greek". —Al-Muqanna المقنع (talk) 11:02, 12 September 2023 (UTC)
I just noticed that too. Certainly "Medi(a)eval Byzantine Greek" should be changed to simply "Byzantine Greek". —Mahāgaja · talk 11:11, 12 September 2023 (UTC)

Multiple quotations, same author, different works

I seem to remember once reading somewhere on Wiktionary about it being not recommended to provide multiple quotations by the same author for the same entry. Is there actually anything against this? —— GianWiki (talk) 10:27, 12 September 2023 (UTC)

It's not recommended because some authors have idiosyncrasies and we want to show full lexicalization, not just individualized lexicalization. If there are other quotes from other authors supporting a definition, it seems fine to include a quote from the same author if the quotes from said author particularly demonstrate usage well; however, for a term to pass WT:CFI, the quotes cannot all be from the same author. Vininn126 (talk) 10:34, 12 September 2023 (UTC)
Sorry to bother you, @Geographyinitiative, @Vininn126, but I have another question.
If there is one work by an author, and another work which is an anthology of texts of various origins (i.e. not composed by the author of the first work), but edited by the author of the first work, do both works count for the purposes of attestation?
Thanks in advance —— GianWiki (talk) 08:44, 13 September 2023 (UTC)
@GianWiki What do you mean edited? Vininn126 (talk) 09:49, 13 September 2023 (UTC)
@Vininn126, I mainly meant to say collected. It's the case of a mid-19th-century anthology of Sassarese popular songs, collected by Giovanni Spano (author of several Sassarese translations from the Bible), who I believe also provided the orthography in which these songs have been published (which is the same one used in his Bible-related works).

Also, even though it's a bit off-topic—but since I'm here, I thought I'd ask—how does one deal with an anthology where most of the texts don't have a title or a named author (sometimes neither), but some do?
I'll use examples taken from the entry a.
This text has no title or author:
  • 19th century, unknown author, ; republished in chapter XV, in Giovanni Spano, editor, Canti popolari in dialetto sassarese, volume 2, Cagliari, 1873, page 87:
    Dunca lu megliu è
    Tu pensa a la to’ pazi, ed eju a me.
    So the best is: you think about your own peace, and I about myself.
whereas this text has no title, but has a mentioned author:
  • 19th century, Gavino Serra, ; republished in chapter XLII, in Giovanni Spano, editor, Canti popolari in dialetto sassarese, volume 2, Cagliari, 1873, page 129:
    Di tanti cantendi, e tanti
    Mancuna incantesi a me,
    Ma da ch’aggiu intesu a te
    Tu sei l’unica ch’incanti.
    Of so, so many singers, not one enchanted me; yet, since I've heard you, you're the only one who enchants.
Is there a more correct way of dealing with this, perhaps? ——— GianWiki (talk) 10:25, 13 September 2023 (UTC)
I'm not 100% sure but I'd be inclined to say that it might count. Vininn126 (talk) 10:28, 13 September 2023 (UTC)
Since Spano isn't the author of the works but merely the editor of the collection, I'd say these count as having different authors. Also, since Sassarese is an LDL, a single mention (let alone a use) is sufficient to pass CFI. —Mahāgaja · talk 10:37, 13 September 2023 (UTC)
@GianWiki: If these songs were not previously published in an edition then it'd be more correct, and probably simpler, to use |year_published= and the format in the edition. In this case I would simply write {{quote-book|sdc|year=c. 19th century|author=anonymous|chapter=|editor=w:Giovanni Spano|title=Canti popolari in dialetto sassarese|worklang=it,sdc|year_published=1873|section=song 15|page=87|pageurl=https://books.google.it/books?id=TWlcAAAAcAAJ&pg=PA129|...}}:
  • c. 19th century, anonymous author, “”, in Giovanni Spano, editor, Canti popolari in dialetto sassarese (overall work in Italian and Sassarese), published 1873, song 15, page 87:
And the same format for the Serra song. Note that anonymous is now explicitly accepted as a parameter for authors, editors, etc. In general though I don't think it's any different from works in any other edited collection. On the CFI point, we'd need to use common sense about whether the editor or the context of the collection would've influenced the actual composition of the works themselves, thus potentially making them non-independent. In this case I would say they count. —Al-Muqanna المقنع (talk) 10:41, 13 September 2023 (UTC)
Thank you very much for your answer. —— GianWiki (talk) 10:50, 13 September 2023 (UTC)
Things for you to consider GianWiki (some overlap with above): At this stage, we technically do not know who wrote the page 87 excerpt. In a pinch, I don't think we know that page 87 and page 129 have two different authors: (1) could Gavino Serra have written the page 87 excerpt without Giovanni Spano realizing it? I've never seen an RFV get that strict, but it may be warranted. Also, I would question: (2) is Gavino Serra the same person as Giovanni Spano (some kind of pseudonym) and Spano is writing both of these passages himself? Or perhaps, (3) in collecting/editing the page 87 and page 129 passages, did Giovanni Spano change the text or modify the wording such that he is really becoming the author? (4) A more remote possibility: could the page 87 and page 129 passages have been written within a year of each other? (See Wiktionary:Criteria for inclusion#Spanning at least a year) These are all things I would consider. Hence I personally would not rely on page 87 and page 129 as two of three cites to meet WT:ATTEST; I personally would find a fourth cite that was clearly independent, etc. That step is probably not necessary. But to me, adding cites is not about reaching a golden three cites for some dumb rule, it's about educating future readers about the actual usage of a word and incidentally hitting WT:ATTEST in the process. --Geographyinitiative (talk) 10:54, 13 September 2023 (UTC)
@Geographyinitiative, GianWiki: I think in this case, they're independent for the lemma and dependent for the spelling of the lemma. Does LDL-ness get grandfathered, or could entries be expunged simply because a language became well-documented? (This might become relevant for Lao-script Pali, where the spellings are 20th or 21st century but the composition often occurred over two millennia ago.) --RichardW57m (talk) 17:39, 13 September 2023 (UTC)
Here's an example of something that I guess may fail RFV for having only one author giving uses and therefore not passing Wiktionary:Criteria for inclusion#Independent: Citations:Tomosteng. --Geographyinitiative (talk) 10:47, 12 September 2023 (UTC) (Modified)
Thank you, @Vininn126 and @Geographyinitiative, for your answers. —— GianWiki (talk) 11:03, 12 September 2023 (UTC)

Could kennings be moved to the etymology section as a param under the compound template?

Just as we have Category:Ancient Greek bahuvrihi compounds auto-generated by the params of {{compound}} and its relatives, so too could we have Category:Old English kennings populated by a parameter of those same etymology templates. Since right now Category:Old English kennings has just two words, there isn't a lot of cleanup work that needs to be done to prepare for the new system. Likewise Category:Old Norse kennings has only eight. This seems like a good idea to me. Thoughts? Soap 10:48, 12 September 2023 (UTC)

Just a note that the required change to the template is language-indifferent, so it won't need to be manually hard-coded for OE, Old Norse, and whichever other languages use kennings. Soap 10:49, 12 September 2023 (UTC)
@Soap: I went ahead and made the change (|type=kenning or |type=ken), and also fixed the categories; feel free to clean up the entries. Benwing2 (talk) 05:23, 13 September 2023 (UTC)
Thank you so much. This works just like I'd hoped it would. There are a few words, such as hjǫrleikr, that suggest we might want to keep the old method open as a fallback for if only one of two senses of a word is considered a kenning, and two others that I haven't yet converted to the new system because the etymology sections are more wordy than the rest, but I changed all of the others right away and I think they present the information more neatly than they did before. Soap 09:50, 13 September 2023 (UTC)
Not all kennings are compounds, for instance alda bǫrn ‘children of generations’ or glóða garmr ‘hound of embers’. How should these be handled with the new format? ᛙᛆᚱᛐᛁᚿᛌᛆᛌProto-NorsingAsk me anything 11:53, 22 September 2023 (UTC)
@Soap ᛙᛆᚱᛐᛁᚿᛌᛆᛌProto-NorsingAsk me anything 12:17, 24 September 2023 (UTC)
You changed hjǫrleikr to use the etymology section, but there are other words that will still need to be categorized in the earlier way ... I don't think anyone here intends on turning off the functionality of the labels. Soap 04:22, 25 September 2023 (UTC)

Inherited derivation

If term X is derived from term Y in language B (mnemonic Before), and language A (mnemonic After) inherits, borrows or deduces all of X, Y and the derivation rule from B, may one still list 'Y' under 'Derived Terms' of X for language A? This arises in the context of listing terms derived from a root; some of the derivations we are recording may be very ancient. The wording of WT:EL is not crystal clear, and I have an unreliable recollection of it being argued that in such circumstances as this, X would not be derived from Y in language A. However, applying such an interpretation would be deleterious pedantry when applied to derivatives of roots. --RichardW57m (talk) 14:11, 12 September 2023 (UTC)

I'm having trouble following all the algebraic constants. Can you give a concrete example of what you're talking about? —Mahāgaja · talk 14:18, 12 September 2023 (UTC)
With the complexity that X and Y have changed, Sanskrit मति (máti, matí, noun) from Sanskrit मन् (man, root), although this example can be traced back via Indo-Iranian noun *matíš from root *man-, which in turn comes from Proto-Indo-European *méntis from *men-. WT:EL prohibits English data as a derivative of English datum because both come from Latin, though as a mere plural of datum, data would not be added at datum because it is already in the inflection line, so it isn't clear as an example of over-ancient derivation. --RichardW57m (talk) 14:54, 12 September 2023 (UTC)
In this case, I would say that Sanskrit मति (mati) is inherited from Proto-Indo-Iranian *matíš, which is either derived or inherited from Proto-Indo-European *méntis (depending on whether you think it can be called inherited despite the accent shift and concomitant switch from e-grade to zero grade in the first syllable), but I would also add {{surf|sa|मन्|-ति}} to show that it hasn't lost its synchronic association with the verbal root. —Mahāgaja · talk 15:13, 12 September 2023 (UTC)
@Mahagaja: So should one remove मति (mati) from the 'Derived terms' section of मन् (man)? --RichardW57m (talk) 17:05, 13 September 2023 (UTC)
Not in my opinion, no; I believe the Derived terms section should include surface analyses even when the affixation first happened at an earlier stage. —Mahāgaja · talk 17:09, 13 September 2023 (UTC)
@Mahagaja: Good. I think that's more useful for the reader. --RichardW57m (talk) 17:23, 13 September 2023 (UTC)

should country-specific dialects count in Category:Languages of X?

I am converting {{dialectboiler}} invocations to use the new {{auto cat}} support. Some existing country-specific dialect pages are manually classified into Category:Languages of X. E.g. Category:Belizean English is manually categorized into Category:Languages of Belize and similarly for Category:Nigerian English and Category:Languages of Nigeria; but Category:Singapore English is not manually categorized into Category:Languages of Singapore. Meanwhile Category:English language itself is categorized into all of the above country-specific categories by virtue of the long list of countries specified in the call to {{auto cat}} in Category:English language. My instinct is to remove all the dialect categories from the "Languages of X" categories and only list languages; but it could be argued the other way. If we are to include such categories, IMO it should be done semi-automatically, e.g. the fact that the call to {{auto cat}} for Category:Nigerian English will have |1=Nigeria given and Category:Languages of Nigeria exists should be enough to auto-categorize Category:Nigerian English into Category:Languages of Nigeria. Benwing2 (talk) 07:04, 13 September 2023 (UTC)
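The semi-automatic rule described above can be sketched roughly as follows. This is only an illustration in Python, not how {{auto cat}} is actually implemented; the function name `parent_categories` and the plain set of existing category titles are hypothetical stand-ins.

```python
def parent_categories(country, existing_categories):
    """Return the "Languages of X" parent for a dialect category, if it exists.

    `country` stands in for the |1= value given to the dialect category
    (hypothetical modelling); `existing_categories` is a set of category
    titles that already exist.
    """
    candidate = f"Category:Languages of {country}"
    # Only categorize when the target category already exists.
    return [candidate] if candidate in existing_categories else []


existing = {"Category:Languages of Nigeria", "Category:Languages of Belize"}
print(parent_categories("Nigeria", existing))
print(parent_categories("Singapore", existing))
```

The naive name construction would need extra handling for countries whose category name takes an article, as in Category:Languages of the United States.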

Definitely. Completely sensible, especially since "Languages of " include dead languages or very small minority ones. Including dialects that are widely spoken gives a much better impression of the actual languages spoken in that place. —Justin (koavf)TCM 07:43, 13 September 2023 (UTC)
@Koavf Did you read the part where I mention that the languages themselves are already listed in these country-specific categories? Including them causes duplication between language and dialect. Benwing2 (talk) 08:04, 13 September 2023 (UTC)
I think both should be in the country category. I'm fine with both CAT:English language and CAT:American English being in CAT:Languages of the United States. It doesn't feel redundant, it feels thorough, especially since CAT:American English only lists words that are uniquely American (or North American) but not everyday words that are found in all dialects of English. —Mahāgaja · talk 08:14, 13 September 2023 (UTC)
Yes, I'm not sure how my response was itself unclear. I agree with the semi-automated inclusion via {{auto cat}} or {{dialectboiler}} or some other systematic way of inclusion. The solution to this double-counting problem could just as easily be 1.) it's not a problem, just leave them or 2.) only include the country of origin in the language itself and have dialects at the countries. Either one could be reasonably argued: I'm just trying to answer the question you asked without introducing noise. —Justin (koavf)TCM 08:24, 13 September 2023 (UTC)
@Koavf: That's worrying. I couldn't work out which proposition 'definitely' was meant to agree with. --RichardW57m (talk) 17:21, 13 September 2023 (UTC)
The thread has a yes or no question in its title, so I answered it with a "(yes) definitely". When someone asks an explicit question, I try to actually answer the question asked. —Justin (koavf)TCM 18:21, 13 September 2023 (UTC)
Seems reasonable. Vininn126 (talk) 17:15, 13 September 2023 (UTC)

Shavian-alphabet English entries

An IP has been making entries for English words in the Shavian alphabet, a constructed alphabet for English with negligible currency outside of some hobbyists, e.g. 𐑮𐑦𐑕𐑐𐑧𐑒𐑑 (respect). Do we want these? Do they need to be RFV'd with three cites? Or is a constructed script like a conlang, only deserving entries if it has stable native/professional use? We don't have any English entries in the Deseret alphabet. Shavian and Deseret have come up a few times before in various contexts (e.g. Template talk:deseret) but this might be the first time anyone's bothered systematically creating entries. —Al-Muqanna المقنع (talk) 00:37, 14 September 2023 (UTC)

If a word in the Shavian alphabet can pass CFI, then it stays. That's the way it should work, in my book. Helps us have the most-used Shavian-script words, without the flood of nonce Shavians/Deserets. CitationsFreak (talk) 01:23, 14 September 2023 (UTC)
I agree. Shavian and Deseret spellings need three cites showing usage (not merely mentions) from three independent sources over more than a year. Otherwise, they get deleted. —Mahāgaja · talk 06:03, 14 September 2023 (UTC)
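As a rough illustration of the rule just stated (three independent uses spanning at least a year), here is a minimal Python sketch. It treats "independent" as simply "different authors", which is an assumption and much cruder than what an actual RFV would weigh; the function name and the `(author, date)` representation are hypothetical.

```python
from datetime import date


def meets_attestation_sketch(cites, min_authors=3, min_span_days=365):
    """Check a list of (author, date) cites against a simplified rule.

    Assumption: independence is approximated by distinct author names,
    and "spanning at least a year" by a gap of at least 365 days between
    the earliest and latest cite.
    """
    if not cites:
        return False
    authors = {author for author, _ in cites}
    dates = [when for _, when in cites]
    span_days = (max(dates) - min(dates)).days
    return len(authors) >= min_authors and span_days >= min_span_days


cites = [
    ("Author A", date(2020, 1, 1)),
    ("Author B", date(2020, 6, 1)),
    ("Author C", date(2021, 2, 1)),
]
print(meets_attestation_sketch(cites))
```

A check like this cannot tell a use from a mention, so it could only ever pre-filter candidates for human review.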
If we are to have any entries in these scripts at all, we need to have a concept of "ConScript" (see also Constructed writing system on Wikipedia), have script codes for these scripts and CSS entries so they are displayed correctly, and ensure that they are correctly detected at the language level. This is significant and non-trivial technical work, and for this reason I think it's better to ban them entirely unless we're willing to bite the bullet and do the implementation work. Otherwise we just end up with bad-looking characters that are broken on many people's browsers. Benwing2 (talk) 06:08, 14 September 2023 (UTC)
I think most scripts whose characters get encoded into Unicode (like Shavian and Deseret) also get ISO script codes (in this case, Shaw and Dsrt, which are in our module). I don't think we can or should have entries in any scripts that aren't encoded into Unicode. - -sche (discuss) 06:53, 14 September 2023 (UTC)
This is an interesting question. It's tempting to think "include any script (e.g. Shavian) and any spelling in that script (e.g. 𐑮𐑦𐑕𐑐𐑧𐑒𐑑) if three authors have used it", but if three authors publish books (or webpages) written entirely λικε ←that or with леттерс ←like those, would we add Greek- and Cyrillic-script forms of all the English words the books have in common? If Greek bloggers tweet in Latin script, do we include that? Or, as you ask, should we require some evidence that there is a real community of native, natural users, like for e.g. historical Cyrillic Romanian? (Is the standard even higher than that? There are some naturally-used script-forms we—IMO startlingly and inconveniently—don't include, like Latin-script Yiddish.) Does the fact that this script was specifically invented for English make it more acceptable, or does the fact that it was specifically invented (within living memory) like a conlang make it less includable? If it weren't a separate script but just a new (Latin-script) orthography, and a similar-size group of hobbyists published writings all juzing thair nu speling, I feel like we'd include that with the relevant label(s). (So I'm not sure what my answer to the question is, yet.) - -sche (discuss) 07:06, 14 September 2023 (UTC)
Such Latin-script reform experiments have happened—"American spellings" were the result of a successful one—and it might be possible to find 3 distinct sources written in something like SR1. What does "independent" (per the CFI) mean in this context, though? Apparently a journal was published in the Shavian alphabet which went through a number of issues, would three different articles there count as independent even though they're all edited by the same group? @Mahagaja —Al-Muqanna المقنع (talk) 07:46, 14 September 2023 (UTC)
If we were talking about spellings (juzing for using in The Journal Of Spelling "Juzing" Like That) or words (say competitive eaters writing in Competitive Eating Quarterly all use gurgitize to mean "eat", but no-one else does), I'd say if they're written and edited by three different people, they're includable, and the people being in one small group is something to note in a label or usage note. (We have to be careful not to present rare terms as widespread, and we have to figure out how to handle the fact that e.g. SR1 spellings are all better-attested, since before the advent of SR1, as simple eye dialect, but I don't see any reason or way to exclude them.) If they were written by different people but edited by the same person, I'd consider context: maybe an American science journal editor would enforce a house style and silently change both a London doctor's foetusise and an Oxford scholar's foetusize to fetusize, so we might not consider those independent if they were edited by the same person, but if we're talking about The Journal Of SR1, or Shavian Monthly, I think we can assume that anyone who'd contribute work to there would've written in SR1 / Shavian to begin with. The only reason I see to treat Shavian-script forms different from SR1 or juzing is that it's an entirely separate script, and a "conscript" at that... but IDK, maybe we should just subject that to normal ATTEST requirements... - -sche (discuss) 15:43, 14 September 2023 (UTC)
  • Leaning exclude. The Shavian alphabet was designed to be phonetic, so IMO it's equivalent to me creating an entry at ɹɪˈspɛkt for respect. Even if it's citable, is it even English? — excarnateSojourner (talk · contrib) 18:25, 15 September 2023 (UTC)
    It's English, but we can vote to exclude it, just as we exclude the common Roman script forms of Sanskrit or any Roman script form of Thai. Wiktionary is not a tool for decoding Roman script Thai television listings. --RichardW57 (talk) 04:30, 18 September 2023 (UTC)
    User:ExcarnateSojourner's comment about ɹɪˈspɛkt reminds me that the Journal of the International Phonetic Association used to print its articles entirely in IPA, which means there are probably hundreds of words attestable from dozens of authors using spellings just like ɹɪˈspɛkt in both French and English. I doubt there's any will here to include those, either. —Mahāgaja · talk 06:47, 18 September 2023 (UTC)
    @Mahagaja: Not just the Journal of the IPA, various other things have been published in IPA too (like this edition of Alice in Wonderland). The early history of IPA was bound up with the spelling reform movement and I don't think it was originally even conceived as cross-linguistic. So on the face of it IPA could have as good a claim as Shavian. —Al-Muqanna المقنع (talk) 09:09, 18 September 2023 (UTC)
    While there is a theoretical interest to include IPA English seen like Pīnyīn, to directly look up corresponding terms for phone sequences one hears, this stuff could as well be auto-generated when pronunciation section coverage is comprehensive, and it is apt to reject any such “alternative spellings” because we must prevent people from engaging in vanity entries, to save them opportunity and the dear other editors maintenance costs. There is a good case for an oblocution anent such creators that their close ones should be embarrassed for not preventing them from engaging in so uncreative an emprise. To break it down: Is it purposeful or is it stimming? No, we can’t even check such entries for their use, it’s too stupid and we will nuke them anyway if fed up by their quantitative easing. You’ll just have not been offended enough by something you can’t distinguish from the work of an automated tool. In this fashion, there was once, as I remember, Wonderfool adding heaps of hyphenated surnames or something similar, which discussants diligently tried to match with the words of inclusion criteria, until Equinox blocked this and made him do something more useful, because IYKYK, but one can discern an a fortiori rule thesis: If an editing pattern performed by a human is an automaton and unauthorized, then it can be curbed just like an authorized bot. Bot-like actions have little merit disadvising reversion, corresponding to the little thought that went into them. Don’t forget that there are other inclusion requirements than including everything everywhere that has three attestations. There is definitely something about appraising the efforts of our editorbase and protecting it thereby, finding also expression in the ban of unauthorized bots, the sense of which is preventing human editors being overwhelmed.
Although indeed there is some oratory exercise needed to make lost IPs understand why there is consensus that we can’t suffer these things—the cost of lacking casuistry, which would be more easy to point at but, as I have expanded upon on another occasion, each step of which diminishes recognition of general reason owing to its arbitrary bulk. Fay Freak (talk) 11:52, 18 September 2023 (UTC)
    Honestly, I think there would be some interest in words written in IPA that appear in this script even in works that don't use it. For example, if there were three books that said "John had ɹɪˈspɛkt for SR1" or something like that, I would say we should include the bolded term, but not "hæd ɹɪˈspɛkt". (If this proposal means there's no words in alt scripts, so be it.) CitationsFreak (talk) 22:59, 18 September 2023 (UTC)
    @Mahagaja, CitationsFreak: While there is little need for an entry for ɹɪˈspɛkt, as a Wiktionary search will find it, the 6th edition (1944) of Daniel Jones's An English Pronouncing Dictionary has risˈpekt, which the Wiktionary search only finds in English 𐑮𐑦𐑕𐑐𐑧𐑒𐑑! While these variations may be automatable, the CFI will probably exclude most. --RichardW57m (talk) 12:13, 19 September 2023 (UTC)
    • @RichardW57 If someone created a writing system for English using logograms and it was sufficiently used by language enthusiasts, would that still be English? If someone developed a sign language that mapped English words one-to-one with signs, would that be English? — excarnateSojourner (talk · contrib) 18:45, 18 September 2023 (UTC)
      @ExcarnateSojourner: Yes, they would be. You just wrote 'for English'. To go back to first principles, someone might encounter such writing and want to know what it meant. In the second case, we would just have alternative forms. A much more difficult case would be ∃? ("Does it exist?"), which is common in my jottings, but might be translingual or purely idiosyncratic, and could probably be dismissed as SoP.
      These are where the idea of CFI demanding occurrences in books comes in - such usages would not be commercially viable. But note that English in IPA would make the cut, as did English in Deseret. --RichardW57m (talk) 11:32, 19 September 2023 (UTC)
  • FWIW, I'd prefer that we not allow these as entries. The discussion does contain interesting arguments, though. DCDuring (talk) 19:04, 18 September 2023 (UTC)

Translation of a translation (of a translation etc.)

Hello everyone. How do you deal—in terms of quotes—with a translation of a translation?
I'm using the Sassarese translations of several books of the Bible as sources for quotes. These translations are based on an 18th-century Italian translation of the Latin 1592 Clementine Vulgate—which is based on Vetus Latina texts (which, in turn, are based on the Ancient Greek Septuagint, translated from Hebrew texts). This is an example of how I'm currently handling it:

  • 1863, Antonio Martini, chapter I, in Giovanni Spano, transl., Lu càntiggu de li càntigghi di Salamoni, London, translation of Il cantico de' cantici (in Italian), verse 14, page 6:
    Tu sei veramenti bedda, o amigga meja, veramenti bedda: l’ occi toi sò di culombi.
    You are very beautiful, o lover of mine, very beautiful. Your eyes are those of doves.

My question is: should one be concerned with... I don't know, doing something in order to mention that a translated work is based on another translated work?
I realize this is probably an extreme example, but I'm curious to know if there's any point in even considering this.
Thanks in advance. —— GianWiki (talk) 10:39, 14 September 2023 (UTC)

For the Bible I definitely don't think you should bother putting original dates and languages and so forth. See {{RQ:King James Version}} for example. You've already indicated it's a translation from an Italian version, which is enough. —Al-Muqanna المقنع (talk) 10:57, 14 September 2023 (UTC)
I see. Thank you very much for your answer.
GianWiki (talk) 12:49, 14 September 2023 (UTC)
Yeah, in the case of the Bible, I think people know it wasn't originally written in Italian, or King James' English, etc (so they should know the Italian edition is itself translating another language). If there were some exceptional situation where it was necessary to go 'another step backwards' to show something relevant about the word being quoted — e.g. the Sassarese edition of something uses the word under discussion in one place but then uses a different word later in the quote, whereas the Italian version uses the same word twice, but the Latin/Greek/etc does use two different words, which the Sassarese edition is reflecting — I think we could resort to some exceptional solution like tagging each of the Italian words with {{transterm}}. - -sche (discuss) 15:03, 14 September 2023 (UTC)
Using |author= or |origXYZ= is improper since the 1770s Italian translation isn't an "original" of anything. Just ignore all of the intermediate steps.
  • 1863, Giovanni Spano, transl., Lu càntiggu de li càntigghi di Salamoni, London: Impensis Ludovici Lugiani Bonaparte, chapter 1, verse 14, page 6:
    Tu sei veramenti bedda, o amigga meja, veramenti bedda: l’ occi toi sò di culombi.
    You are very beautiful, o lover of mine, very beautiful. Your eyes are those of doves.
Ioaxxere (talk) 17:00, 14 September 2023 (UTC)
When I cite the Bible in Gothic entries, I usually use the King James Version in the translation, since both the Gothic and the KJV are fairly literal translations of the Greek, which means they match up pretty well with each other (but I do deviate from KJV when it really doesn't match the Gothic). In this case, you could use Douay-Rheims, which is also a translation of the Vulgate and thus probably matches up pretty well with the Sassarese Bible. And it's out of copyright, so you don't have to worry about using too much of it. —Mahāgaja · talk 20:35, 14 September 2023 (UTC)
@GianWiki FWIW, in the documentation of {{quote-book}} there's an example of a translation of a translation. There's also the |origtext= param for indicating the original text in circumstances like these. Benwing2 (talk) 06:36, 15 September 2023 (UTC)

How to quote a book containing translations of various poems from various authors?

Hi. How do you quote a book where each chapter is a translation of a poem (with works from 5 different authors translated throughout the entire book)? I don't think using |deriv=translation and |original= cuts it, because that refers to the whole quoted book as a translation of a single work, and not single chapters. —— GianWiki (talk) 16:34, 14 September 2023 (UTC)

@GianWiki Are you somehow quoting the "whole book" or a single chapter? If the latter case, there are standard ways of doing this, e.g. |chapter_tlr= for a chapter translator and |author= for the chapter original author. Benwing2 (talk) 06:34, 15 September 2023 (UTC)
I also thought about doing something like this
  • xxxx, Poem translator, transl., “Translated poem”, in Anthology, translation of Original poem by Original poem's author:
    Quoted text
    English translation of quoted text
but it indicates Anthology—instead of “Translated poem”—as a translation of Original poem (maybe there is just no better way of doing it, since you can't specify when a single chapter is a translation on its own). —— GianWiki (talk) 07:07, 15 September 2023 (UTC)

Should Wiktionary be the "best free & open dictionary" or the "best dictionary"?

I ask this because the current header for the Grease Pit reads "the best free and open online dictionary". I believe that this should read "the best dictionary", since we should aspire to the best dictionary of all time, and not merely the best free one. (Yes, I know that removing that means that some schmuck may think that means shoving it behind a paywall and putting in ads is what we want, but that is not making it the best dictionary anyway.) CitationsFreak (talk) 01:13, 15 September 2023 (UTC)

I think "free and open" is absolutely central to the philosophy and ethics of this project. Equinox 01:22, 15 September 2023 (UTC)
Absolutely. I think it is an important thing that we remain free and open. However, it just sounds like there are dictionaries that are not free/open which are better than us, and we're fine with that. Which we shouldn't be. CitationsFreak (talk) 01:27, 15 September 2023 (UTC)
If it's just at the Grease Pit, I think it's kind of inside baseball. How many readers who are unfamiliar with the ethic of the project are ever going to arrive at that page? bd2412 T 02:40, 15 September 2023 (UTC)
"It is also a place to think in non-technical ways about how to make the best free and open online dictionary of “all words in all languages”." seems wrong. I would have thought that BP is the place for such policy matters, just as GP, not a user page, is the right place to discuss larger technical matters that impinge on and implement policies. I thought the basic principle would be to have discussions in forums that are inclusive ones, with audiences as large as practical of those affected by the matters discussed. DCDuring (talk) 12:11, 15 September 2023 (UTC)
Essential to me to mention these aspects. --Geographyinitiative (talk) 12:33, 15 September 2023 (UTC)

Thai Mon vs. Mon

@-sche In Oct 2022, User:Octahedron80 created a new language called "Thai Mon". We also have the Mon language. There is no mention of a Thai Mon language in Wikipedia; the closest is a paragraph in the Mon language entry that says "Thai Mon has some differences from the Burmese dialects of Mon, but they are mutually intelligible. The Thai varieties of Mon are considered "severely endangered."". We also have a Category:Thai Mon category, which is supposed to represent the Thai dialect of Mon and is added by the 'Thailand' label. I'm extremely skeptical that we need a separate Thai Mon language. I'm not sure if there was any discussion that led to this split or if it was just a "be bold" moment. I suggest we delete Thai Mon and move the entries to Mon, with the 'Thailand' label. Benwing2 (talk) 06:32, 15 September 2023 (UTC)

It is hard to say. At first, I had the same thought: to use Myanmar/Thailand tags to specify the dialect, adding them to pronunciations and definitions. That was until the (Myanmar) Mon person, Intubesa, joined in. He did not know what we were doing and he was overconfident; he claimed that Thailandish Mon (Thai Mon) words were spelled wrongly, and he said that Wiktionary must use only Myanmar forms, because he relied on the ancient texts he worked with. I knew dialects could be read or spelled differently by location, that is the nature of language, so I got some Thailandish references and collected alternative forms for a term. But he did not accept them and argued that the references were wrong too (even though they were written by true Mons or experts). Then the edit war happened: if I wrote a Thai form, it would be renamed to the corresponding Myanmar form (and the Thai form was neglected). The argument also spread to other innocent users. Nevertheless, before he got banned, he suggested splitting Thai Mon out of Myanmar Mon so they would not be mixed up. I agreed with this idea because there are a lot of words that are read or spelled differently enough to justify a split. The two Mons also have different grammar, e.g. Thai Mon uses noun+modifier whereas Myanmar Mon uses modifier+noun. The paper "The Mon language: recipient and donor between Burmese and Thai" explains this situation. I also use the concept at Thai Wiktionary and he has never gotten in the way again. --Octahedron80 (talk) 14:47, 15 September 2023 (UTC)
If Thai Mon cannot be a "particular language", then the category "Thai Mon" could make use of dialectboiler to make a local portal for the dialect, like English, Spanish, Portuguese, etc. Unfortunately, the term "Thailandish" is not accepted here; just "Thai" might then be ambiguous with the Thai language in automatic categorization. --Octahedron80 (talk) 14:57, 15 September 2023 (UTC)
@Octahedron80 Thanks for the response. We can create language-specific labels so there's no ambiguity with a label like 'Thailand' applied to Mon entries. If you don't mind I will merge them. In general, when you have issues with someone like this, the correct thing is to get them banned rather than splitting a language. Benwing2 (talk) 18:28, 15 September 2023 (UTC)
@Benwing2: That didn't seem to work - Wiktionary:Beer parlour/2022/October#Thai Mon --RichardW57 (talk) 23:57, 17 September 2023 (UTC)
@RichardW57 I'll fix this myself once the lemmas have been moved back to being Mon lemmas. If you can move the lemmas under Mon and tag them with the Thailand label (which already exists), they will be correctly classified under Category:Thai Mon. There are only 23 lemmas so it shouldn't take too much time. Benwing2 (talk) 01:43, 18 September 2023 (UTC)
The Thai script lemmas are a problem. The test case is Thai Mon กะนิบ, whose existence is challenged at WT:RFVN#กะนิบ. Indeed, many of them could, it seems, be summarily deleted not merely for not existing, but for not even having any mentions mentioned. I'm not motivated to do work on them that will simply be deleted. I could and I suppose I should try looking for attested Burmese script forms. I'm not sure how to quickly check for a Burmese-script form recently used in Thailand, though. My attempt to buy a copy of a Thailand Mon dictionary almost resulted in us being scammed of the demanded purchase price. Such items seem to rapidly go out of stock. I'm not sure whether my not receiving a copy of Shorto's dictionary was a scam or not, but that wouldn't have been down to a Thai scammer. --RichardW57m (talk) 11:33, 18 September 2023 (UTC)
The Burmese script lemmas will also need work. For example, the Mon တ္ကံ and Thai Mon တ္ကံ can be merged, but the pronunciation may need some work, as I'm sure the currently given pronunciation from Thailand is only applicable to the spellings with the consonant stack. The data from Thai Mon ตะเกาะ will have to be merged in, because the principal form for 'Mon Thai' is in the Thai script, whereas even if the Thai script forms survive, the principal form will be in the Burmese script. --RichardW57m (talk) 11:33, 18 September 2023 (UTC)
@Benwing2: What didn't work was that blocking Intobesa didn't forestall the creation of the Wiktionary Mon Thai language. --RichardW57m (talk) 11:35, 18 September 2023 (UTC)
@Benwing2 Is there any reason not to delete the labels 'Myanmar' and 'Thailand' on a Mon item when it has both? --RichardW57m (talk) 16:00, 18 September 2023 (UTC)
@RichardW57 IMO, no particular reason to have both Myanmar and Thailand labels. The intention of those labels is to indicate terms that are restricted to specific dialects; if you add them when the term is attested everywhere, it defeats the purpose. As for the bogus lemmas, we should give User:Octahedron80 time to respond and then just delete them. I'm not sure about the pronunciation issues, that is outside of my domain of expertise. Benwing2 (talk) 18:40, 18 September 2023 (UTC)
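The point about labels can be put as a small rule: when an entry carries every dialect label, none of them restricts anything, so all can go. A minimal Python sketch follows; the function name and the flat list-of-labels representation are hypothetical, and no existing bot is being described.

```python
# The two dialect labels discussed in this thread.
DIALECT_LABELS = {"Myanmar", "Thailand"}


def prune_dialect_labels(labels):
    """Drop the dialect labels when an entry carries all of them.

    Other labels are kept, in their original order. Entries with only
    one of the dialect labels are returned unchanged.
    """
    if DIALECT_LABELS <= set(labels):
        return [label for label in labels if label not in DIALECT_LABELS]
    return list(labels)


print(prune_dialect_labels(["Myanmar", "Thailand", "archaic"]))
print(prune_dialect_labels(["Thailand", "archaic"]))
```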
@Benwing2 I had wondered if the bot operation you mentioned would depend on the labels to find the terms to operate on. --RichardW57m (talk) 09:12, 19 September 2023 (UTC)
He's had nearly 11 months already! Perhaps I should RfV them all, grouping them in WT:RFVN. I think I will split them into two groups, those given a reference and those not. --RichardW57m (talk) 09:12, 19 September 2023 (UTC)
The pronunciation issue is a mix of spellings and pronunciation which all belong to the same lemma, but with not all pairs of spelling and pronunciation going together. I'll raise the question in a more general form. I was just giving a warning of the complexity. Mind you, it's the job that never gets started that takes longest to complete. --RichardW57m (talk) 09:12, 19 September 2023 (UTC)

-less, -lessly, and -lessness

Could we have a bot or a script do the following things:

  1. Add a derived terms section to any page ending in -less, linking to the corresponding -lessly and -lessness pages, if they exist;
  2. Tag the -lessly and -lessness pages as {{lb|en|rare}} if they do not have cites or at least a usex (e.g. embryolessness is not a common word, since we most commonly would say "lack of embryos" or the like).

I consider this a low-priority task and won't be upset if we decide it's not worth writing the code and running the script. Some of these words might not even pass RFV, but I don't want to see 1,200 RFV's or even a mass-RFV/D that would delete all the hard work we've put in unless someone cracks the books looking for three cites for every single word. Soap 14:57, 15 September 2023 (UTC)
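
For illustration only, the first request could be sketched in Python roughly as follows. This is a minimal sketch that assumes the list of existing page titles has already been obtained somehow (e.g. from a dump); the function name and the sample titles are hypothetical, and it only proposes wikitext for a human or bot operator to review rather than editing anything.

```python
def derived_terms_for(titles):
    """Map each existing -less page to its existing -lessly / -lessness pages."""
    titles = set(titles)
    result = {}
    for title in sorted(titles):
        if not title.endswith("less"):
            continue
        # Keep only derived forms that actually exist as pages.
        derived = [title + suffix for suffix in ("ly", "ness") if title + suffix in titles]
        if derived:
            result[title] = derived
    return result


def as_wikitext(derived):
    """Render a proposed derived terms section as plain {{l}} links."""
    lines = ["====Derived terms===="]
    lines += ["* {{l|en|%s}}" % term for term in derived]
    return "\n".join(lines)


# Hypothetical sample of page titles.
sample = {"homeless", "homelessly", "homelessness", "embryoless", "embryolessness", "bless"}
for base, derived in derived_terms_for(sample).items():
    print(base)
    print(as_wikitext(derived))
```

Note that a plain suffix test also matches words such as bless, which merely end in the letters "less"; these are harmless unless a matching page exists, but the output would still want a human look before any bot run.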

I don't think it's very safe to assume that such a form is rare if it doesn't have uxes or cites. For example, a very common word homelessness has neither. It's often just that they don't get a lot of attention because there is usually not much to talk about there as they just often say "the state of being X". lattermint (talk) 15:04, 15 September 2023 (UTC)
Would definitely oppose auto-tagging "rare" as well, though the first request seems reasonable enough. I can see a good number of attestations of embryolessness in technical literature as well so at most it would be "uncommon". —Al-Muqanna المقنع (talk) 15:07, 15 September 2023 (UTC)
  • While we're there, can someone please write a script to link to all of these guys, thereby saving me a massive amount of time? Jewle V (talk) 12:32, 16 September 2023 (UTC)
    I appreciate all of the work you've put into this linking project, which you've done almost single-handedly. To be honest my first reaction to this post was that you've brought us almost all the way to completion and I wondered why you'd want to give it up now, but it looks like even what's left is still quite a lot of work. It's also possible that the few remaining pages are longer or more complicated than what you've already run through.
    Anyway, I agree a script would be an ideal solution, since only a script could check to see which terms on the list have already been added in the meantime and therefore refrain from adding them again. One thing, though .... there would be just a tiny number of false positives if we made it fully automated, such as potentially linking wi-fi as derived from wi (which I fixed just now) and linking yu toy from toy (which we actually did earlier). Maybe we could go over the list before we run the script and remove anything obvious so that we don't have to deal with it later. (I'd think the "do it first and clean up later" approach would work too, but sometimes people don't want to do the clean-up phase). Best regards, Soap 04:02, 17 September 2023 (UTC)

Appendix Part of Speech Templates

Currently, this is what one must do on an appendix page to get it to function correctly: ((en-noun|((apdx-l|plural)))) (replace parentheses with curly brackets), which is more complicated than it needs to be. Templates should automatically link to an appendix page if within an appendix. Otherwise, they will function as normal in the main namespace. Netizen3102 (talk) 19:42, 15 September 2023 (UTC)

Does this really need to be a formal vote? Has this already been brought up in Grease Pit? AG202 (talk) 20:09, 15 September 2023 (UTC)
Yeah, it has been brought up, 13 years ago. Some appendix words may benefit from having their own pages. Netizen3102 (talk) 21:30, 15 September 2023 (UTC)

mismatch between Proto-Foo-Romance and Romance families

@Nicodene, -sche We have a serious mismatch between the Romance classification as found in the families in Module:families/data and the proto-languages found as subcategories of Category:Vulgar Latin. Trying to match them up we have:

  1. Category:Proto-Balkan-Romance approximately lines up with Category:Eastern Romance languages but the naming should be harmonized.
  2. Category:Proto-Italo-Western-Romance: No match. This includes part of Category:Italo-Dalmatian languages, plus Category:Gallo-Italic languages, Category:Occitano-Romance languages, Category:Oïl languages, Category:Rhaeto-Romance languages, Category:West Iberian languages and two languages hanging out by themselves: Category:Franco-Provençal language and Category:Venetan language.
  3. Category:Proto-Italo-Romance: The closest corresponding family is Category:Italo-Dalmatian languages, but they don't line up.
  4. Category:Proto-Western-Romance: No match. This includes Category:Occitano-Romance languages, Category:Oïl languages, Category:Rhaeto-Romance languages, Category:West Iberian languages, the unattached Category:Franco-Provençal language and possibly some or all of Category:Gallo-Italic languages.
  5. Category:Proto-Gallo-Romance: No match. This presumably includes Category:Gallo-Italic languages and Category:Oïl languages.
  6. Category:Proto-Ibero-Romance approximately lines up with Category:West Iberian languages. ("West Iberian" is confusing as it corresponds to all of Iberia except Catalan, and in no way, shape or form corresponds to western Iberia, which would approximately be Galician and Portuguese. I'm not sure if this naming is intended to exclude Catalan, or to disambiguate the family from the other Iberia in the country of Georgia, or to disambiguate the family from the ancient Iberian language.)

I would assume we also need a Proto-Rhaeto-Romance category. Benwing2 (talk) 01:30, 16 September 2023 (UTC)

Hi @Benwing2.
Category:Vulgar Latin reflects the classic branching model found in the Romance linguistic literature. In effect Proto-Romance splits into Proto-Italo-Western Romance, Proto Balkan Romance, and *Proto Insular Romance (> Sardinian). Proto-Italo-Western subsequently splits into Italo-Romance and Western Romance. The latter in turn splits into Gallo- and Ibero-Romance, plus whatever one wishes to call the Western Romance varieties of Northern Italy, Switzerland, and part of Croatia. It is a bit difficult to represent this branching on Wiktionary, however.
In any case the classifications in Module:families/data are indeed in need of revision.
The bad news is that there is no comprehensive modern classification scheme devised by specialists in Romance linguistics, that I am aware of. The good news is that this may be set to change next year with the release of the Manual of Classification and Typology of the Romance Languages (per De Gruyter, so easily accessible to us). Nicodene (talk) 03:37, 16 September 2023 (UTC)
Non-Balkan Romance was long a dialect continuum, so dialect groups should be expected to overlap. --RichardW57 (talk) 23:44, 17 September 2023 (UTC)
Perhaps worth mentioning that this matter was previously discussed here. It should be noted, however, that while Koryakov may be a linguist, he is not a specialist in Romance. Nicodene (talk) 04:11, 16 September 2023 (UTC)

Can hyponyms be included in Derived Terms?

"Postman" is both a derived term of "man" (it's post + man) and a hyponym of "man" (because a postman is a type of man, i.e. one who delivers the mail). Please look at this: . @Whoop whoop pull up says that we mustn't include derived terms if they are also hyponyms. I believe this is wrong. Can someone show me policy? Thanks. (-sche will be delighted to see me arguing to keep trans woman as a hyponym of woman. I only care about the words... mostly.) Equinox 01:53, 16 September 2023 (UTC)

My comment was based on the bit in the derterms section of woman that explicitly says that that section's specifically for derterms that aren't also hyponyms. Whoop whoop pull up Bitching Betty ⚧️ Averted crashes 02:34, 16 September 2023 (UTC)
Sorry, what is "the bit"? Some kind of HTML comment in there that I didn't spot? Does this override policy, and/or common sense? Equinox 06:25, 16 September 2023 (UTC)
The note at the beginning of the derterms section saying "Derived terms of woman without hyponyms" (emphasis added). Whoop whoop pull up Bitching Betty ⚧️ Averted crashes 12:57, 17 September 2023 (UTC)
Policy, shmolicy. There is some lunacy here. Would occupational hyponyms appear once for the adult-male-only definition and again for the adult-human definition? Should any role, occupational or other taken on by an adult male appear under Hyponyms. Why not each given name, surname, nickname, ethnonym? Hyponyms should be useful, not complete, provided we are still trying to help humans. DCDuring (talk) 02:36, 16 September 2023 (UTC)
If something is both a derived term and a hyponym, I don't think there's anything per se incorrect about including it in both places, although in the case of occupations I will second DCDuring's comment that it would seem ... non-useful ... to list every possible _man occupation (postman, milkman, fireman, mailman, bagman, salesman, ...) as a hyponym of man, let alone to list them all in both places; I think we need some sanity check for when some class of hyponyms contains too many terms for it to be reasonable to list them all, even though the border will probably be a bit fuzzy. (Listing, say, "crone" as a hyponym of "woman" seems OK. Listing "girl" is probably OK, people use "woman" in an age-nonspecific way enough. Listing "trans woman" and "woman-born woman", fine. Listing "saleswoman", "spokeswoman", "firewoman", "mailwoman", and a zillion others: now we're getting into territory that seems less useful.) - -sche (discuss) 13:55, 16 September 2023 (UTC)
For postman another facet is that the listed derivation is from the suffix -man and not the word "man". Adding those terms to the derived terms section of man would just be manually reduplicating Category:English terms suffixed with -man. IMO occupational terms ending -woman should really be treated the same way with a suffix and an automatically populated category rather than the huge manual list currently at woman. I'm also reminded that we have similar quixotic hyponym lists lurking in the thesaurus like the comprehensive list of world countries included as hyponyms at Thesaurus:country. —Al-Muqanna المقنع (talk) 23:20, 16 September 2023 (UTC)
As a point of possible historical interest, when faced with the prospect of creating categories for terms used in compounds, people refused. But this was before we had fully implemented {{suffixsee}} and {{prefixsee}}, so the outcome may well be different now. Another possible explanation for this idea in the past may have more to do with the singer than the song. DCDuring (talk) 23:26, 16 September 2023 (UTC)
In general, I would say yes. All derived terms should appear in "Derived Terms". However, in this specific case, I'd say that "postman" is a derived term of "-man". CitationsFreak (talk) 15:44, 17 September 2023 (UTC)
  • As the main adder of multiword Derived terms in English entries over the last year, I sometimes separate them and sort them, sometimes bung them all together under Derived terms, and sometimes put them in both places. Ideally a bot would do the bulk of that work, but many of them need a skilled human eye to sort them into the correct place. In absence of skilled human eye, yours truly has done it. Jewle V (talk) 21:40, 17 September 2023 (UTC)

finally eliminating Prakrit languages

(Notifying AryamanA, Kutchkutch, Bhagadatta, Inqilābī, Msasag, Svartava, RichardW57): For over a couple of years now we've had a clash between Prakrit varieties as full languages and as etym varieties of the "Prakrit language", with two different codes for each variety. From what I gather, the intent was to switch to etym varieties of a single Prakrit language, but this was never completed. I notice that there are no remaining lemmas of the Paisaci Prakrit, Khasa Prakrit, Magadhi Prakrit and Ardhamagadhi Prakrit languages. I have already deleted the Paisaci Prakrit (inc-psc) and Khasa Prakrit (inc-kha) languages and switched all remaining references to use the equivalent etym codes (inc-psi and inc-khs). I am going to do the same for Magadhi and Ardhamagadhi. However, there are 3 lemmas in Category:Maharastri Prakrit lemmas and 14 lemmas in Category:Sauraseni Prakrit lemmas. We need to get rid of those (i.e. convert to Prakrit lemmas) before eliminating them. Can one or more of you help with this? This should be easily doable by hand. Benwing2 (talk) 04:02, 16 September 2023 (UTC)

I'm confused. Should the etymology codes appear in the 'Descendants' section? The links for them in the 'Descendants' section of आम्र (āmra) are orange rather than pointing to Prakrit 𑀅𑀁𑀩 (aṃba). --RichardW57 (talk) 05:22, 16 September 2023 (UTC)
@RichardW57 Are you referring specifically to the Maharastri and Sauraseni Prakrit descendants? This is currently messed up due to the two codes with the same name and using the codes for the full languages. Once we switch them over to use the etymology codes (and maybe it's also necessary to delete the corresponding full languages), this will get fixed. Benwing2 (talk) 05:37, 16 September 2023 (UTC)
BTW I can do this automatically by bot; give me a little while and it will be done. Benwing2 (talk) 05:38, 16 September 2023 (UTC)
I'm not sure what 'this' and 'it' are. I've converted the Maharashtri Prakrit lemmas. --RichardW57 (talk) 05:49, 16 September 2023 (UTC)
@RichardW57 Thank you for doing that! What I meant is, I can change by bot the full-language codes for the various Prakrit varieties to the corresponding etym-only codes. That is not hard as I've added tracking for all the full-language Prakrit codes to Module:languages. Benwing2 (talk) 05:53, 16 September 2023 (UTC)
BTW I have given new codes to all the Prakrits, keeping the old ones as aliases for now until I have a chance to rename them. The old codes were badly named by someone who didn't understand our system of codes. The least bad ones are the inc-* ones, which should be pra-*. The worst ones are the *-prk codes, because any foo-bar code is supposed to be a variant of code foo, but in fact in the *-prk codes, the first part refers to a completely unrelated language; e.g. abh-prk is the old code for Abhiri Prakrit, but abh is Tajiki Arabic. All the new codes are of the form pra-*, where the * is consistently the first three letters of the etym language's canonical name. Benwing2 (talk) 06:24, 16 September 2023 (UTC)
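
As a rough illustration of the naming rule described above (the prefix pra- plus the first three letters of the canonical name), here is a minimal Python sketch. The function name is hypothetical, and the outputs simply follow the stated rule; they have not been checked against the actual language data modules.

```python
def new_prakrit_code(canonical_name: str) -> str:
    """Build a code of the form pra-xxx from the first three letters of the name."""
    return "pra-" + canonical_name[:3].lower()


for name in ("Abhiri Prakrit", "Maharastri Prakrit", "Sauraseni Prakrit"):
    print(name, "->", new_prakrit_code(name))
```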
Done --RichardW57 (talk) 07:10, 16 September 2023 (UTC)
I have deleted the language codes for all 7 Prakrit varieties moved under Prakrit language. Benwing2 (talk) 23:10, 18 September 2023 (UTC)
And I've fixed the failure of Module:Brah-translit/testcases caused by it using a now-removed ISO 639-3 language code. --RichardW57m (talk) 14:52, 19 September 2023 (UTC)
@RichardW57, Benwing2: Excellent work completing this much delayed change! —AryamanA (मुझसे बात करेंयोगदान) 22:02, 1 October 2023 (UTC)

Positioning of Further reading, References, etc.

I've been coming across some entries of late where there are reference and further reading headers in between sections, for instance rebate, which has the further reading section between the noun sense and the verb sense (and no references or further reading for the verb sense). Although I can see why someone may do this, I find it cluttered and liable to duplication. My understanding based on WT:EL is that these headers should come right at the end of each language's entry and be on the same level as etymology, anagrams, etc. What do others think, do we have (or should we have) an actual policy on this? Helrasincke (talk) 17:06, 16 September 2023 (UTC)

WT:EL is actually fairly loose about this. Search "nesting" for the relevant part: broadly, it says that any sections pertaining specifically to a specific part of speech (or etymology) should be nested under it. Another, more common example is nesting these sections at the end of an etymology section in an entry with multiple etymologies, and the "Alternative forms" header is also relatively frequently nested in this way. AFAIK, the only header that is never nested like this is "Anagrams", since it can never only pertain to one subsection. On the other side of the coin, you'll also find occasional entries that promote sections that are usually placed at lower levels (e.g. Latin mina has a single declension section instead of listing the same table under every etymology). I would say in the specific case at rebate though I don't see much advantage to nesting it rather than putting it at the end. —Al-Muqanna المقنع (talk) 18:27, 16 September 2023 (UTC)
@Helrasincke: You tell an untruth about rebate. The 'further references' section is part of the lemma entry for the noun. And there are multiple noun senses, so you can't just mean a noun sense when you write 'between the noun sense and the verb sense'. --RichardW57 (talk) 10:50, 17 September 2023 (UTC)
That said, we recently had a discussion where I think we concluded that references and further reading should come at the end of the language section, though there may have been an alternative view that they may instead go at the end of term-containing etymology sections. It was decided that the policies should be amended to prohibit references at the end of lesser etymology sections. --RichardW57 (talk) 10:50, 17 September 2023 (UTC)
If there was such a discussion it hasn't been implemented, but in general any contested change to actual policy like EL needs a formal vote in any case. —Al-Muqanna المقنع (talk) 11:18, 17 September 2023 (UTC)
@Al-Muqanna: The discussion is at WT:BP/2022/December#Location_of_Footnotes_for_Etymologies. --RichardW57m (talk) 17:20, 19 September 2023 (UTC)

Classes for Sanskrit Roots

Right now, we have classes for verbs (e.g. Category:Sanskrit class 6 verbs) and classes for roots (e.g. Category:Sanskrit_roots_of_class_8). This is redundant, and now that we're updating Sanskrit roots and verbs, I'm in favor of abandoning the latter. Each root can have several formations, like करोति (karoti, Class 8), कृणोति (kṛṇoti, Class 5), करति (karati, Class 1), and कर्ति (karti, Class 2), and all such verbal formations are lemmas in their own right and properly categorized underneath the label of "Sanskrit class X verbs". Putting stuff like "1 A" beside the root headword and categorizing the root like that is just confusing. (Notifying AryamanA, Bhagadatta, Svartava, JohnC5, Kutchkutch, Inqilābī, Getsnoopy, Rishabhbhat, RichardW57): Dragonoid76 (talk) 07:51, 17 September 2023 (UTC)

@Dragonoid76 Agreed. Benwing2 (talk) 01:46, 18 September 2023 (UTC)

Unforewarned deletion of Prakrit words

Why was Prakrit 𑀯𑀚𑀺𑀭𑀗𑁆𑀕𑀩𑀮𑀺 (vajiraṅgabali) deleted a few hours ago without warning? The deleter was @Pulimaiyi. I can at least make a guess as to why Prakrit vajiraṅgabali was deleted without warning, though I'd be assuming a non-existent policy. --RichardW57 (talk) 10:37, 17 September 2023 (UTC)

Those were unattested, that's the most likely reason. And anyways, it was a good decision as Prakrit editors are not as active and RFV would be a waste of time. Clearly the entry was an IP's trash creation. Svartava (talk) 12:29, 17 September 2023 (UTC)
@RichardW57 I will tell you this: no word that I have cited, even with one cite only, has ever been deleted. In fact, I've literally WANTED to delete entries, and if there was one or maybe two on there, it wouldn't be deleted. Only stuff I've added with maybe only one very questionable cite has ever been deleted. --Geographyinitiative (talk) 12:33, 17 September 2023 (UTC)
@Geographyinitiative, Benwing2: So next time asks me to fix an entry with neither a quotation nor a mention, should I tell him to take a hike because it might suddenly be deleted as soon as someone notices I've edited it? --RichardW57 (talk) 23:32, 17 September 2023 (UTC)
Yes, there is a Wiktionary policy that prohibits users from adding words that do not exist and that policy, unlike the word I deleted, is existent. There is no point in stalling something so obvious by taking it to the RFD. -- 𝘗𝘶𝘭𝘪𝘮𝘢𝘪𝘺𝘪(𝘵𝘢𝘭𝘬) 12:37, 17 September 2023 (UTC)
@Pulimaiyi: So why haven't you fixed the etymology that links to it? --RichardW57 (talk) 23:34, 17 September 2023 (UTC)
@Pulimaiyi I would strongly recommend when you delete a term that you include a comment in the deletion message explaining why it was deleted. Otherwise an outsider like me has no idea whether the deletion is justified. Benwing2 (talk) 01:39, 18 September 2023 (UTC)
@Benwing2, RichardW57: Well, I'll admit I needed to do more than just delete it. I was alerted of this entry on Discord and told to take a look; I saw that this term does not exist, so I offhandedly deleted it on mobile mode without bothering to provide an explanation. -- 𝘗𝘶𝘭𝘪𝘮𝘢𝘪𝘺𝘪(𝘵𝘢𝘭𝘬) 12:48, 19 September 2023 (UTC)

Category for lects/varieties?

Currently we have categories like Category:Regional English and Category:Regional Ancient Greek for regional dialects of languages. But there are also non-regional lects, e.g. chronolects (Category:Early Modern English) and sociolects (Category:English Polari slang, which I think should probably be called Category:Polari; I think also Category:Epic Greek, although maybe that is a different type of lect; also ethnolects like Category:African-American Vernacular English). Currently, these non-regional lects go directly under the top-level language category but I think it's a good idea to group them under some subcategory, which would also be the parent category of Category:Regional English, Category:Regional Ancient Greek, etc. What should the name of this category be? Possibilities:

Benwing2 (talk) 23:30, 18 September 2023 (UTC)

Offhand I like "Varieties of English" best ("English varieties" second best). Side note, it strikes me as quite ... suboptimal ... that entries are haphazardly in Category:Regional English—that top-level category, not a subcategory, which would be sensible—or in Category:English dialectal terms, or in both. BTW, how do we want Category:English dialectal terms and the new varieties category to interact? At all? Not at all? - -sche (discuss) 05:30, 19 September 2023 (UTC)
@-sche God, this is a mess. There's a "regional" label that classifies into Category:Regional English and a "dialectal" label that classifies into Category:English dialectal terms. Some of each has a location label next to it, but some don't. Maybe there's a slight difference between the two in that "regional" could mean "as used in standard English in such-and-such region" whereas "dialectal" could mean "as used in dialectal English in such-and-such a region" but I seriously doubt this distinction is maintained (or even maintainable?). Another possible difference is that "dialectal" could refer to sociolects like AAVE whereas presumably "regional" only refers to regional lects. I would (a) merge "regional" and "dialectal" labels into Category:English dialectal terms, (b) place this category under Category:Varieties of English, (c) see if I can do a bot run to remove 'regional' and 'dialectal' labels when there's an adjoining toponym label. Benwing2 (talk) 06:03, 19 September 2023 (UTC)
I would be hesitant to bot remove those labels whenever there's a toponym label, if I understand what you mean correctly. I think some people use labels like "US, dialectal" to mean the term is used in either (1) certain regions of the US or (2) the US and certain non-US dialects. In both cases, removing "dialectal" would be misleading. But maybe that's not the kind of case you have in mind. Andrew Sheedy (talk) 06:16, 19 September 2023 (UTC)
@Andrew Sheedy It was cases somewhat like that I was thinking of, but often people write specifically {{lb|en|UK|_|regional}} to produce (UK regional) without any comma; in that case I'm pretty sure it's safe to remove the regional (or dialectal) tag. Benwing2 (talk) 07:07, 19 September 2023 (UTC)
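
To make the pattern concrete, here is a minimal Python sketch that only finds such labels so a human can review them; it deliberately does not remove anything. It assumes plain {{lb|...}} calls with no nested templates, and the function name is hypothetical.

```python
import re

# Matches simple {{lb|lang|label|...}} calls without nested templates.
LB_RE = re.compile(r"\{\{lb\|([^{}]*)\}\}")
VAGUE = {"regional", "dialectal"}


def find_joined_vague_labels(wikitext):
    """Return (lang, toponym, vague_label) for patterns like UK|_|regional."""
    hits = []
    for match in LB_RE.finditer(wikitext):
        parts = match.group(1).split("|")
        lang, labels = parts[0], parts[1:]
        for i in range(2, len(labels)):
            joined = labels[i - 1] == "_"
            previous = labels[i - 2]
            if labels[i] in VAGUE and joined and previous not in VAGUE and previous != "_":
                hits.append((lang, previous, labels[i]))
    return hits


print(find_joined_vague_labels("# {{lb|en|UK|_|regional}} A small stream."))
print(find_joined_vague_labels("# {{lb|en|US|dialectal}} A small stream."))
```

A comma-separated pair such as {{lb|en|US|dialectal}} is intentionally not reported, since there the second label may be adding information rather than repeating the first.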
I see. But couldn't that mean "regional within the UK"? I'm still not sure that this can be automated. I think people are too inconsistent with how they use the labels. Andrew Sheedy (talk) 14:37, 19 September 2023 (UTC)
I see that -sche made the same point below. Andrew Sheedy (talk) 14:38, 19 September 2023 (UTC)
I share the concern that the distinction between Category:Regional English and Category:English dialectal terms is not maintainable in English (and possibly not even coherent; certainly, people talk about words as pertaining to the "dialect" of e.g. Indiana even if the words are merely specific to the region and the words' users all have a GenAm accent). I am not sure whether they should be merged for all languages (maybe! but it seems like it might spark more debate than just making the labels categorize the same way for en). I had been going through Category:Regional English and slowly checking and moving them to "dialectal" by hand. I interpret at least some uses of {{lb|en|UK|_|regional}} as meaning something is regional within the UK (not UK-wide), so in general it might be a better idea to have humans try to replace the dumb vague labels with more specific labels ("regional"→"Cornwall"), rather than having bots remove them, even as much as I like the idea of not making humans do work. For any languages for which we keep {{lb|en|regional}}, I wonder if we should use the same naming scheme as Category:English dialectal terms, i.e. "Foobar regional terms" rather than "Regional Foobar". It does seem like "Regional English"/"Varieties of English"/any other name we could realistically give the category would represent a different naming scheme than Category:English dialectal terms, so having either one as a subcategory of the other feels slightly awkward, like having "en:Biology" as a subcategory of "English scientific terms" or something; I'm not sure how to resolve that awkwardness, or whether anyone else minds it to begin with. - -sche (discuss) 08:14, 19 September 2023 (UTC)
@-sche I am fine with 'English regional terms' in place of 'Regional English'. I think the adoption of 'Regional English' was meant to parallel the subcategories, e.g. American English, British English, Northumbrian English, Durham English, etc., but there's no specific reason they need to be parallel. I still think we need Category:Varieties of English *above* whatever we name the overall regional category, so we can have a place to capture non-regional varieties of English as well. For now maybe we can merge the renamed Category:Regional English and Category:English dialectal terms without merging them for other languages and without changing the labels themselves, in case there is some info captured in the use of 'regional' vs. 'dialectal'. I understand your concern about {{lb|en|UK|_|regional}} possibly indicating regional within the UK, although I'm not sure how you'd figure that out on a case-by-case basis without significant research. I think we need to come up with some more specific guidelines (not necessarily binding policy) on how to use labels like 'regional' and 'dialectal' vs. toponym labels, so that at least some people use them consistently. As for the awkwardness of different naming schemes, maybe we shouldn't worry about it; internally they're all handled by the poscatboiler system so the breadcrumbs will work out correctly. Benwing2 (talk) 08:46, 19 September 2023 (UTC)
Good point, I think "Regional Foobar" works well as a name for a category that contains "Category:American Foobar", "Category:Canadian Foobar", etc. I think "Foobar regional terms" would work well for a category containing individual terms that are tagged {{lb|foo|regional}}, but I guess "Regional Foobar" also works as the name for a category containing such entries, since we put American Foobar entries in "American Foobar"; I withdraw my suggestion about "Foobar regional terms". Re "significant research": alas, that's what may be needed, I'm not sure there's a way to unstupidify the category by bot — I mean, we could bot replace {{lb|en|regional}} with {{lb|en|dialectal}} (or just change how that label displays/categorizes, whether in general or just when lang = en), that'd be fine, but the issue of "there's a vague label and it's not clear what it means, it should be replaced by labels that spell out what regions or dialects" probably does need a human to fix. I would support any guidelines that would discourage people from adding "it's regional! ;)" or "it's dialectal!", in favour of saying what dialects. (Ideally I think we would one day get rid of the vague "regional" and "dialectal" labels entirely in favour of just specifying where, but that may not be feasible in the near term.) - -sche (discuss) 16:58, 19 September 2023 (UTC)
The problem on your last point is that a lot of sources (like the OED) do just use a "dialectal" label and it's often difficult to attest and determine specific dialects. Reasonably often dictionaries say a term is "obsolete or dialectal", and it's easy enough to attest the obsolete usage but dialectal survival is then taken on faith with the OED (or similar source) as a reference. Of course in that case it might be preferable to ignore the claim about dialectal survival unless it can actually be attested before adding that label. —Al-Muqanna المقنع (talk) 18:56, 19 September 2023 (UTC)
Yeah. Specificity is an ideal to strive for, not necessarily something we can do right away with the resources we've got. Wright's old English Dialect Dictionary sometimes labels things "general dialectal" meaning it's not actually restricted to any particular dialect(s) or English-speaking countries he covers but he ... evidently felt obliged, given the scope of his work, to say it was dialectal and not just colloquial, heh. - -sche (discuss) 01:27, 25 September 2023 (UTC)
@Benwing2, -sche: I think the best name would be "sub-lects", which would make clear that they're not languages in their own right, but leave open whether they're divided by space, time, social/ethnic/political/religious differences, or some combination. Chuck Entz (talk) 02:55, 25 September 2023 (UTC)

Proposed bot job: importing OED pronunciations

In the new OED website, IPA pronunciation information is freely available, so it wouldn't be very difficult to import them. For example, compare unconventional (missing a pronunciation section) and the OED entry. I don't think pronunciation transcriptions are protected by copyright. Ioaxxere (talk) 00:25, 19 September 2023 (UTC)

@Ioaxxere I am not sure of that. Before doing something like scraping the OED, we need to be really sure we're not going to be in violation of copyright. Benwing2 (talk) 02:40, 19 September 2023 (UTC)
We would likely be okay to import US pronunciation guides, as pronunciations are not creative expressions of an idea (the doctrine of merger would apply) and therefore cannot be subject to copyright. But the database right would be a likely hurdle in terms of UK works like the OED. This, that and the other (talk) 05:27, 19 September 2023 (UTC)
Just to note that we don’t transcribe in the same way as the OED. For example, we prefer /æ/ to /a/, /ɹ/ to /r/, /aɪ/ to /ʌɪ/, and so on. — Sgconlaw (talk) 05:36, 19 September 2023 (UTC)
In principle this could probably be fixed automatically though I also think the legality is questionable. Database right is recognised in the UK and not the US, so Wiktionary as a US-based interest could arguably ignore it, but WMF projects (with some exceptions) have tended to prefer to conform to all jurisdictions concerned and not just the US. —Al-Muqanna المقنع (talk) 11:39, 19 September 2023 (UTC)
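
As a sketch of what such an automatic fix might look like, the following Python covers only the three differences listed above (/a/ to /æ/, /r/ to /ɹ/, /ʌɪ/ to /aɪ/); any other differences would need their own rules. The function name is hypothetical, the input strings are illustrative, and none of this bears on the copyright question.

```python
import re


def oed_to_wiktionary(ipa: str) -> str:
    """Apply the three substitutions mentioned in the discussion."""
    # /a/ -> /æ/, leaving the diphthongs /aɪ/ and /aʊ/ untouched.
    ipa = re.sub(r"a(?![ɪʊ])", "æ", ipa)
    # /ʌɪ/ -> /aɪ/ (done after the step above so the new "a" is not changed).
    ipa = ipa.replace("ʌɪ", "aɪ")
    # /r/ -> /ɹ/
    ipa = ipa.replace("r", "ɹ")
    return ipa


for transcription in ("/kat/", "/rʌɪt/"):
    print(transcription, "->", oed_to_wiktionary(transcription))
```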
@Ioaxxere: How old is this data? Database protection only lasts for 15 years. --RichardW57m (talk) 15:19, 19 September 2023 (UTC)
@RichardW57m: Rights are renewed on substantive modification (the exact wording of the legislation is "any substantial change to the contents of a database, including a substantial change resulting from the accumulation of successive additions, deletions or alterations") and the OED Online is in principle constantly adding entries so there's probably no expiry date unfortunately. —Al-Muqanna المقنع (talk) 15:23, 19 September 2023 (UTC)
  • Worth bearing in mind that the OED pronunciation differs in some subtle ways from our own. E.g. in UK pronunciations it uses /a/ rather than /æ/ for the TRAP vowel – which actually I think is more accurate now but it's not what our own guide recommends. Edit, Sgconlaw and I were apparently saying exactly the same thing at the same time! Ƿidsiþ 05:40, 19 September 2023 (UTC)
    Yeah! *High five*. — Sgconlaw (talk) 05:56, 19 September 2023 (UTC)
    I wonder if we should have a new discussion about /a/. I recall from the rather small 2014 vote(!) that it was UK editors saying "nobody says /æ/, we say /a/" and non-UK and sometimes non-native-English-speaking editors saying "noo, the UK pronunciation is /æ/", and I think we now have a few more active UK editors than last time, plus some of the non-UK editors are no longer active, so maybe it'd work to switch... - -sche (discuss) 07:50, 19 September 2023 (UTC)
    @-sche As a non-UK speaker I can attest that UK short ă sounds not as high and fronted as my /æ/ in words like cat (although only slightly different; the major difference is before /n/ and /m/). So maybe /a/ = low front vowel is appropriate. I wonder if a lot of the people insisting that UK has /æ/ are thinking of /a/ as low and central (as it's often used, in languages such as Spanish) rather than low and front. Benwing2 (talk) 08:50, 19 September 2023 (UTC)
    I don't much care whether we settle on /a/ or /æ/ for the trap vowel, since phonemic transcriptions are little more than algebraic constants anyway, but what I don't want is to have one for RP and another for GenAm since there's no reliable, consistent difference between the two accents' pronunciation of the vowel in question. (Listen to the audio file of a British speaker pronouncing chav and an American speaker—me—pronouncing chavs, and you'll hear the vowels are essentially identical.) Some Brits have a more open trap vowel, and some have a more close one. Some Americans have /æ/ raising in some contexts, but others (e.g. Californians) have a very open vowel. Then there's the issue of the label "RP". UK editors may well say "nobody says /æ/, we say /a/", but they would probably also say that almost no one speaks RP anymore either. A lot of entries use the label {{a|UK}} rather than {{a|RP}}, but that's even worse since there are dozens if not hundreds of different accents in the UK. Even if most Brits do pronounce cat [kat], that doesn't mean that's the RP pronunciation. So if we want to use the transcription /kat/, what accent label should we use for it? —Mahāgaja · talk 08:52, 19 September 2023 (UTC)
    Agreed 100%. It's true in practice that the standard realisation in Southern England is more or less [a] but I don't see a convincing reason to move /æ/ → /a/, and especially not if it's only going to be for BrE. It's a spectrum. —Al-Muqanna المقنع (talk) 11:35, 19 September 2023 (UTC)
    As a BrEng speaker who speaks in RP (more-or-less), I agree that there's no phonemic difference: they're just allophones. If I had to say there was any difference, /æ/ is probably more likely when the syllable is stressed. Theknightwho (talk) 02:29, 26 September 2023 (UTC)

splitting "Undetermined" language

@-sche Can you help me think this through? The "Undetermined" language is defined as "This language contains terms in historical writing, whose meaning has not yet been determined by scholars." However, in many cases that's not at all what it's used for; it's rather a yucky wastebasket for all sorts of crap. For example, it's used for Idiom Neutral, which is defined as an etymology-only variant of und, when in reality it has terms with totally known meanings. It's also used for the parent language of Xiongnu, Turduli, Weyto and all the substrates, which are defined as etymology-only languages, and are better viewed as unclassified languages. It's also used for the 2,455 terms in Category:Undetermined language links, where generally the terms themselves are of known meaning but the language they belong to either isn't clear or isn't in Wiktionary. An example of the former is {{m|und|chicolātl}} in chocolate, where it isn't clear what language it was in, and an example of the latter is "From Proto-Crow-Hidatsa {{m|und|*cí•ta}}" in chíisa, where it's simply that we have no code for Proto-Crow-Hidatsa. It sounds like to be completely thorough we'd need an "Unclassified" language (for substrates, Weyto, Xiongnu and such), an "Unknown" language (for terms like chicolātl) and an "Uncoded" language (for Proto-Crow-Hidatsa, Idiom Neutral and the like). If we do this, however, it would take significant work to go through the 2,455 terms deciding which ones are "Unknown" and which ones "Uncoded". So either we could combine Unknown and Uncoded into something (Misc?) or we could keep them separate and just convert all the und links to Unknown, converting them manually over time to Uncoded in the appropriate cases. Benwing2 (talk) 03:45, 19 September 2023 (UTC)

Ha, what we now use Module:etymology languages/data and "etymology-only" codes for has finally diverged so much from what they were originally for that languages-only-mentionable-in-etymologies no longer fit well in their titular module! (That's fine; things change.) The codes for Turduli etc (as well as e.g. Kassite in Module:languages) are — ISO-validly, AFAIK — using und as their family prefix a la gmw-cfr, and are in the "etymology" module because substrates are only suitable for mention in etymologies (and at the time it was added, Turduli was also felt to be that way, although it's apparently sparsely attested, so should perhaps be promoted to Module:languages a la Kassite). Whoever set them as children of a parent language und must've thought every language in the module had to be the child of a parent language, but I don't think it makes sense to consider e.g. the substrates to have a parent language. What are the benefits to considering them to have a parent called "Unclassified" vs no parent, apart from allowing the module to require everything to have a parent? Maybe that's reason enough to do it, or maybe we should just move the substrates to Module:languages and let Module:etymology languages/data be what it has become, a module for "things that are varieties of other things" instead of a module for "languages only suitable for mentioning in etymologies"? (In which case, rename it "Module:language varieties/data" or something?)
"Idiom Neutral" is a mess; I would guess that the reason it's entered in the "etymology" module is that somebody thought that was where "not allowed in mainspace" conlangs go, but this seems wrong: I think the code can go in Module:languages/data/exceptional like Black Speech, Bolak etc, even as the entries continue to be confined to appendixspace. And it should be renamed "art-" not "und-", no?
Some 'uncoded' languages should just be coded; I'm not sure what to do about cases where we may feel something like e.g. "pre-Proto-Algonquian" or "post-Proto-Algic" is too 'fine-grained' to give a code (and I'm not sure which of those boats Proto-Crow-Hidatsa is in), but maybe we should just not use {{m}} in that case, maybe we just use manual italics etc? The ISO does have mis (short for "miscellaneous") for uncoded languages... - -sche (discuss) 07:44, 19 September 2023 (UTC)
@-sche OK. I do believe we should rename etymology-only languages to "language varieties" (or maybe "lects") since that's what they've become (as does User:Theknightwho; I think it's just a matter of doing the technical work to make this happen). I think actually the substrate etym languages used to have the family qfa-sub as their parent, but User:Theknightwho wanted all etym languages to have a language as a parent rather than a family in order to simplify the code (which makes a fair amount of sense IMO; the etym languages are now handled through inheritance, and that would get all messed up if some etym languages had no parent or had a family as a parent). Maybe though we can create a placeholder substrate (or "unclassified") full language rather than using und as the parent (and potentially use it also for terms in unknown languages aka {{m|qnc|chicolātl}} where I've arbitrarily chosen qnc for Unclassified since made-up codes must begin qa through qt). Making the substrates full languages is also a possibility, although they won't necessarily have any lemmas. As for uncoded languages, if there's a code for that, IMO we should definitely use it instead of und. As for Idiom Neutral, yeah it should definitely be made an appendix-only full language; I have no idea why it is structured as it is currently. Benwing2 (talk) 09:03, 19 September 2023 (UTC)
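For reference, the "qa through qt" constraint on made-up codes corresponds to the ISO 639-2 range qaa to qtz reserved for local use. A minimal sketch of a check for it follows; the function name is hypothetical, and treating only the part before the first hyphen (as in qfa-sub) is an assumption about how such codes are structured here.

```python
def is_private_use_code(code: str) -> bool:
    """Return True if the first subtag of `code` lies in the range qaa-qtz."""
    prefix = code.split("-", 1)[0]  # "qfa-sub" -> "qfa"
    return (
        len(prefix) == 3
        and prefix.isascii()
        and prefix.isalpha()
        and prefix.islower()
        and prefix[0] == "q"
        and "a" <= prefix[1] <= "t"
    )


for candidate in ("qnc", "qfa-sub", "und", "mis", "qua"):
    print(candidate, is_private_use_code(candidate))
```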
@Benwing2: Please don't break fragment identifiers if you change the codes. The gadgets or whatever for requests use language codes, and I think some people have explicitly used the fragment identifiers generated by {{senseid}}. I have linked at least one RfV under language code und because the language was debatable. --RichardW57m (talk) 10:11, 19 September 2023 (UTC)
Let's see, chicolātl we could classify as nah / azc-nah since the issue is just that it's not clear exactly which Nahuatl lect(s) it's a word in, but it's known that the lect is Nahuatl. Likewise, the Finnish entries seem like they should either be allowed to use {{m|gem|foo}} to create an unlinked italicized mention of foo that's tagged as gem, or else they should use manual italics, since putting them in {{m|und||intentionally unlinked}} is doing nothing useful AFAICT. The Bontoc term in Reconstruction:Proto-Austronesian/bahi should just use one of the ISO codes Bontoc has; it being tagged "und" is silliness.
For Turduli, let's make it a "full" language code like Kassite (it's apparently sparsely attested) and the issue of it being subsumed under "parent = und" goes away. (If we also want to avoid having Turduli, Kassite, etc prefixed with "und", we could come up with a general-purpose ISO PUA family code like "qnf" ("nonclassified / noncoded family") or something to prefix Kassite, Turduli etc with, a la the substrates being "qsb-". For "nonclassified language", do we need a code qnc or would they always be able to use mis?) Idiom Neutral should be "art-" and a "full" (Module:languages) code. And we could certainly make some code for substrates to have as their parent (or just mis?). The link in brême looks like it's supposed to be gmw-pro or is intended to be tagging a term as belonging to a family a la the Nahuatl and Finnish/Germanic examples. I wonder how much of Category:Undetermined language links is silliness like that which could be cleaned up. - -sche (discuss) 17:41, 19 September 2023 (UTC)
@-sche Probably a lot, in response to your last sentence. I think allowing family codes in links is a good idea; e.g. all of the "Old Iranian" and "Middle Iranian" terms that User:Vahagn Petrosyan likes in etymologies (grrr ...) use 'und', and if we allow families, they could use the "Old Iranian" and "Middle Iranian" family codes. For the rest, I will get to work. Benwing2 (talk) 18:48, 19 September 2023 (UTC)
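As a rough aid for that cleanup, here is a sketch that pulls the und mentions out of a page's wikitext and separates linked ones such as {{m|und|*cí•ta}} from deliberately unlinked ones such as {{m|und||*foo}}. The function name is hypothetical, and the pattern is an assumption: it only handles {{m}} and {{l}} with positional parameters and no nested templates.

```python
import re

# Matches {{m|und|...}} and {{l|und|...}} without nested braces (a simplification).
UND_LINK = re.compile(r"\{\{(?:m|l)\|und\|([^{}]*)\}\}")


def und_mentions(wikitext: str) -> list[tuple[str, bool]]:
    """Return (shown text, is_linked) for each und mention found in wikitext."""
    results = []
    for match in UND_LINK.finditer(wikitext):
        params = match.group(1).split("|")
        term = params[0]                                 # link target
        display = params[1] if len(params) > 1 else ""   # alternative display text
        results.append((term or display, term != ""))
    return results


sample = "From Proto-Crow-Hidatsa {{m|und|*cí•ta}}; compare Old Iranian {{m|und||*foo}}."
print(und_mentions(sample))
```

Splitting the results this way would make it easier to see how many of the links are placeholders that could use a family code or plain italics instead.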
@Benwing2 fwiw, I took a look at the Slavic-language pages in Cyrillic w/ undefined links - there's a small handful of them. The most common use case is word forms that reflect an earlier stage of the language before it was written down. Chernorizets (talk) 05:37, 21 September 2023 (UTC)
@Chernorizets Thanks. By earlier stage you mean something like Early Proto-Slavic, where e.g. Proto-Slavic *noťь would be *nakti? I think those could reasonably use the family code 'sla' once support for family codes in links is added. Benwing2 (talk) 06:00, 21 September 2023 (UTC)
On second thought, I wonder whether we should let {{m}} etc use family codes, or whether that's a can of worms as far as people sloppily mentioning Proto-Germanic (or German, or some random thing they invented) as "Germanic", etc: would we be better off just liberally recognizing as etymology languages any of these which we find ourselves needing, i.e. allow "Old Iranian" as an etymology-language (parent Proto-Iranian), "Early Proto-Slavic" (parent Proto-Slavic), "pre-Proto-Algonquian" (parent Proto-Algic), etc? I'm not sure; what do you think? (If the issue with Old e.g. Iranian or Early Proto-Slavic is that we don't want to generate links to Reconstruction:Old Iranian/foobar or whatever (?) because for some reason we've decided we want to routinely cite it as a thing that exists and yet never have pages for it, then maybe we (a) assign them some value that tells link templates not to generate a link or (b) require people to always {{m|en||suppress}} the link themselves?) - -sche (discuss) 06:35, 21 September 2023 (UTC)
@Benwing2 the cases I saw would rather belong to something like dialectal Common Slavic, where certain sound changes had different outcomes across the Slavic-speaking area, such as liquid metathesis in South and West Slavic and pleophony in East Slavic (more info). TBH I don't see the value of using {{m}} vs plain old italic in these cases. They just show an extra step between Proto-Slavic as reconstructed and the eventual attested language form, but I'm not bought on the benefit esp. when it pertains to predictable outcomes of sound changes. I'd rather have an appendix like "About Slavic languages" that summarizes some of those changes. Chernorizets (talk) 08:26, 21 September 2023 (UTC)
@-sche@Chernorizets There are various advantages to having a template like {{m}}: (1) it wraps the mention in the appropriate "mention" CSS, which (theoretically) allows a user to customize the display to something other than italics; (2) if the text is in a non-Latin script, it wraps it in the appropriate CSS to make sure the script is displayed with the right fonts; (3) it allows for {{m+}} to mention the language or family name. For example, since "Old Iranian" and "Middle Iranian" are etymology-only families rather than etymology-only languages, they are allowed as the source in {{bor}} and similar templates but not {{m}}, with the result that mentions of these terms typically write Old Iranian {{m|und||*foo}}, which is non-ideal. Yes, if we allow families in {{m}} then people will sometimes use it to put borrowing or inheritance from "Germanic" terms that are often garbled, but this already happens (e.g. in names with etymologies sourced to popular baby-name sites and such), it's just done using raw italics or {{m|und}}. Having a family code will make it easier to track such usages and (eventually) correct them. We can make it so that if a family is given, no link is provided by default. Benwing2 (talk) 20:01, 21 September 2023 (UTC)
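To make the proposed default concrete, here is a sketch of the "no link when a family code is given" behaviour. Everything in it is hypothetical: the function name, the two-entry family set and the HTML are illustrations only, not the output of the actual {{m}} template or its modules.

```python
from html import escape
from urllib.parse import quote

# Illustrative subset; the real list of family codes lives in Wiktionary's data modules.
FAMILY_CODES = {"sla", "gem"}


def render_mention(code: str, term: str) -> str:
    """Wrap term in mention markup; link it only when code is not a family."""
    text = f'<i class="mention" lang="{escape(code)}">{escape(term)}</i>'
    if code in FAMILY_CODES:
        return text  # still tagged and styled, but unlinked by default
    # Simplification: real links use the language name, not the code, as the anchor.
    return f'<a href="/wiki/{quote(term)}#{escape(code)}">{text}</a>'


print(render_mention("sla", "*nakti"))
print(render_mention("en", "cat"))
```

The family case keeps the mention CSS and the language tag, so such usages stay trackable even though they produce no link.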

Should terms with borrowed parts be in a borrowing category?

E.g., the OED states that “quartal is a borrowing from Latin, combined with an English element. Etymons: Latin quartus, -al suffix1.”; “alphabetical is a borrowing from Latin, combined with an English element. Etymons: Latin alphabeticus, -al suffix1.” Our entries, using “From {{suffix|en|quārtus|al|lang1=la}}.” and “From {{af|en|alphabēticus|-al|lang1=la}}.”, are not in a borrowing category. J3133 (talk) 09:30, 19 September 2023 (UTC)

I'd say if the entire word hasn't been borrowed, it shouldn't be in a borrowing category, only in a derived category. However, are we positive that quartal isn't simply borrowed directly from Latin quārtālis? That seems more likely. Alphabetical, on the other hand, probably really did have the -al added within English, since there's no *alphabeticālis in Latin. —Mahāgaja · talk 09:55, 19 September 2023 (UTC)
There are issues like adapted borrowings where a verb is borrowed and an infinitive ending is added so that the verb can conjugate. angażować comes to mind. Most etymological dictionaries don't mention the suffix, but it is used to derive other terms, and it's not the same as a regular infinitive suffix, which would be just -ć. Vininn126 (talk) 10:06, 19 September 2023 (UTC)
The OED is correct about quartal afaict, it is recent (largely 20th c. with very sporadic 19th c.) in English and doesn't correspond to the meaning in Medieval Latin, which beyond what's in our entry also seems to have been a measure of land and a dry measure. 16 pages deep on Google Books I can't find attestation in New Latin except when talking about Medieval Latin documents. —Al-Muqanna المقنع (talk) 10:44, 19 September 2023 (UTC)

Unpacking pronunciations

One of the principles proposed in Wiktionary:Votes/pl-2022-07/Stubifying_alternative_forms is that repetition between alternative forms be avoided as much as possible. I do not recall any opposition to the principle, but rather worries about how to express that goal in rigid rules. Thus, if all variants were pronounced the same, the pronunciation would be given once, at the main entry for the lemma. (I'm ignoring the complexities of inflection.)

Now one issue that arises in a massively polycentric language is that one may have a mix of spellings and pronunciations that in a sense all belong to the same lemma, but not all pairs of spelling and pronunciation are compatible. A theoretical complication is that the PoS headers all call the forms lemmas, though I'm not sure that invisibly labelling some of them as variants would actually help. Pronunciations have not been entered for all forms.

So, given a spelling and a locality or region, how is a user expected to determine the pronunciation if any recorded by Wiktionary? Now, with any luck, there may be some clear and adequate guide that I have overlooked, and a reference to it would answer this question.

I'm asking because it occurs to me that I may have to add {{rfp}} to some Mon entries to explicitly state that Wiktionary currently doesn't give the pronunciation. --RichardW57m (talk) 09:52, 19 September 2023 (UTC)

If the pronunciation of the two forms is the same, the stub form should be marked as an alternative spelling rather than an alternative form. Thadh (talk) 14:07, 19 September 2023 (UTC)
@Thadh: That may help in some cases, though what do I do if the pronunciation of the main forms differs by region but the pronunciation of the stub form doesn't? For example, Mon တ္ကံ seemingly has different pronunciations in Burma and Thailand, and the Burmese pronunciation can be spelt differently, as ကံ, and if we had an entry for that spelling from Thailand, it would be pronounced the same way as in Burma. I don't want to split the senses by region as {{alternative spelling of}} and {{alternative form of}}. The 'language' 'Thai Mon' is being merged back into Mon; the Wiktionary language split was never approved. --RichardW57m (talk) 16:28, 19 September 2023 (UTC)
You do the following: You label the pronunciation of တ္ကံ as {{a|Myanmar}} and {{a|Thailand}} (for Burmese and Thai pronunciations, respectively), you label the alt spelling as {{alt|mnw|ကံ||Myanmar}} (in the Alternative forms section) and you add the label template at the altspelling in the form of {{lb|mnw|Myanmar}} {{alternative spelling of|mnw|တ္ကံ}}. Thadh (talk) 16:42, 19 September 2023 (UTC)
My preference is to explicitly spell out when forms are pronounced identically to some other form, so that users know. I sometimes do this by spelling out "like foobar" manually (at other times I leave the work undone, so the entry doesn't actually have any indication that it's pronounced like some other entry, alas), although ideally we would make that a template. This is because it seems like not many people grasp the intended distinction between "alternative forms" and "alternative spellings", or maintain it (even I am guilty of not always maintaining it), and while it'd be ideal to eventually clean up all entries to maintain the distinction, I think that (if anything) having {{pronounced like}} will only help that goal (as then a bot can ensure all entries with that template are listed as "alternative spellings" and not "alternative forms"). FWIW I also use this approach when there's an entry like "Liu syndrome" or "node of Ranvier" where I think it's more sensible to just give the pronunciation of the unusual element and direct people to syndrome and of if they need to know the various ways of can be pronounced, rather than repeat those ways in the entry for every other English entry that contains of (or e.g. and). - -sche (discuss) 16:34, 19 September 2023 (UTC)
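A sketch of the kind of bot check described here, assuming the proposed template is literally named {{pronounced like}} (it does not exist yet, so the name and its parameters are hypothetical). It flags entries that claim an identical pronunciation yet are marked as alternative forms rather than alternative spellings.

```python
import re

# Hypothetical template proposed in this thread.
PRON_LIKE = re.compile(r"\{\{\s*pronounced like\s*\|")
# Assumption: only these two names are checked; other aliases are ignored.
ALT_FORM = re.compile(r"\{\{\s*(?:alternative form of|altform)\s*\|")


def needs_altspell(wikitext: str) -> bool:
    """True if the entry says 'pronounced like X' but is tagged as an alternative form."""
    return bool(PRON_LIKE.search(wikitext)) and bool(ALT_FORM.search(wikitext))


entry = (
    "===Pronunciation===\n{{pronounced like|en|foobar}}\n\n"
    "===Noun===\n# {{alternative form of|en|foobar}}"
)
print(needs_altspell(entry))
```

A real bot would need to work per language section and per etymology rather than on the whole page, which this sketch does not attempt.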
For Mon in particular, there is the worrisome phenomenon of a 'reading pronunciation' which we typically don't have information on. --RichardW57m (talk) 15:06, 22 September 2023 (UTC)
There's an example of the distinction at ခၞာခၟာဲစာၚ်, where one pronunciation is described as 'reading' and the other is called 'speaking'. --RichardW57 (talk) 13:56, 23 September 2023 (UTC)

Something (in the way she moves...)

We've had this discussion before, so apologies if I'm repeating. I collected a list of some entries with placeholder something in them and put them at Wiktionary:Todo/phrases not linked to from components/2022-09/something. I don't like them at all, as it seems like "something" is literally part of the phrase, like the phrases at something#Derived terms e.g. there must be something in the water. We need to fix them, possibly by putting brackets or quote marks around "something", or making the word "something" a different colour, or smallcaps. There needs to be a distinction between there must be something in the water and there must be (something) in the water. P. Sovjunk (talk) 17:34, 21 September 2023 (UTC)

By the way, the definition at there must be something in the water really sucks P. Sovjunk (talk) 17:35, 21 September 2023 (UTC)
And, of course, someone might need similar treatment. P. Sovjunk (talk) 17:37, 21 September 2023 (UTC)
And probably the somewhere bit at get one's butt somewhere P. Sovjunk (talk) 17:42, 21 September 2023 (UTC)
You'd have to be more specific to get something better than what I've just done. DCDuring (talk) 19:11, 21 September 2023 (UTC)
About a year ago I created a page at the great … in the sky, which we have now moved to Appendix:Snowclones/the great X in the sky. I've never been happy with it being considered a snowclone, because to me a snowclone is a phrase whose meaning is different when different words are plugged into it, whereas the intended meaning of the great ... in the sky is always the same. Whatever we choose to do with this new list of pages, I'd ask if the page I created can be grouped with them as well, and potentially moved back into the mainspace. Soap 17:44, 22 September 2023 (UTC)
great something in the sky (great (something) in the sky) would have fit better with our lemma format. A few redirects from the most popular or attestable instances of 'somethings' would help users find the lemma and prevent too many spurious entries. DCDuring (talk) 18:31, 22 September 2023 (UTC)

make T:altform vs T:altspell spell out intended relation to pronunciation?

The question of how best to indicate that one form foobar has the same pronunciation as another form fubar, without repeating all of the (potentially many) pronunciations in both entries (destined to fall out of sync and imply they're not pronounced the same), has come up before, e.g. last July and above at #Unpacking_pronunciations. It has been suggested that this should be indicated purely by the use of T:altspell if the pronunciation is identical vs T:altform if it differs. Such a distinction is not currently consistently maintained and nothing in the output of the templates themselves makes it clear, but what if we changed the templates (or made new templates) so that they did spell out that distinction? Have T:altspell spell out Alternative spelling of term, which it is pronounced identically to? And have T:altform say the opposite? Then everyone could see the intended distinction and correct any entries that use the 'wrong' templates. (This won't solve all the cases where it would be useful to indicate x is pronounced like y, since there are still things like an abbreviation sometimes being pronounced like a full phrase, with whatever choices of /ænd/ vs /æn/ vs /ənd/ vs /ən/ etc a speaker chooses to make.) - -sche (discuss) 18:47, 21 September 2023 (UTC)

Yes this sounds like a good idea. Although sometimes for alternate forms we just omit the pronunciation altogether, right? Soap 18:50, 21 September 2023 (UTC)
Yes...? but "we're intentionally withholding the pronunciation because it's identical to the linked lemma form" and "no-one has bothered to add the pronunciation yet" is not distinguishable to readers, unless we do something like this to spell it out. - -sche (discuss) 19:25, 21 September 2023 (UTC)
{{altform}} is a superset of {{altspell}} and can also mean “I don’t know whether it is only written differently or also pronounced distinctly.” I would have warned you in the discussion above if I had understood that you have essentialized the “intended distinction” between the two templates. Due to the quandaries of the historical pronunciation of our working language in particular, it is otiose to provide examples. Fay Freak (talk) 19:39, 21 September 2023 (UTC)
@-sche I am generally in favor of this. I think we can find some wording to address User:Fay Freak's concerns. Benwing2 (talk) 20:04, 21 September 2023 (UTC)

renaming set categories

Hi. I'd like to propose that we rename all set-type topic categories to make them clearly different from other topic categories, e.g. Category:en:Greek deities would become Category:en:set:Greek deities or something. Such categories could be added using {{set|en|Greek deities}} or {{S|en|Greek deities}} instead. ({{S}} is currently used for wikisource links but it's used on only 3 pages so it could easily be reused.) For those who are confused, there are two types of topic categories: "related-to" categories (topic categories per se, containing terms related to the category in question) and "set-type" categories (categories that are supposed to contain instances of the category in question rather than simply related terms). Category:en:Greek mythology is of the former type and Category:en:Greek deities is of the latter type. Often the distinction is apparent by the name (singular vs. plural) but not always; Category:en:Video games is a related-to category rather than a set-type category. The fact that both are in the same namespace leads to constant confusion between the two, and the result is that some set-type categories don't contain only instances of the category (e.g. Category:en:Moons is supposed to contain a list of moons but it actually contains random junk). Separating the namespaces would allow people to keep them straight. @Chuck Entz who I'm sure has opinions about this. Benwing2 (talk) 23:31, 21 September 2023 (UTC)
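A minimal sketch of the renaming itself, using the "set:" infix from the example above (Category:en:Greek deities to Category:en:set:Greek deities). The function name is hypothetical and the infix is only one of the spellings under discussion.

```python
def to_set_category(title: str) -> str:
    """Insert a 'set:' marker after the language code of a topic category title."""
    parts = title.split(":", 2)  # namespace, language code, category name
    if len(parts) != 3 or parts[0] != "Category":
        raise ValueError(f"not a language-prefixed category title: {title!r}")
    namespace, lang, name = parts
    if name.startswith("set:"):
        return title  # already renamed; keep the operation idempotent
    return f"{namespace}:{lang}:set:{name}"


print(to_set_category("Category:en:Greek deities"))
```

Making the function idempotent means a bot run could be repeated safely if it were interrupted partway through the move.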

Another thing to add is that some categories, per their description, are a mixture of set-type and related-to terms, e.g. Category:Artistic works and children, which are described as "{{{langname}}} names of and terms related to artistic works." These would be split into two. Benwing2 (talk) 06:27, 22 September 2023 (UTC)
Support. I suggest prefixing set categories with "Individual", so there would be Category:en:Video games and Category:en:Individual video games. I don't know if you considered this, but there's also a third type of category: "subset categories". For example, Category:en:Athletes, which contains types of athletes rather than individual people who are athletes. In some cases these are already separated, as in Category:en:Video game genres. Ioaxxere (talk) 12:41, 22 September 2023 (UTC)
I think there "are" more than two types of categories. I don't think the "confusion" is at all limited to those who haven't looked at these from your perspective. You have reason to believe that some subset of topical categories ought to be subdivided. What would be the advantage? Would it offset the additional complexity imposed on contributors? BTW, do we already have register-label (":reg:") and industry-label (":ind:") or occupation-label (":occ:") categories? (There are certainly more categories of categories that "need" to be complicated.) Should we have them, just to clarify things further, at least for my benefit. DCDuring (talk) 15:31, 22 September 2023 (UTC)
Support distinguishing set and topic categories. This has been discussed before here (and e.g. here). One idea is to not only name sets "CAT:en:set:Moons", but also name topics "CAT:en:topic:Moons" (lunar, mare, etc), for clarity in both directions. As others noted above, we may need to think about having more than just two types of categories, though: "red dwarf, brown dwarf, white dwarf,..." ("CAT:set:Types of stars"?) is one logical category, "Sun, Sol, Sirius, Antares, Aldebaran, Canopus, Phecda, ..." (Chinese: "織女, ...") is another ("CAT:set:Individual stars"? "CAT:list:Named stars"?), and "solar flare, stellar, ..." is another ("CAT:topic:Stars").
One difficulty: logically, the category containing the names of cities (Berlin, Paris, etc, as opposed to types of cities like "town, village, hamlet, commune, city, metropolis") should be named like the category containing names of specific roads or specific stars ... but naming the 'list of specific named stars' cat just "CAT:en:Stars" would entirely fail to distinguish it from a topic, so it should have a longer name ... but renaming Category:en:Cities in England to a long name like Category:en:set:Named cities in England seems less than ideal.
This is getting off-topic, but I would also love us to sort out the situation where we have 2+ completely separate top-level categories for Names, so that e.g. "Named roads" like London's The Mall, cities like London itself, and the personal name Sergei are in one top-level category, while personal names like Richard and Vadym are in a completely different top-level category. Checking just now, it seems that Doom, as the name of a specific video game, is in ... a third top-level category, and Aldebaran, as the name of a specific star, is in a fourth top-level category. 🙄 - -sche (discuss) 19:59, 22 September 2023 (UTC)
How about CAT:en:set:Names of cities in England? —Mahāgaja · talk 20:06, 22 September 2023 (UTC)
And then "CAT:en:set:Names of stars" for the corresponding set-category of stars? That'd work. (Or Ioaxxere's idea of "Individual cities in England".) I was just worrying some people would not like that such a name is longer for what seems like marginal gain (in that I hardly expect anyone to think "CAT:en:Cities in England" is a topic category, although it could be a set category for "royal borough, city, ..."). But I am prepared to accept greater length to get greater clarity. - -sche (discuss) 20:54, 22 September 2023 (UTC)
Maybe we can get away with using two prefixes (null and 'set:', or 'topic:' and 'set:', or 'rel:' for 'related to' and 'set:', or whatever), and adding the word "types" when we want to indicate what User:Ioaxxere calls subset categories. Hence 'city types in England' if we really want such a category. Benwing2 (talk) 22:20, 22 September 2023 (UTC)
Another idea I'm thinking of is to use "qualified" category names when needed, but unqualified names when there's only one reasonable interpretation and it's obvious. For example, we would have 'Types of stars' and 'Names of stars', but only 'Cities in England' rather than 'Names of cities in England' since it's pretty clear what 'Cities in England' refers to. This would require a bit of judgment but it might be the best way to resolve the tension between clarity and shortness of names. Topic categories can either have some similar qualification ('Terms related to stars'? that is kind of long though) or be distinguished by prefix. Benwing2 (talk) 22:38, 22 September 2023 (UTC)
@-sche @Ioaxxere One issue I'm running into as I look into this more is that the distinction between types and names isn't always obvious. For example, CAT:en:Musical instruments is defined as "{{{langname}}} names of musical instruments." but are the contents (e.g. Alpine horn, clarinet, clavichord, etc.) actually types? You could argue that "names" includes only proper names, e.g. the Hellier Stradivarius, but I think if we try to enforce this distinction strictly, people will get confused. Benwing2 (talk) 20:16, 23 September 2023 (UTC)
Hmm... what if instead of a "set:" prefix, we have
  1. "Category:en:Types of stars" (red dwarf, ...), "Category:en:Types of musical instruments" (Alpine horn, clarinet, ...) — or "Kinds of..."? — with the category boilerplate saying it's for "LANG terms for types of stars"
  2. "Category:en:Named stars" (Aldebaran, ...), with the category text explaining it's for the "LANG names of specific individual stars" and likewise "Category:en:Named musical instruments" if we add Hellier Stradivarius, etc
  3. "Category:en:Topic:Stars" (solar flare, stellar, etc) / *"Category:en:Topic:Musical instruments" (it may not make sense for every thing to have a category of every variety, so it may only make sense to have "Topic:Music" but not "Topic:Musical instruments")
... and then we potentially just accept that a few specific varieties of "name" category will be named differently (e.g. placenames like "Cities in England" could continue to be named like that, even if we also add a "Types of cities" category and a "Topic:Cities" category; and the two completely different, non-intersecting "male given names" categories Nikolas vs Nicholas are in may continue to both be named "male given names").
Good idea, bad idea, improvable idea? - -sche (discuss) 21:24, 23 September 2023 (UTC)
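As a rough illustration of the three-way naming scheme floated above, here is a minimal Python sketch that sorts a category title into "type", "name" or "topic" by its leading words. The function name and the "unqualified" bucket are assumptions made for this example only; the sketch does not try to handle the exceptions mentioned above (placenames, given names), which would need their own rules.

```python
def classify_category(title: str) -> str:
    """Guess the kind of a category from its title, e.g. 'Category:en:Types of stars'."""
    parts = title.split(":", 2)  # namespace, language code, label
    if len(parts) != 3 or parts[0] != "Category":
        raise ValueError("expected 'Category:LANG:Label', got %r" % title)
    label = parts[2]
    if label.startswith(("Types of ", "Kinds of ")):
        return "type"
    if label.startswith("Named "):
        return "name"
    if label.startswith("Topic:"):
        return "topic"
    # No marker: under this scheme the reader cannot tell set from topic.
    return "unqualified"
```

Anything that lands in "unqualified" (for instance "Category:en:Cities in England") is exactly the case the rest of this thread is arguing about.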
@Benwing2 Types of musical instruments would be wind instrument, string instrument, etc. Stuff like drum is kind of a grey area though. Ioaxxere (talk) 00:50, 24 September 2023 (UTC)
@-sche This sounds good to me overall. My only comment would be that we can maybe be lax in naming when there's only one reasonable possibility (not just "name" categories); e.g. I don't see how Category:en:Microsoft or Category:en:Metallurgy could be anything but topic categories. Although maybe it doesn't matter so much for shorter names like this; a lot of the poscatboiler category names are longer. Benwing2 (talk) 01:11, 24 September 2023 (UTC)
My concern with having a marker be present only some of the time is that then category names are not predictable; people don't know whether to type in or look for "en:Topic:Metallurgy", or just "Metallurgy". Perhaps this could be mitigated by moving the only-sometimes-present marker to the end, "Category:en:Foobar (topic)", so it shows up when you start typing "en:Fooba..." and so scripts etc don't have to reckon with two possible prefixes, sometimes "Topic:" and sometimes null? (But "Stars (topic)" sounds kinda dumb to me. IDK, maybe it's fine.) Or perhaps, as you were saying earlier, we could get away with no prefix/marker for topics, if we rely on users to learn that sets always start off "Types of..." or "Named..." and so unprefixed things are always topics? ...except when they're not, like "Category:en:Cities", which is an unprefixed set—argh, this is tricky indeed. I guess we could let both topic and set disambiguation be optional, based on whether we think it could only be read as one or the other type of set, like "Cities", or as a topic, like "Metallurgy"; or whether it could be read as both, like "Stars" (in which case disambiguate all three types of category?). ...I will have to give this more thought... - -sche (discuss) 05:57, 24 September 2023 (UTC)
@-sche My proposal is similar to your last one. We could use hard category redirects from e.g. Category:en:Topic:Metallurgy to Category:en:Metallurgy in case users go looking for the fully-qualified names. Benwing2 (talk) 06:07, 24 September 2023 (UTC)
Yeah (and sorry, not trying to present your idea as mine) ... I'm just trying to think through, if we have a system for distinguishing them, but we only sometimes use the system, do we really have a system? Or will we perennially face the issue that some people add new categories like "Category:en:Tanks" for terms relating to the topic of tanks because they generalized from "that's how Category:en:War is named", while other people add "Category:pt:Tanks" as a set category because "that's how Category:pt:Towns is named"...? I am sympathetic to the goal of keeping shorter names where possible, but I wonder if, especially when we're talking about a categorization schema, it might be better to use systematic names consistently (systematically)? (I'm unsure.) As for category redirects, I'm not opposed, and I do think we should be freer in our use of redirects (for entries), but I do worry that in this situation, it'd mean we'd have perennial maintenance tasks to create such redirects any time a new category is created and to move entries that get categorized into redirects, no? (I guess those tasks are bottable, but someone has to run the bot... an approach that requires less maintenance might be advantageous, I dunno.) - -sche (discuss) 18:34, 24 September 2023 (UTC)
@-sche This is a good point; I didn't think of the impact on new categories. I think in that case your proposal of restricting unqualified categories to toponym categories makes sense. Benwing2 (talk) 21:58, 24 September 2023 (UTC)
I'm on board with distinguishing sets and topics, but I'm hesitant to go down another namespace level and make category names longer. Is there a way to accommodate a distinction under the existing categories? Ultimateria (talk) 22:17, 22 September 2023 (UTC)
@Ultimateria Can you clarify what you mean? The set categories are already distinguished by being placed under Category:List of sets (although a lot of categories are mis-specified), but this doesn't make it obvious which categories are set categories, and runs into problems when a topic and set category want to have the same name. We'd have to add some indication in the name to clarify which are set categories; 'set:' is only four characters. I think the shortest we could get is two characters ('s:' or 'S:' or similar) but I'm not sure how clear that would be. Benwing2 (talk) 22:22, 22 September 2023 (UTC)
I mean within e.g. CAT:en:Moons could we somehow sort the entries by whether they belong in the set or are related to the topic? The only way I know of to do that is by altering the alphabetization, which is not ideal. Ultimateria (talk) 23:08, 22 September 2023 (UTC)
@Ultimateria The bigger issue with this is that there's no way to know which entries are set entries and which ones topic entries unless they're in different categories. Benwing2 (talk) 04:43, 23 September 2023 (UTC)

@Ioaxxere (and anyone else), re your point that "Types of musical instruments" (etc) could itself be divided / mean either "wind instrument, string instrument, ...", or "guitar, piano, ...", or both ... (and I notice e.g. "Types of tanks" could be "main battle tank, light tank, ..." or "Abrams, Tiger, Sherman ..." or both) ... do you think we should have separate categories for "wind instrument, string instrument, ..." vs "guitar, piano, ...", or can they be in one category? I'm trying to think how such categories could be named distinctly, and whether normal users would understand or maintain a distinction between them... and I'm leaning towards thinking that because "Types of..." categories will necessarily contain more than one 'layer' of type already — e.g. surely "woodwind instrument" should go in the same category as "wind instrument", even though "woodwind instrument" is a subtype of "wind instrument" — then also including "flute", "piccolo", "guitar", "piano" etc in the same category as "wind instrument", "string instrument" etc, even though "guitar" is a subtype of "string instrument", is probably fine...? - -sche (discuss) 18:34, 24 September 2023 (UTC)

@-sche Yeah I think it's fine. I think people will have some difficulty even distinguishing names from types; Sherman might well be interpreted as a name since it's capitalized, same for brands like 747's. We will probably need some good explanatory text indicating that "types of X" can include more and less specific types (with examples), brands, etc. Benwing2 (talk) 22:02, 24 September 2023 (UTC)

Beautify etymology sections

Note: I'm writing this from the perspective of an English editor so all of these proposals will apply to English only unless there's consensus otherwise.

Assume that the example wikitext is the only text in the Etymology section. Complex etymologies naturally have to be handled on a case-by-case basis. Ioaxxere (talk) 04:18, 22 September 2023 (UTC)

Proposal 1

{{XYZ|en|params}} to From {{XYZ|en|params}}.

Where XYZ represents {{prefix}}, {{suffix}}, and {{affix}}.

Bottable: Yes
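
A minimal Python sketch of how Proposal 1 might be applied, assuming (as stated above) that the template is the only text in the Etymology section. The function name and the regular expression are illustrative assumptions, not the script anyone actually runs.

```python
import re

# Template names listed in Proposal 1.
AFFIX_TEMPLATES = ("prefix", "suffix", "affix")

# A section body that is exactly one English affix-type template and nothing else.
BARE_AFFIX = re.compile(
    r"^\{\{(?:%s)\|en\|[^{}]*\}\}$" % "|".join(AFFIX_TEMPLATES)
)

def apply_proposal_1(etymology: str) -> str:
    """Turn a bare '{{suffix|en|...}}' into 'From {{suffix|en|...}}.'."""
    text = etymology.strip()
    if BARE_AFFIX.match(text):
        return "From " + text + "."
    # Anything more complex is left for a human to handle case by case.
    return etymology
```

For example, `apply_proposal_1("{{suffix|en|hoax|able}}")` gives `From {{suffix|en|hoax|able}}.`, while text that already starts with "From" is returned unchanged.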

Proposal 2

some text {{XYZ|en|params}} to {{XYZ+|en|params}}

Where XYZ represents {{compound}}, {{bor}}, and {{inh}} and "some text" represents boilerplate text like "From", "Borrowed from", "Inherited from", etc.

Bottable: Yes
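
A minimal Python sketch of the Proposal 2 conversion for {{bor}} and {{inh}}, assuming the boilerplate sits at the very start of the section. The table of phrases absorbed by each plus template is an assumption for illustration; {{compound}} is left out because its boilerplate wording is not given here.

```python
# Boilerplate assumed to be absorbed by each plus template (illustrative only).
BOILERPLATE = {
    "bor": ("Borrowed from", "From"),
    "inh": ("Inherited from", "From"),
}

def apply_proposal_2(etymology: str) -> str:
    """Replace leading 'Borrowed from {{bor|en|...' with '{{bor+|en|...'."""
    for name, phrases in BOILERPLATE.items():
        for phrase in phrases:
            prefix = "%s {{%s|en|" % (phrase, name)
            if etymology.startswith(prefix):
                return "{{%s+|en|" % name + etymology[len(prefix):]
    # Different wording or a different template: leave untouched.
    return etymology
```

Because only a leading match is rewritten, an etymology such as "Perhaps from {{bor|en|...}}" is left alone rather than having its hedge silently dropped.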

Proposal 3

Add periods whenever possible. (overlaps with Proposal 1)

Bottable: Partially
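
A minimal Python sketch of the bottable part of Proposal 3: add a period to a line that begins with "From " and does not already end with one. The `self_punctuating` parameter is a hypothetical hook for templates that print their own final period; no real template names are assumed.

```python
import re

# The template (if any) that ends the line; group 1 is its name.
TRAILING_TEMPLATE = re.compile(r"\{\{([^{}|]+)[^{}]*\}\}$")

def add_final_period(line: str, self_punctuating=frozenset()) -> str:
    """Append '.' to a 'From ...' etymology line that lacks one."""
    text = line.rstrip()
    if not text.startswith("From ") or text.endswith("."):
        return line
    m = TRAILING_TEMPLATE.search(text)
    if m and m.group(1).strip() in self_punctuating:
        # The final template is assumed to supply its own period.
        return line
    return text + "."
```

Lines that do not start with "From " are skipped entirely, which is one reason the proposal is only partially bottable.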

Discussion

  1. Support all. Ioaxxere (talk) 04:19, 22 September 2023 (UTC)
    @Ioaxxere I already have a script that implements Proposal 3, which I have run over all languages. Specifically, it looks for etymology-section sentences beginning with "From " and not ending with a period, with complex logic to handle various templates at the end of the line that auto-include a period. Proposal 2 I have implemented for various languages, but some editors in some languages don't like the plus templates for whatever reason, so I haven't done it for all languages. Proposal 1 I've also done for various languages but not globally. Generally I Support these changes for all languages. Benwing2 (talk) 06:24, 22 September 2023 (UTC)
    Also, when I've done Proposal 1 I've included {{compound}}, {{confix}} and the abbreviated forms ({{pre}}, {{suf}}, {{af}}, {{com}}, {{con}}). Benwing2 (talk) 06:25, 22 September 2023 (UTC)
  2. Support Abstain proposals 1 and 3 for all languages, Oppose proposal 2 for all languages. Thadh (talk) 06:29, 22 September 2023 (UTC)
    Changed to abstain; Convinced by a few arguments below, but I'm fine either way. Thadh (talk) 22:05, 22 September 2023 (UTC)
  3. Abstain on proposals 1 and 3; Oppose proposal 2. The {{bor+}}/{{inh+}} family of templates were created against consensus, are insulting to readers, and ought to be avoided in all cases. —Mahāgaja · talk 06:46, 22 September 2023 (UTC)
    The amount of casual folks I’ve talked to who ask about the difference makes me feel that it’s actually meaningful. Idk why it’d be insulting, and frankly, even if you don’t want the template used, you shouldn’t act like this is a universal thing that’s known :-/ Borrowing + inheritance is not a topic that’s universally known. It’s not fair to compare it to linking “from” as that’s a basic word. Our dictionary is not only used by linguists, and we should remember that and not be condescending towards those that might need links like those. AG202 (talk) 03:25, 23 September 2023 (UTC)
  4. Support for English. Comment: Benwing has a script that takes etymology sections and can change things like {{pre|en|foo|bar}} and add "From ." around it. Also I have to just straight up disagree with Mahagaja, respectfully, on the plus templates. Vininn126 (talk) 06:49, 22 September 2023 (UTC)
    @Vininn126 Yes, I have a very hard time understanding why these templates are possibly "insulting". All they do is templatify the text "Borrowed from"/"Inherited from" and link the terms "borrowed" and "inherited". We regularly link technical jargon to the glossary in headwords and elsewhere. Benwing2 (talk) 06:57, 22 September 2023 (UTC)
    Also there is no rule that says you need a vote to create templates. Benwing2 (talk) 06:59, 22 September 2023 (UTC)
    It does if it's circumventing another vote. --{{victar|talk}} 21:13, 22 September 2023 (UTC)
    @Benwing2 They are also incredibly useful for certain language families such as Romance where there are different ways a Latin word can enter a language. Vininn126 (talk) 07:00, 22 September 2023 (UTC)
    I do agree with Mahagaja completely. Some jargon is so straightforward that spelling it out is demeaning to the reader. Thadh (talk) 11:28, 22 September 2023 (UTC)
    Mmm, yes, the highly technical term "borrowed". We should also create {{af+}} so it produces ]. --{{victar|talk}} 21:13, 22 September 2023 (UTC)
  5. Oppose all three. 1 and 3 add absolutely nothing of value to e.g. simple derivations or compound words, and plus templates should not be the default for all languages. — SURJECTION / T / C / L / 08:56, 22 September 2023 (UTC)
  6. Abstain on proposal 1, Oppose proposal 2, weak support proposal 3. 1 adds nothing meaningful - to me there's not really a difference; for 2, it is stupid to link basic words every time - also this has become a perennial proposal, and definitely needs more discussion on its own (as in separate, instead of being lumped with other proposals like this); 3 should be (and has been regularly) done because the etymology section contains sentences which end in a full stop. – wpi (talk) 11:41, 22 September 2023 (UTC)
    The etymology section may be a fragment rather than a sentence. Adding a full stop to a terse etymology can be uglification rather than beautification. --RichardW57m (talk) 15:16, 22 September 2023 (UTC)
    Indeed that's correct, I should have specified "usually". For things like {{compound}} the full stop is probably unneeded, but it should be added when the etymology section contains a sentence, e.g. From {{der|en|foo|bar}} is ugly without the full stop. – wpi (talk) 16:10, 22 September 2023 (UTC)
    "From Foo bar" isn't a sentence, though; it's a prepositional phrase. —Mahāgaja · talk 20:07, 22 September 2023 (UTC)
  7. Support all. Andrew Sheedy (talk) 15:03, 22 September 2023 (UTC)
    Changing to Abstain for 2, given how controversial it seems to be.
    I might also add that since FL etymologies are formatted the same, I see no reason not to include them. Andrew Sheedy (talk) 17:07, 22 September 2023 (UTC)
  8. Support all. —Al-Muqanna المقنع (talk) 16:17, 22 September 2023 (UTC)
  9. Support all. — Vorziblix (talk · contribs) 21:12, 22 September 2023 (UTC)
  10. Oppose all: not a valid vote. --{{victar|talk}} 21:13, 22 September 2023 (UTC)
    We don't need an entire vote for every bot job in every language. Ioaxxere (talk) 21:50, 22 September 2023 (UTC)
    Rule one with Victar: don't engage in discussion. Vininn126 (talk) 21:51, 22 September 2023 (UTC)
    Sorry, am I harshing your template boner? --{{victar|talk}} 22:06, 22 September 2023 (UTC)
    No, you're just unbearable. Vininn126 (talk) 22:08, 22 September 2023 (UTC)
    💔 --{{victar|talk}} 22:13, 22 September 2023 (UTC)
    I don't think it's particularly valid either through an informal process when it concerns all languages (unless we really do make sure and overtly explicitly clear these apply to English only; anything else would be setting a precedent). — SURJECTION / T / C / L / 22:02, 22 September 2023 (UTC)
    Especially when it's attempting to overturn a previous vote. --{{victar|talk}} 22:06, 22 September 2023 (UTC)
    Isn't that the vote which had no consensus? Because I don't really think that's something that can be overturned. Theknightwho (talk) 09:39, 25 September 2023 (UTC)
  11. Support 1 and 3. I think 1 should be extended to also apply to the templates from P2. Abstain on 2. Chernorizets (talk) 23:15, 22 September 2023 (UTC)
  12. My votes:
    1. Weak Oppose: mostly harmless, but the final period/full stop is technically incorrect.
    2. Oppose I've seen too many cases like {{inh+}} from {{bor}}, {{bor+}} from {{inh}} ({{inh}} and {{bor}} should never occur in the same etymology except in alternate scenarios), {{bor}} from {{bor}} ({{bor}} should only be used once), etc., often with {{der}} in between, alleged borrowing between languages that have never coexisted (there are a few historical terms and learned borrowings that are okay, but bots can't always tell the difference). There are so many editors that don't understand the correct usage for {{bor}}, {{der}} and {{inh}} that there are tons and tons of etymologies with all kinds of subtle mistakes- and simply painting over them like this will just make them harder to spot. I've also seen strange things that bots have converted into {{bor+}} on previous runs.
    3. Weak Oppose. Mostly harmless, but also mostly useless. If someone wants to spend their time on this, I won't revert them, but no need to make it a formal priority. Chuck Entz (talk) 01:16, 23 September 2023 (UTC)
    (Honestly the fact that there are so many *editors* that don’t know how those work or how they interact goes to show that it’s not as intuitive as some folks make it seem. Precisely why we should link to the glossary.) AG202 (talk) 05:20, 23 September 2023 (UTC)
    I don't agree on the full stop being technically incorrect, etymologies are written in an elliptic style and I doubt anyone would prefer writing out "The etymology is" at the start of all of them. At most it's stylistically incorrect according to certain people. —Al-Muqanna المقنع (talk) 08:30, 23 September 2023 (UTC)
    @Chuck Entz But your Oppose to 2 is not based on the difference between {{bor+}} and {{bor}}, but on the misuse of {{bor}} and {{inh}} in general, which is almost completely orthogonal. Can you give specific examples of the things that a bot has wrongly converted to {{bor+}}? My bot has never converted anything but {{bor}} to {{bor+}}. Benwing2 (talk) 20:51, 23 September 2023 (UTC)
    @Benwing2: Not completely orthogonal. My point was that "simply painting over them like this will just make them harder to spot." As for bot errors: I can definitely remember at least one or two of your general cleanup bot runs did some rather complex changes to multiple-line etymologies that resulted in {{der}} with assorted things to the left of it being replaced with {{bor+}} at the beginning of the line. If memory serves (it's been a while), I reverted the few that I found and left it at that, since it didn't happen again. Unfortunately, I don't think I can hunt up any diffs. Chuck Entz (talk) 21:23, 23 September 2023 (UTC)
    @Chuck Entz Right, and I don't understand how a change from {{bor}} to {{bor+}} is painting over anything. Literally, all it does is remove the words "From" or "Borrowed from", and change it to glossary-linked "Borrowed from". As for incorrect bot changes, in the future please do message me rather than just reverting, as I'm not able to see the reverts so I can't know when I need to fix a script. However, I'm pretty sure what went on in the changes you're referring to. Essentially, sometimes I've done a general script-based cleanup and followed it with some manual or semi-manual cleanups. Sometimes these cleanups have included changing {{der}} to {{bor}} when I judged it correct to do so, and it's possible I removed some duplicative etymologies in the process. All such changes are marked "manually assisted", with an overall summary of the changes done. Benwing2 (talk) 00:47, 24 September 2023 (UTC)
  13. Support all, but I know some (especially 2) are a bit controversial. tbm (talk) 04:46, 23 September 2023 (UTC)
  14. Oppose all three. I like it when etymologies can be regarded as if a human afforded a thought about them. Fay Freak (talk) 08:40, 23 September 2023 (UTC)
    I don't see how these really change that - furthermore, I find it cumbersome when every etymology has entirely too much detail. Vininn126 (talk) 08:43, 23 September 2023 (UTC)
    Every time we add template requirements we increase the chances that we will subsequently forbid variation. We also create complexity that discourages new contributors, thereby exacerbating the shortage of contributors that we suffer from. We also thereby create a bubble in which our policy-making contributors are increasingly atypical of and often somewhat contemptuous of and even hostile to normal folks. DCDuring (talk) 14:01, 23 September 2023 (UTC)
    The proposed level does not strip the possibility of any variation with etymologies - I follow this and I'm still able to write plenty of different etymologies, particularly for words of obscure origin or some other tidbits, while reducing variation in places where it's unneeded and distracting. I believe the current proposal strikes a balance. Vininn126 (talk) 14:03, 23 September 2023 (UTC)
    Right, it's just a Nudge, the thin edge of the wedge, the camel's nose under the tent. DCDuring (talk) 14:10, 23 September 2023 (UTC)
    w:Slippery slope fallacy Vininn126 (talk) 14:11, 23 September 2023 (UTC)
    I note that your link is a piped link-deception: the actual article is w:Slippery slope, because it is an argument that opponents claim is a fallacy, not inherently a fallacy. I prefer to not be myopic, to take the long view. DCDuring (talk) 15:10, 23 September 2023 (UTC)
    I think it's a falsehood to assume that those in support are not being myopic. Furthermore, this does not actually address my counterpoint. Vininn126 (talk) 15:15, 23 September 2023 (UTC)
    Let me venture a long-view prediction. What is beginning as a series of bottable minor changes will become mandatory for user input on the grounds that no one could be bothered to run the bots to catch non-conforming raw user input. Additionally, there will be filters to exclude disapproved input. These steps will serve to further discourage normal folks from persisting in efforts to become contributors. And all of this in the service of "beautification", which means concealing the evidentiary weakness, speculation, and other rough edges in etymologies. DCDuring (talk) 16:24, 23 September 2023 (UTC)
    w:Slippery slope fallacy. You are assuming these have to happen because this happens. This is a classic fallacy, DC. Vininn126 (talk) 16:27, 23 September 2023 (UTC)
    I see you persist in your deceptive link to a redirect.
    "Cause" is a mechanistic concept. What we have is an instance of an evolutionary trend to rigidify Wiktionary in the direction of WP, but lagging and diverging in a way more fitting a dictionary. At WP there are lots of projects to overcome the rigidification that lead instead to more rigidification. Our path to rigidity is by templatization, our main way of achieving rigidity that protects our often-weak entries from edits by newbies that usually indicate problems with entries and sometimes even leads to desirable changes. I'm not so sure about what drives bureaucratization at WP, but here rigidification seems driven by either elitism or the pursuit of bright shiny objects. DCDuring (talk) 16:44, 23 September 2023 (UTC)
    I don't disagree that this thread is an increase in rigidization - I disagree with the idea that it has to lead to more and that the proposed level is too rigid. I have stated this many times now. Vininn126 (talk) 16:47, 23 September 2023 (UTC)
    @DCDuring Why do you keep calling that piped link a "deception" when it's a very common thing to call a slippery slope? You're bordering on uncivil, and seem to be using irrelevant pedantry to derail the point. Theknightwho (talk) 18:04, 23 September 2023 (UTC)
    Also just to point out that the very first line of that Wikipedia article begins A slippery slope fallacy, in logic, critical thinking, political rhetoric, and caselaw, is ..., so if anyone's being deceptive here it's you. Theknightwho (talk) 18:56, 23 September 2023 (UTC)
    It is a form of presenting an argument. Whether or not the presentation of argument is fallacious does not rest on it but the experiences substantiating its likelihood. A correlation can also be a causality and presenting a correlation as of a certain causality is no paralogy if not the presentation itself is the argument, only in such a case we speak of a fallacy. But on Wikipedia I am not generally sure to which extent they allow logical thinking, as opposed to social proof. The decrial of synthesis at least is prima facie support for the suspicion that every long article on epistemological matters contains contradictions by, paradoxically, synthesizing incompatible sources, to say nothing about selection bias and first come, first served anchoring effect producing belief perseverance in favour of whatever article has achieved a long-winded and long-sourced, though contradictory argument. For one, not even the German Wikipedia article in their interwiki claims it a fallacy but specifically “a designation for an argumentation wise or rhetorical technique”. Fay Freak (talk) 20:26, 23 September 2023 (UTC)
    That is not the title of the article and the article does not state or imply that all slippery-slope arguments are fallacious. The slippery-slope accusation is a great way to ignore any consideration of the longer-term consequences of a particular move. Of course, the long-term consequences may be desired by the advocates of the short-term move, but the discussion of those would not necessarily be the best way to get the short-term move through. DCDuring (talk) 21:53, 23 September 2023 (UTC)
    @DCDuring Oh yeah, it merely leads by calling it the "slippery slope fallacy", and explicitly calls one of your idioms ("thin end of the wedge") fallacious.
    I could posit my own theories about why you have such cantankerous, dramatic reactions to proposals for minor changes like this, but it would be just as unhelpful as your own speculation about supporters being attracted to "beautification" (which you seem to focus on irrespective of the actual reasoning being given). This is far from the first time you've done this, but please make it the last.
    For the record, I'm not voting either way on this. Theknightwho (talk) 22:08, 23 September 2023 (UTC)
  15. Question I cannot understand the import of Proposal 2- how might it affect, for instance, Etymology 2 of Syu and Etymology 1 of Xu? Thanks for any help on this. --Geographyinitiative (talk) 18:12, 23 September 2023 (UTC)
    @Geographyinitiative Xu wouldn't be changed since using {{bor+}} would break the flow of the etymology. On Syu, Etymology 2 would be converted to {{bor+|en|cmn|許|tr=Xǔ}} {{bor|en|cmn-tongyong|-}} romanization: ''Syǔ''. Ioaxxere (talk) 00:46, 24 September 2023 (UTC)
    I had no problem with and then implemented Ioaxxere's above suggestion on the Syu page--- there were no issues, so Proposal 2 is seemingly potentially compatible with what I'm doing with etymologies, though I am afraid that something valuable may get deleted somewhere. Geographyinitiative (talk) 09:15, 24 September 2023 (UTC)
  16. Support all. Einstein2 (talk) 18:19, 23 September 2023 (UTC)
  17. Oppose 2. I feel like templates are being shoved down my throat. PUC 20:27, 23 September 2023 (UTC)
  18. Oppose for 1, Support for 2, and Oppose for 3. ~ Blansheflur 。・:*:・゚❀,。 01:02, 24 September 2023 (UTC)
  19. Support all, but i'm a bit on the fence سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 05:19, 24 September 2023 (UTC)
  20. Support proposals 1 and 3, but Oppose proposal 2 because it effectively attempts to reverse “Wiktionary:Votes/2017-06/borrowing, borrowed”, so I think a formal vote is required for that. — Sgconlaw (talk) 05:36, 24 September 2023 (UTC)
  21. Oppose the usage of {{com+}}, the rest is fine. Catonif (talk) 13:39, 24 September 2023 (UTC)
  22. Under Proposal 2, will the 'boiler plate' being absorbed have to match the template? There seem to be several instances where simple 'From' is used with {{inh}} because reshaping has happened, sometimes only for some inflected forms, as in the convoluted development of finir from Latin fīniō. In this case it seems rather a call for review, as 'Etymology' and 'Descendants' sections tell different stories. --RichardW57m (talk) 11:07, 25 September 2023 (UTC)
  23. Oppose all. DCDuring (talk) 01:25, 22 October 2023 (UTC)

@Benwing2 It seems like there's consensus to enforce proposals 1 and 3 for English, so you can activate your scripts. Ioaxxere (talk) 19:35, 21 October 2023 (UTC)

@Ioaxxere Could someone count up the yeas and nays by proposal? I'm too tired to take a run at it myself now. DCDuring (talk) 01:28, 22 October 2023 (UTC)
@Benwing2 any update on this? Obviously, there's no rush but I hope it hasn't been forgotten. Ioaxxere (talk) 21:07, 13 December 2023 (UTC)
@Ioaxxere Not forgotten but can you list out the reasons for consensus? I want to make sure I don't end up doing something without consensus. Benwing2 (talk) 21:21, 13 December 2023 (UTC)
@Benwing2 Well, aside from a simple count it seems like a couple of people mistakenly thought the proposals would apply to all languages, or that they would become formal policy, and voted based off of that. Also, I don't think anyone brought a good argument against the proposals that would outweigh the advantages. It's also worth noting that the majority of entries that would be affected by the proposals were created by User:Equinox and hardly touched since (see e.g. unhoaxable). I guess we could change our minds at some point and remove every "from", but I doubt this will ever happen. Ioaxxere (talk) 22:07, 13 December 2023 (UTC)
@User:Ioaxxere, @User:Benwing2 If a proposer makes specific distinct proposals that are put to a poll, it seems that the proposer did not think any one of the proposals was much better than the others. As this looks like an informal vote, normal voting rules would apply. It doesn't matter whether the proposer or anyone else isn't impressed by the arguments made. The vote outcome is what matters.
If there has been confusion, perhaps, after a decent interval, say three weeks (after New Year), we could have a new poll or an actual vote on proposals that address the confusion and any objections, drawing from this discussion one or two improved proposals that have the best chance of earning a consensus. DCDuring (talk) 23:27, 13 December 2023 (UTC)
@DCDuring I don't think the second proposal has much of a chance in any scenario, so I don't see any reason to revive this discussion. I agree that the vote outcome is what counts, but User:Benwing2 specified "reasons" for consensus, which made me think that he was taking a w:Wikipedia:Polling is not a substitute for discussion perspective. Hopefully I didn't misinterpret that. Also, I should point out that if "normal voting rules" do indeed apply then you made your vote too late. Ioaxxere (talk) 01:10, 14 December 2023 (UTC)
@Ioaxxere Not taking any particular perspective, it's just that you asserted that there seems to be consensus for proposals 1 and 3 for English without specifying why; I haven't looked over the individual comments and votes so IMO it would be helpful to specify how many yeses and nos there are for proposals 1 and 3, and how important the no votes are (e.g. some of them may be "don't really care"s). Benwing2 (talk) 02:06, 14 December 2023 (UTC)
@Benwing2 Okay, by an inclusive count proposal 1 is 12/6/3 and proposal 3 is 13/6/2. That includes a couple of questionable votes- User:Victar opposing on purely procedural grounds, User:Catonif saying "fine" without the {{support}} template, and User:DCDuring voting late. By an exclusive count proposal 1 is 11/4/3 and proposal 3 is 12/4/2. But I never intended the discussion as a formal vote. Ioaxxere (talk) 03:02, 14 December 2023 (UTC)
What makes my vote "late"? Was there a posted date for closing this discussion? If it was supposed to be a formal vote for which lateness mattered, it was not done properly. DCDuring (talk) 00:52, 16 December 2023 (UTC)

Making formatting of trans-top more consistent

I noticed an inconsistency in the way labels in {{trans-top}} are formatted. Some use (label) description, whereas some use italics: (''label'') description. The former is more common (~1400 vs 400), although the latter seems more in line with {{lb}}. (There were also a handful that had a colon after the parentheses but I fixed those since they were clearly inconsistent.)

Actually, the most common way is label: description without any parentheses (~5700 according to a quick search, although not all of them might be labels.) Again, I think the parentheses look nice (and are in line with {{lb}}) but that's just my personal preference.

Could we add a label/lb parameter to {{trans-top}} and format these in a consistent way, or what would a good solution be? tbm (talk) 03:31, 23 September 2023 (UTC)

@tbm: What about simply using {{q}}? (Personally I prefer the colon.) —Fish bowl (talk) 07:49, 23 September 2023 (UTC)
Template syntax shouldn't be used in {{trans-top}}, because that breaks the translation adder and there is no practical way to make it support this case either. — SURJECTION / T / C / L / 13:28, 23 September 2023 (UTC)
I support formatting them as (label) gloss, however we decide to do that. I used to not italicize them, but I've come to prefer the italics, for consistency with {{lb}} and because it provides a clearer visual distinction between qualifiers and glosses. I'm willing to sacrifice my preference for the sake of consistency, however. Andrew Sheedy (talk) 01:19, 24 September 2023 (UTC)
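For illustration only, the three label styles described in this thread could be detected and rewritten into one form with something like the sketch below. It is a minimal sketch, not an existing bot or module: the function name normalize_gloss and the three regular expressions are assumptions made up for this example, and the colon pattern will also catch glosses that merely contain a colon, which is the "not all of them might be labels" caveat raised above.

```python
import re

# Hypothetical patterns for the three styles mentioned in the thread.
# (''label'') gloss  -> italic label in parentheses
ITALIC_PAREN = re.compile(r"^\(''(?P<label>[^')]+)''\)\s*(?P<gloss>.+)$")
# (label) gloss      -> plain label in parentheses
PLAIN_PAREN = re.compile(r"^\((?P<label>[^)]+)\)\s*(?P<gloss>.+)$")
# label: gloss       -> colon style; may match non-labels, so review by hand
COLON = re.compile(r"^(?P<label>[^:()]+):\s*(?P<gloss>.+)$")


def normalize_gloss(text, italic=True):
    """Rewrite a trans-top gloss into "(label) gloss" form.

    With italic=True the label is wrapped in wiki italics, matching the
    look of {{lb}}; with italic=False it is left plain. Text with no
    recognisable label is returned unchanged (apart from trimming).
    """
    stripped = text.strip()
    for pattern in (ITALIC_PAREN, PLAIN_PAREN, COLON):
        match = pattern.match(stripped)
        if match:
            label = match.group("label").strip()
            gloss = match.group("gloss").strip()
            if italic:
                return f"(''{label}'') {gloss}"
            return f"({label}) {gloss}"
    return stripped
```

Whether the output should be italic or plain is exactly the open question here, so the sketch leaves it as a switch rather than picking one.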

Template naming for punctuation characters

I found the {{punctuation}} template (which lists the various punctuation characters) and thought that would be nice for Swahili. I was looking for similar templates for other languages and noticed very inconsistent naming: {{vi-punctuation}} and {{list:punctuation/lv}}.

What would be the best way to make this consistent? Convert all to {{list:punctuation/LANG}}, since list: is the way this is done for other things (countries, months, chapters of the Bible, etc.)? tbm (talk) 03:38, 23 September 2023 (UTC)

I support integrating these into the list template system. — SURJECTION / T / C / L / 09:53, 23 September 2023 (UTC)
Yeah, ideally these should probably be incorporated into at least the list: naming scheme, and if possible also list:-template-style formatting. We still have a lot of templatized "lists of entries" at names outside the list: format, sometimes because they're so old they predate it (Template:coefficient) or for other reasons (e.g. Template:coloured flags). - -sche (discuss) 13:00, 23 September 2023 (UTC)
@Surjection @-sche It took me a while but I finally renamed {{punctuation}} to {{list:punctuation/en}} and {{vi-punctuation}} to {{list:punctuation/vi}}. Thanks for your feedback! I see there's also {{typography marks}} but I don't have time right now. tbm (talk) 07:19, 7 December 2023 (UTC)
@Surjection @-sche Never mind. {{typography marks}} was easy enough so I renamed that too. tbm (talk) 07:28, 7 December 2023 (UTC)

Station codes

Is a category for railway station codes (such as HWH), Category:Station codes, worth creating, or is it sufficient to list them in CAT:Rail transportation? ·~ dictátor·mundꟾ 21:44, 23 September 2023 (UTC)

I don't think it's helpful to list them together with other railway-related vocabulary, so I think it's better to create it as a subcat of the main category. Whatever we do, though, it should be consistent with our treatment of airport codes. Andrew Sheedy (talk) 00:29, 24 September 2023 (UTC)
@Andrew Sheedy @Inqilābī If you do categorise them, they should be subdivided by classification system. It makes no sense to have British and Indian railway codes all mixed up together, for example. Theknightwho (talk) 00:41, 24 September 2023 (UTC)
@Andrew Sheedy: I’d beg to differ here a bit, airport codes are used internationally and are regulated by an international body, the IATA. Whereas, the alphabetic railway station codes aren’t used ubiquitously, they are used and regulated locally within the country, and are used mostly in places which used to be part of the British Empire (Britain, South Asia, Hong Kong); other countries also have station codes but those are numeric. Thus I’m treating them as English entries rather than Translingual for their strong linguistic association with English language and non-universal usage.
@Theknightwho: Do you mean classification into countries? Our airport codes are themselves listed unclassified. ·~ dictátor·mundꟾ 07:57, 24 September 2023 (UTC)
@Inqilābī Yeah - I say “system” because I don’t know if some countries share one (I have a feeling some European countries do). Theknightwho (talk) 12:35, 24 September 2023 (UTC)
Oh, OK, makes sense. Linguistically/lexicographically, I think they tend to work quite similarly, so my main concern is that they not be treated in incongruent ways. Andrew Sheedy (talk) 16:47, 24 September 2023 (UTC)
After the creation, I record my agreement. DonnanZ (talk) 13:41, 24 September 2023 (UTC)
@Inqilābī Would they not be better as Translingual? Theknightwho (talk) 19:11, 24 September 2023 (UTC)
Are they used translingually? Spot-checking a few, including Indian ones which might be expected to also be attested in India's other languages, I only spotted occurrences in English. (If someone can demonstrate that such codes are generally translingual, I have no objection to treating even ones that happen to be attested only in English as Translingual, a la species names that happen to only be attested in English.) - -sche (discuss) 20:18, 24 September 2023 (UTC)
@Theknightwho: As I stated above, these railway codes seem rather English-specific, limited to a handful of places where English is an official language (the British introduced the alphabetic station code system in Britain and its colonies when they introduced rail transportation), and these are abbreviations of English terms, or those of English transliterations or Anglicized names for non-English names. The station code of Berlin Hauptbahnhof is 1071, that of Moscow Leningradsky railway station is 060073, and that of Shanghai railway station is 30671; these numeric station codes do look like something that can be shown as Translingual (if they pass CFI). But I don’t think the station codes of Bangladesh, Britain, Hong Kong, India, Pakistan need to be Translingual entries, and they feel better as English lemmas, just like CAT:en:Organizations. ·~ dictátor·mundꟾ 20:51, 24 September 2023 (UTC)
@Inqilābī @-sche I'll admit that I suggested "Translingual" when thinking of Indian railway codes in particular, but even in the UK there are timetables published in other languages. I don't just mean for tourists, either: you can buy UK rail tickets in Welsh here, for example, even if you're travelling nowhere near Wales. Note the dropdown list when you type in a station name: "Caerdydd Canolog (CDF)", which translates as "Cardiff Central (CDF)". It just seems like Translingual is the more natural option, but I'm not hugely fussed about it. Theknightwho (talk) 15:48, 4 October 2023 (UTC)
The numeric codes are basically arbitrary and do not strike me as "language" for a dictionary. A list of things and the numbers that represent them is a codebook. (I note we don't even include UK postcodes like RG for Reading.) Equinox 15:51, 4 October 2023 (UTC)

Moratorium on Single-Letter Entries

Back in WT:Beer parlour/2023/July#Changing Translingual to a Specific Language, @Benwing2 decreed:

@Kwamikagami, RichardW57 Can we (a) please move this discussion to the Beer Parlour, where it belongs, and (b) not edit war while this discussion is going on? It would be best to not touch any existing single-letter entries until some sort of consensus is reached, even if this takes awhile. Benwing2 (talk) 23:43, 16 July 2023 (UTC)

1. Can we agree that except in the overriding case of language mergers and splits, and in reversals of unauthorised changes, the normal process to change the potentially valid language of an entry is to raise the matter in {{rfm}}? This allows for changes via {{rfv}} and {{rfd}} as part of their clean-ups.

The reason I am hesitant is that the pipe-cleaning request at WT:RFM#อ‍ย has stalled. I fear we need to nominate another test case.

2. In the absence of this constraint, may we edit entries for non-letter like terms on the same page, and also make changes to whether PoS lines are at L2 or L3?

3. May I change the Mon section of က so that we no longer claim that the letter comes from a Proto-Mon-Khmer word for 'fish'? --RichardW57 (talk) 11:03, 24 September 2023 (UTC)

@RichardW57 All of this sounds fine, I think. BTW I don't know what Kwami has been up to lately. Benwing2 (talk) 02:34, 25 September 2023 (UTC)
@Benwing2: But do I have permission to proceed, subject to the usual requirement not to be disruptive? --RichardW57 (talk) 23:02, 25 September 2023 (UTC)
I'm here, just don't have much access to the internet. kwami (talk) 02:11, 26 September 2023 (UTC)
@Kwamikagami Thanks, and hope everything is well. @RichardW57 I think as a general rule, any changes to single-letter entries that are unrelated to changing them between Translingual and some specific language, or to deleting them, are fine. Use your best judgment here. Benwing2 (talk) 02:26, 26 September 2023 (UTC)

Malagasy Wiktionary

Jagwar-bot seems to be up and running again, with a complete disregard for the decision that was made a few years ago. Can something be done about this? PUC 14:34, 24 September 2023 (UTC)

@PUC: Basically: No. The bot gives a disclaimer now that the information has been automatically generated from the content on the English Wiktionary, and it seems to only convert sections that are added by some editors, but not others. The only thing we can do is push for a closure of the Malagasy edition of Wiktionary altogether, but we can't force it to accept a certain policy. Thadh (talk) 14:54, 24 September 2023 (UTC)

Clarification on Learned Loanwords

This question is brought up by complaints about the proposal to add glossary links to some etymologies. The glossary entry for 'learned borrowing' says: "learned loanword that was borrowed from a classical language such as Latin, Ancient Greek or Sanskrit, and which has not undergone significant reshaping due to sound change or analogy with inherited terms." Does apocope plus the Great Thai consonant shift constitute "significant reshaping due to sound change", e.g. /thoːt/ v. ⟨doṣa⟩? (The spelling change is just a mechanical script change that has since been retconned out.)

(The main issue with the term is that it does not mean 'loanword that is learnèd'.) --RichardW57m (talk) 12:08, 25 September 2023 (UTC)

@RichardW57m The definition of "learned borrowing" here is intended to distinguish it from semi-learned borrowings, which are common in the Romance languages, in Hindi, etc. However, I think the example you give is actually an example of an orthographic borrowing, since the spelling is presumably the same as the Sanskrit original but the pronunciation has shifted significantly. Benwing2 (talk) 02:42, 26 September 2023 (UTC)

Making non-ogt into its own L2.

Old Gutnish is a medieval North Germanic dialect at the same level as Old Danish and Old Swedish. It should thus be an L2. The alternative would be merging all of these under Old Norse as dialects and making the Old Norse normalization used on WT more archaic than the current one, which is typical C13th Icelandic. That seems rather unlikely and so Old Gutnish should be made an L2, maybe with new code gmq-ogt. ᛙᛆᚱᛐᛁᚿᛌᛆᛌProto-NorsingAsk me anything 18:07, 25 September 2023 (UTC)

Done. SURJECTION / T / C / L / 08:49, 4 October 2023 (UTC)

Etymological code for Scythian language

Can a code for the Scythian language be added to the etymological languages? I understand that there is already a code present for the broader Scythian languages family, but I need a code for the Scythian language proper itself in addition to the language family. Antiquistik (talk) 23:42, 26 September 2023 (UTC)

@Antiquistik interestingly, the code xsc in theory is just for "Ancient Scythian", whatever that refers to, not for a family. I don't know much about the subject, but are you sure there was a single Scythian language in the past? E.g. we have Old Ossetic (code 'oos') - is that different? Chernorizets (talk) 02:02, 27 September 2023 (UTC)
@Chernorizets There were several languages in the Scythian languages family. Among the western "branch" of this family, a "Scythian" language (spoken by the Pontic Scythians proper and possibly by the historical Cimmerians) and a Sarmatian language, both distinct from each other, have definitely been identified. If a code for Sarmatian could be created too, that would be great, but for now what I really need is a code for Scythian proper. Antiquistik (talk) 02:25, 27 September 2023 (UTC)
Anthropologically, Scythian is a catch-all term, often used to refer to both the Saka and the Alans. Linguistically, it's the language family that contains both their languages. Western Scythian are the Ossetic languages, including Alanic xln. If you're talking about the Cimmerians, we assume they were culturally Scythian, but we have no idea about their language, and if it was even Iranian. You can use language code und for undetermined languages, but I wouldn't support any Cimmerian reconstructions. --{{victar|talk}} 05:16, 27 September 2023 (UTC)
@Victar I mostly agree, although there seems to be a growing body of work concerning the Cimmerians that increasingly regards them as very likely having a strong East Iranic component. Therefore I think we should be prudent without altogether rejecting the possibility of Cimmerian reconstructions.
My main issue, however, is more with how the only codes available for Scytho-Sakan languages are for xsc-pro (Proto-Scythian) and xsc-sak-pro (Proto-Saka). These are useful for the more archaic phases of the languages, but not for the more derived phases. For example, if I wanted to add the three main etymologies proposed in this paper to Wiktionary, the presently available codes wouldn't permit me to do so.
Additionally, given that Scythologists such as Sergey Tokhtasyev and Sergey Kullanda (both recently deceased) have established that Scythian (extinct), Sarmatian (ancestral to Alanic and Ossetian) and Saka (ancestral to Khotanese, Tumshuqese and certain Pamiri languages) differ enough to be considered different languages, I don't think it's tenable to continue using "Scythian" as a catch-all term, and we would need etymological codes for these three languages specifically sooner or later. Antiquistik (talk) 05:34, 27 September 2023 (UTC)
Create a sandbox of the "three main etymologies" you're referring to, and I can give my recommendations on how they should be handled. I would also direct you to the family tree on CAT:Proto-Scythian language so you might take note of the Ossetic branch and its language codes. --{{victar|talk}} 06:03, 27 September 2023 (UTC)
@victar I'll try. Although I would need your assistance prior to that, because some of the entries in {{R:inc:IAIL}} that I require to create the reconstruction are absent from the only copy of this now offline database that I have at my disposal. Antiquistik (talk) 07:26, 27 September 2023 (UTC)
@victar I apologise for inadvertently missing your point that "Western Scythian are the Ossetic languages, including Alanic xln" in my previous reply.
Ronald Emmerick is correct to point out that "the languages of the Scytho-Sarmatian inscriptions may represent dialects of a language family of which Modern Ossetic, an East Iranian language, is a continuation, but it does not simply represent the same language at an earlier date": equating Ossetic with all of the Western Scythian branch reflects an older understanding of the relationship between Western Scythian and Ossetic, one that has been made obsolete during the last three decades thanks to Sergey Tokhtasyev's studies on the Scytho-Sarmatian languages.
Western Scythian is now understood to have comprised two main languages:
  1. Scythian proper or Pontic Scythian, spoken by the Scythian people fitting the narrowest definition of the term, i.e. the population who lived in the Pontic Steppe (more or less corresponding to present-day Ukraine's territory) between the 8th and 3rd centuries BCE. This language is fully extinct and has no known descendant languages.
    • The defining feature of Scythian proper or Pontic Scythian was the sound shift from Proto-Iranian /d/ to Proto-Scythian /δ/ to Scythian /l/.
      • Examples of this include *Skuδatā- (whence Σκύθαι) to *Skulatā- (whence Σκολότοι); *Paraδāta- to *Paralāta- (whence Παραλάται)
  2. Sarmatian, spoken by the various groups including the Iazyges, Roxolani, Aorsi, Siraces, and Alans. This language is the ancestor of the Ossetic languages.
    • Pontic Scythian's transition from /d/ to /δ/ to /l/ was absent from Sarmatian, whose defining feature was the transition from Old Iranian /ry-/ to Middle Iranian /l-/.
      • Examples of this include *Rauxšna-aryana-/Rauxšnāryana- to Roxšnālan(a)- (whence Ῥωξολανοί); *Aryana- to *Alan(a)- (whence Άλανοί)
This also ties in with my issue with the three main etymologies: there is a transition from Proto-Iranian /d/ to Proto-Scythian /δ/ to Scythian /l/ and another one from Proto-Iranian /ry/ to Sarmatian /l/, that, in my opinion, the presently available etymological codes for Scythian do not allow to cover, as you can see with the three sandboxes:
Antiquistik (talk) 04:30, 29 September 2023 (UTC)
@Antiquistik: These would never fly, because what you have are three Greek terms borrowed from Scythian. They should be created as Ancient Greek entries, with a hypothesis as to their origin in the etymology. --{{victar|talk}} 23:29, 30 September 2023 (UTC)
@Victar Wiktionary does accept reconstructed terms from extinct languages, though, no?
But, even so, the Ancient Greek entries would need to include both the Proto-Scythian forms *Dipoxšayaʰ/*Δipoxšayaʰ and *Kaudaxšayaʰ/*Kodaxšayaʰ in addition to the Scythian proper/Pontic Scythian forms *Lipoxšayaʰ and *Kolaxšayaʰ. Otherwise, their etymologies cannot be explained.
And for this, a code for the Scythian proper/Pontic Scythian language is required in addition to the already existing one for Proto-Scythian. Antiquistik (talk) 08:03, 1 October 2023 (UTC)
@Antiquistik: The issue is not with codes. Depending on the time period, one would use either xsc-pro (Scythian), os-pro (Sauromatian), or oos (Alanian, Sarmatian). See User:Victar/Timelines/Scythian_loanwords. The problem with onomastics is that their etymological certainty is often very low, and in these cases, extremely so. Creating entries for these reconstructions is far too speculative. --{{victar|talk}} 17:54, 1 October 2023 (UTC)
@Victar: A bit off-topic, but Proto-Permic is usually dated up to the 9th century, with Proto-Komi spanning one or two centuries after that. I don't know where you got the date 1372 from, but any borrowing after 1100 will be considered a borrowing into an individual Permic language. Thadh (talk) 23:02, 1 October 2023 (UTC)
I think I dated it to the first texts of Stephen of Perm. Not really my area of expertise so didn't look too much into it, but thanks, I might update it. --{{victar|talk}} 00:14, 2 October 2023 (UTC)
I don't disagree with your point if it regards creating entries for the Scythian reconstructions themselves. But do you also mean that in regards to creating Ancient Greek entries with etymological hypotheses?
I nevertheless disagree regarding the importance of the codes, especially with this point: "Depending on the time period, one would use either xsc-pro (Scythian), os-pro (Sauromatian), or oos (Alanian, Sarmatian)."
Pontic Scythian/Scythian proper and Sarmatian (to which belongs Ossetic, i.e. Alanic and Ossetian) are two different branches of the same language family, and are different enough to make it problematic for them to be used as interchangeably as this.
I have already addressed this in the earlier part of my second previous reply, and the problems that I have pointed out will persist even if I were to instead create Ancient Greek entries with etymological hypotheses for these, as you can see:
And these problems affect not only these three somewhat speculative etymologies, but also the more established and more universally accepted ones, like:
  • Pontic Scythian:
    • Skuδatā (Σκύθαι) → Skulatā- (Σκολότοι)
    • and Paraδāta- → *Paralāta- (whence Παραλάται),
  • and Sarmatian:
    • Rauxšna-aryana-/Rauxšnāryana- → Roxšnālan(a)- (Ῥωξολανοί)
    • and *Aryana- → *Alan(a)- (whence Άλανοί).
Antiquistik (talk) 03:08, 2 October 2023 (UTC)
I feel like there is some language barrier in our communication. Sandbox entries are very helpful.
The Germanic people spoke Proto-Germanic. They didn't speak Germanic, regardless of their tribe or dialect. So these Ancient Greek terms you link above would be derived from Proto-Scythian, not Scythian. Scythian is not an attested language. --{{victar|talk}} 18:02, 2 October 2023 (UTC)
You're also not reconstructing Proto-Scythian properly. Proto-Iranian *Kawdáxšayah would render PScy. *Kōδáxšēi > Sauromatian Proto-Ossetic *Kūδáxseɨ, *Kūláxseɨ. Ancient Greek Κολάξαϊς (Koláxaïs), however, is usually reconstructed as being from *Hwaryakšayah, so both etymologies should be mentioned. --{{victar|talk}} 01:39, 3 October 2023 (UTC)
@victar It seems that I have also mis-explained what I meant.

When I say "Scythian proper/Pontic Scythian," I am not equating the Scythian language family with a single language. What I mean is the language of the Scythian people as narrowly defined, that is the language of the people who inhabited the Pontic Steppe/the territories now mostly constituting Ukraine between the 7th and 3rd centuries BCE. This "Scythian proper/Pontic Scythian" language appears to be indeed attested as only one member of the much larger "Scythian languages" family, and is not synonymous with the Sauromatian language.

I don't mean the broader "Scythian languages" of the broader "Scythian cultures" that this specific tribe was part of.

The derivation of Koláxaïs from *Hwaryakšayah is itself an older view, which considered both the historical Sauromatians and the historical Pontic/"Ukrainian" Scythians as speaking the same language.

The more recent studies over the course of the 90s, 2000s and 2010s by linguists specialising in Scythian languages, such as Sergey Kullanda, Sergey Tokhtasyev, and Mikhail Bukharin, have instead shown that there were two "western" Scythian languages:
  1. Scythian proper or Pontic Scythian, spoken by the Scythian people fitting the narrowest definition of the term, i.e. the population who lived in the Pontic Steppe (more or less corresponding to present-day Ukraine's territory) between the 8th and 3rd centuries BCE. This language is fully extinct and has no known descendant languages.
    • The defining feature of Scythian proper or Pontic Scythian was the sound shift from Proto-Iranian /d/ to Proto-Scythian /δ/ to Scythian /l/.
      • Examples of this include *Skuδatā- (whence Σκύθαι) to *Skulatā- (whence Σκολότοι); *Paraδāta- to *Paralāta- (whence Παραλάται)
  2. Sarmatian, spoken by the various groups including the Iazyges, Roxolani, Aorsi, Siraces, and Alans. This language is the ancestor of the Ossetic languages.
    • Pontic Scythian's transition from /d/ to /δ/ to /l/ was absent from Sarmatian, whose defining feature was the transition from Old Iranian /ry-/ to Middle Iranian /l-/.
      • Examples of this include *Rauxšna-aryana-/Rauxšnāryana- to Roxšnālan(a)- (whence Ῥωξολανοί); *Aryana- to *Alan(a)- (whence Άλανοί)

The proper family tree of the Scythian languages appears to instead be:
I don't have access to the totality of this research, but these three studies, by Sergey Tokhtasyev and Mikhail Bukharin, are important in this regard. There are two more important studies concerning this research on the Scythian languages, but I would need to email them to you if you want to read them as I cannot link to them on this discussion space. However, most of this research is in Russian and Ukrainian, so you will need to translate them if you don't speak either language. Antiquistik (talk) 15:33, 3 October 2023 (UTC)

We're going in circles, and your tree is wrong in several ways (have you not seen CAT:Proto-Scythian language?), but I'm going to leave you with this: Pontic/European/Western Scythian is an anthropological term for peoples of a region. The linguistic term we're using for the language spoken in this region around the 5th century BC to the first century AD is Proto-Ossetic, which includes the dialects of Sauromatian and those of the Scythians of the Pontic Steppe spoken of by Herodotus. The phonological change /δ/ > /l/ is a dialectal feature, not a marker of a completely different language. Also, just because one etymological theory is newer than another doesn't inherently make it more correct -- both should be mentioned. When/if you create Ancient Greek entries, ping me so I can go over them. --{{victar|talk}} 03:11, 4 October 2023 (UTC)

@victar I have seen the family tree, and I find it very problematic. See Ronald Emmerick's comment that the languages of the Scytho-Sarmatian inscriptions may represent dialects of a language family of which Modern Ossetic, an East Iranian language, is a continuation, but it does not simply represent the same language at an earlier date. Equating Proto-Ossetic with all the languages/dialects of the various western "Scythian peoples" is itself an extremely inaccurate understanding of the relationship between Ossetic and the language of the Pontic Scythians.

As to whether /δ/ > /l/ is merely a dialectal feature or a marker of a whole other language, I find it preferable to rely on the linguists who have studied the issue rather than arguing on a forum, and the current understanding of the Scythian languages by Scythologists is that Ossetic is descended from the Sarmatian language, which was a sibling-language of the language spoken by the Scythians of the Pontic Steppe. Regarding the validity of research, I think it's safe to say that what you said also goes the other way round, and that just because certain theories are older it doesn't mean that they are permanently established: we can't allow ourselves to be too rigid, older hypotheses do become obsolete and are replaced by newer ones all the time in all fields of scientific research, including linguistics, and we need to stick to what is best backed by evidence.

Given that our argument is going in circles, I would encourage you to read the sources I have linked if you want to understand why I am disagreeing with you. And if you wish to read those as well, I can also email you the two other studies on the Scythian languages that I cannot link to in this discussion. Peace out. Antiquistik (talk) 06:47, 4 October 2023 (UTC)

Process for editors with declared conflicts of interest

An editor (User:StobbsOBE) arrived recently who has declared that they work on behalf of clients to protect trademarks. They were blocked, but they also seem like they are willing to be above board and attempt to work with the community to both do their job and respect our policies and practices. I proposed on their talk page the idea of requesting edits rather than making edits on any page where they may have a conflict of interest, which they seemed willing to try. Is this an agreeable solution to the community? If so, what is the best way to request an edit and advocate for its merit? The talk page of the entry in question is obviously the best place for such discussion, but it is very unlikely to be seen there. Can the discussion take place in RFC or RFV, and then be moved to the talk page when completed? - TheDaveRoss 17:06, 29 September 2023 (UTC)

Setting aside the fact that this edit was let's say botched(?), I think there is no problem with someone editing with a declared COI. I find it unlikely that the community here is interested in the same overhead as at en.wp, but at the very least, requested edits seems entirely reasonable. As for the venue, we do have a number of templates in Category:Request templates and could 1.) make a new one or 2.) repurpose an old one for this kind of request. —Justin (koavf)TCM 20:24, 29 September 2023 (UTC)
I'm not sure, on one hand I think it's inappropriate for an editor to join a project for the sake of a 3rd party... but I think it's better that editors who have conflict of interests announce it rather than keeping it a secret... I think I'll wait for more input before I make a hard decision on how I feel about this. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 20:40, 29 September 2023 (UTC)
What's the problem with above-board third party editing? Maybe Person A edits on behalf of Person B who is illiterate or who has a medical condition or who just doesn't "get" technology or who is too young to edit directly, etc. —Justin (koavf)TCM 20:42, 29 September 2023 (UTC)
I think an editor writing on behalf of someone with disabilities is not at all comparable to an editor writing on behalf of McDonalds... but that's just my opinion. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 20:46, 29 September 2023 (UTC)
Although I tend to be pretty hard-line about COI, I think this could be easily implemented with a mainspace- citation- and appendix-space-only block, which would allow access to forums as well as entry and user talk pages. I suppose the matter of RQ templates might necessitate restrictions on templates and modules as well in some cases.
I think it would be a good idea to develop an informational template for user talk pages and a Wiktionary-namespace page laying out the rules. We need to be especially clear about how a descriptive dictionary has to go with usage even when usage is unfair or wrong. We should have clear and enforceable procedures and ground rules completely spelled out beforehand.
The main reason I'm suggesting blocks (with an appropriate message indicating it's not because of wrongdoing) is that we don't have enough admins to monitor for compliance with COI policies in addition to the usual patrolling. Better to have such edits vetted beforehand. Chuck Entz (talk) 22:38, 29 September 2023 (UTC)
Other than pointing out errors in citations that we in effect assert show generic use of a trademark, how is this kind of contributor going to help the project? Would we want to have such contributors insert templates with text drafted by attorneys to warn people that using a word generically will attract the attention of the trademark police? I think trademarks are property worth protecting, but not by ignoring genericization or by carrying water for the owners of trademarks. DCDuring (talk) 23:28, 29 September 2023 (UTC)
I would think it is useful information to include in entries when a term is a protected trademark in some jurisdiction. Perhaps a good example is Slurpee, which appears from our entry to be entirely trademark free. In the real world my hunch is that this is only "generic" in the sense that very occasionally people refer to similar such drinks in casual conversation as Slurpees, but is in no meaningful way a genericized trademark like dumpster. - TheDaveRoss 12:30, 2 October 2023 (UTC)
I don't see how we can provide accurate information on the claimed validity of trademarks. Should we be inserting definitions that fit the trademark use of words like Bold and Tide? I'm guessing that other dictionaries only insert some kind of indication that a term is used as a trademark in response to letters from trademark owners' attorneys. DCDuring (talk) 23:32, 2 October 2023 (UTC)
I would think we wouldn't if there was no sense related to the trademark, so Bold and Tide would not mention the products or brands or trademarks at all. However in entries where we tacitly or explicitly assert that a term is generic despite knowing that the term is an actively protected trademark in some jurisdiction, we could include some information to that effect. - TheDaveRoss 12:56, 3 October 2023 (UTC)
It will remain much more hit or miss than our other content. I would argue that such grossly incomplete coverage misleads our users and that it will never be anything but hit or miss. DCDuring (talk) 14:05, 3 October 2023 (UTC)
@TheDaveRoss I came across the term hot button today. How should we handle the trademark issue? Check the validity of the trademark (I think there's an annual fee that has to be paid.)? Keep the comment in the etymology section to warn users of the expression that there is a risk in using it? Tell users not to take the trademark comment seriously? Remove the comment? Should we categorize the expression and its alternative form as trademarked items? Should we research the trademark to find out what usage is covered? DCDuring (talk) 16:57, 3 October 2023 (UTC)
You can look up US trademarks at Justia for free, no idea if that is going to be a persistent resource or not. I don't think I would keep the trademark note in the etymology necessarily unless it is clear that the term was popularized by that usage, since the usage predates the mark. The more interesting cases are when the origin of the term is from a trademark. - TheDaveRoss 18:34, 3 October 2023 (UTC)
"Interesting" is not the main issue. What expectations are we creating among users? Who is going to spend time investigating existing references to "trademark" in our entries? DCDuring (talk) 23:12, 3 October 2023 (UTC)
As with most things trademark related, the onus is on the holder to protect the mark, which is the whole point of the discussion here. When someone comes along and "protects" their mark in some way, we can respond to that. If nobody says anything we can keep going with the current practice of providing descriptive definitions and only noting trademarks where the editor deems appropriate. - TheDaveRoss 13:27, 6 October 2023 (UTC)
In this concrete case the term is just about attestable before the trademark was assigned, but that doesn't mean it was widely used. Regarding checking the currency/validity of trademarks, I think that's out of scope for this project. Jberkel 08:02, 5 October 2023 (UTC)
Wow, Wiktionary is growing up, if people are finally able to get companies to pay them to 'protect' them on Wiktionary, like they pay people to edit Wikipedia.
Obviously, the people making the most inappropriate edits won't disclose their COIs in the first place, so it may behoove editors to watchlist the entries in the Trademarks and Genericized trademarks categories. (And if we happen to discover that someone has been editing entries in a POV way with an undisclosed COI, would we agree that's a blockable offense, falling under "unacceptable conduct"?) But for editors who do want to play by the rules and disclose their COI, I think asking them to suggest their edits in a forum like the Tea Room, rather than just making the edits, is a reasonable idea. Companies can object to whatever they object to, we can vet their proposed changes, ... - -sche (discuss) 00:13, 30 September 2023 (UTC)
I doubt that even a large portion of all terms with trademarked usage are included in Category:English trademarks. (BTW, trademarks are an entity in a legal jurisdiction, not in a language per se.) Couldn't we create a category populated by a template inserted on an entry's talk page for this rather than clutter up Tea Room or the entry page itself? We can insert a table for oldest additions to the table and have a subcategory for items that are more than a month or a year old and undertake to resolve the trademark-related issue in some finite time. DCDuring (talk) 14:46, 30 September 2023 (UTC)
Thank you all for contributing to this conversation. For anyone wanting to look up trademark registrations free of charge, this website is a useful resource: https://www.tmdn.org/tmview/welcome#/tmview. I think the trademark issue only arises where the Wiktionary entry gives the impression to the user that the term is generic, where in fact it is protected as a trademark, as this misleads the user (or at least doesn't fully inform them) and risks the user acting in this belief, only to then find themselves facing legal action. So perhaps for all entries where "genericized trademark" is used, a sense check can be made (using the trademark search tool above) and a warning added if the trademark is still protected in some territories. As with everything else on Wiktionary, rather than creating work for the admins to police this, it may be that there is only enough resource to deal with this ad hoc, as and when a user like me raises the issue for specific entries? I.e. the brands need to take an active part in the community to tackle the issue.
In case helpful to understand the legal context behind this request: Registered trademarks can be revoked under English law (and many others) if, as a consequence of acts or inactivity of the proprietor, it has become the common name in the trade for a product or service for which it is registered. This is why the trademark owners are actively taking steps to prevent their marks from becoming generic or being termed generic. Further, under English law, if the reproduction of a trade mark in a dictionary, encyclopaedia or similar reference work, in print or electronic form, gives the impression that it constitutes the generic name of the goods or services for which the trade mark is registered, the publisher of the work must, at the request in writing of the proprietor of the trade mark, ensure that the reproduction of the trade mark is accompanied by an indication that it is a registered trade mark. https://www.legislation.gov.uk/ukpga/1994/26/section/99A. Hence my suggestion of getting the brands to take an active part in the community if they feel their mark is being misrepresented in a specific entry. StobbsOBE (talk) 13:30, 4 October 2023 (UTC)
Right. Wiktionary, in effect, is in the business of gathering evidence for the defence, for free. We generally avoid coverage of trademarks unless there is some evidence of genericization (a minimum of three attestations of such use). If someone were inclined to avoid violating a trademark, they would probably check a definitive database, not rely on Wiktionary. Maybe the best thing we can do is have a page Appendix:Trademarks to which we direct people and which in turn directs users to various jurisdiction-specific and other databases. DCDuring (talk) 14:08, 4 October 2023 (UTC)
@DCDuring The dilemma seems to be that, while we want to be descriptive about the usage, the fact that people use terms generically does not mean that we are allowed to present them as generic if the mark holder has asked us not to do so. - TheDaveRoss 14:17, 6 October 2023 (UTC)
Who sez? DCDuring (talk) 14:21, 6 October 2023 (UTC)
Well, for one, the British government based on the link provided above. - TheDaveRoss 13:16, 9 October 2023 (UTC)
The legislation seems to be based on the notion that reference works are prescriptive rather than descriptive. If so, we've been doing it wrong from the beginning.
Have we ever been informed of such a written notice? Presumably, it is WMF legal that would get such a notice. I hope they would inform us if action is to be taken. DCDuring (talk) 14:48, 9 October 2023 (UTC)
First, it is not the British government or under English law in particular but the European Union. The exact wording is found in Article 12 of the EU trademark regulation for European Union trade marks and for German national trade marks § 16 MarkenG. The applicable law for EU courts follows from Article 8 Rome II Regulation – I don’t know about US private international law.
And the provision hardly applies. We don’t ever “give the impression that it constitutes the generic name of the goods or services for which the trade mark is registered”, as we are too comprehensive as to make frequency claims. For a descriptive dictionary of all words necessarily covers trademarked generic names. This sect. 99A seems more for dictionaries containing a selection of terms claimed to be of some general use within a language community. Otherwise we at best only claim a term to be “a generic name”, not “the generic name”, and even that is doubtful, since it is conceivable that a dictionary or glossary describes trademarked products: many things called “a dictionary of …” are actually encyclopediae that can also touch upon the history of brands: Any claim that we fall under that provision would be the result of abstraction and have to be substantiated. Didn’t happen for ritz, the definitions do not even describe a generic product nor service.
As with any EU law, not the actual wording (of which 24 language versions exist) is relevant, but the intention behind it within the statutory system, the possibility that a brand could lose distinctiveness and thus be liable to deletion due to having become a generic name (in the UK sect. 46(1)(c), in Germany § 49 Abs. 2 Nr. 1 MarkenG). So in spite of the German version as in § 16 MarkenG speaking of “eine Gattungsbezeichnung” instead of “die Gattungsbezeichnung”, there would have to be the possibility of “the common name in the trade for a product or service for which it is registered”, therefore a German commentary to § 16 MarkenG specifies that the impression must arise that it is “die allgemein übliche Benennung einer Produktkategorie im Verkehr”. The reasonably well-informed and reasonably observant and circumspect average consumer needs to assume from reading “dass die Marke im Verkehr ganz allgemein für das Produkt steht, ohne auf einen bestimmten Hersteller hinzuweisen.” Quite an achievement for a random page in this dictionary, which regularly fails to label the frequency of a term in opposition to other formulations a language community alternatively uses. The editor above linking to “English law” misrepresents “the legal context”, which from an objective rather than interest-slanted interpretation should lead to a restrictive interpretation.
And the impression will be avoided in any case if we note something about the trademark in an etymology or usage note. Then we can present what we descriptivistically find fit beside warnings. The legal consequence of such a claim is only “the insertion of an indication”, “Einfügung eines Hinweises”, in the same commentary BeckOK MarkenR/Eckhartt, 34. Ed. 1.7.2023, MarkenG § 16 Rn. 11. “Wie der Hinweis im Einzelnen ausgestaltet wird, bleibt dem Verleger überlassen.” “The way of presentation of the indication is left to the publisher.” Of course it does not matter “how the brands feel” – they have no claim to falsify statements of generic use. There is no general protection from sequences equalling brands “being termed generic”. An impression (to an attentive and intelligent consumer) of not only generic but also singularly common use for a product or service congruent with the one the trademark is registered for would have to be created for claims to arise, which would then be settled by a mere notice, and we would still rightfully claim the term being generic—probably with more damage to the distinctiveness of the brand due to the Streisand effect. Making trademark claims for this dictionary is hence pointless to all parties. We don’t create the legally relevant impression, and if we did then the solicitors would even more. And well lawyers shouldn’t try to manipulate the content of dictionaries either if they are (intentionally or not) less attentive and understanding than the target group in the field who regularly feels the need to consult such a dictionary, likely after already having encountered an unknown generically used term. Fay Freak (talk) 21:11, 9 October 2023 (UTC)
I thought the UK decided the EU was passé and started following their own laws again. Very bold of you to note that the EU laws exist in 24 languages, but choose to translate the German versions rather than just sharing the English version. Thanks for, as usual, providing clarity with your brief comment. - TheDaveRoss 19:32, 13 October 2023 (UTC)
@TheDaveRoss: How do you imagine “following one’s own laws” works? I hope not how Brexiteers do. Then you thought wrong. Would not be too surprising though for a Usonian to get it wrong after the leave campaign’s tubthumping of the single largest English-speaking country exiting the EU.
Given the extent to which political decisions affecting the United Kingdom had been conferred to a unional level, the acquis communautaire had to be copied over, where necessary, rather than reinventing the wheel, during the busy leaving process: Copying wasn’t necessary when the UK lawmaker, in his extant acts, mediated the content of directives, the most frequent of the legislatory forms of intervention of the European Union, by reason that it leaves discretion in terms of implementation in national law while formulating some general standards, as demanded by the principles of subsidiarity and proportionality limiting its competences according to Article 5 Treaty on European Union: Hence “a directive shall be binding, as to the result to be achieved, upon each Member State to which it is addressed, but shall leave to the national authorities the choice of form and methods” as defined by Article 288 Treaty on the Functioning of the European Union. Of course within that discretion, the most straightforward method to transpose the directive and having “one’s own laws” is just repeating the directive verbatim, respectively parts of it, as happened in this case of the Trademark Act, so there isn’t anything more to share than potentially the context of a chain of repealed directives – tried to not make it complicated, as I would merely multiply identical source texts and give the false impression that a directive has direct effect in law. The point is that you all had a quite wrongly constructed vision of “what it says”. It has to be interpreted in a linguistically sensitive way that does not depend on the English language. 
In that fashion it should be debated how Wiktionary’s presentation would fall under that provision for “dictionary, encyclopaedia or similar reference work” – which again exists mutually translatable in the German “own law”, so I could take the German legal commentaries to find what they think about or how one would apply that provision, which should be the same across countries.
Even for non-EU countries their statutes are copypasta from Brussels, it’s called the Brussels effect, and it does not need US-style originalist interpretation to realize that these legal provisions will have to be interpreted as the EU legislator intended anyway, uniformly beyond any authoritative individual language: for this single reason the European Court of Justice is instituted, to find the law somewhere “behind the lines” of the German and French if not English version, while only demagogy drawing an analogy to the customary Anglo-Saxon situation portrayed a picture of the Court, rather than its political institutions, “making the law”, which was beyond comprehension on the continent.
The interpreter does not make the law, this harmonized law at least, it is wrong though it be a convincing topos if one speaks English. You know that kind of bias in science if one’s selection is slanted towards a tradition or certain social institutions because one took only English-language information? According to the “continental” legal tradition, which is at work in the EU by its founding history, courts don’t make the law, and the English courts can’t do now either, unless to the extent they already used to ignore the law (which they definitely had a natural tendency to do while members, as you rightly suspected, a cultural difference, without expansion upon shocking anecdotes from English courtrooms on how they just ignored it when explicitly pointed out by German solicitors). Should be brief enough for you. Let’s not pretend lawyers make a living from making people’s life less complicated. When they portray the case as clear, they also should be suspect to frame some interest to have an effect, not to say fool us. In this case there is some incompatibility / incommensurability between US categories on trademark law which the English trademark lawyer attempts to invoke, talking to people here, when the relevant provisions, in spite of their familiar common law verbiage, are actually translated from a different legal system. To interpret scientifically correctly according to legal science, I contextualized it thus, though there are simpler ways to conclude that he is wrong. You don’t regularly get clarity with bad laws save from court. But you might understand why it is not as clear either as he portrayed it. Fay Freak (talk) 22:59, 13 October 2023 (UTC)

New categories created by User:Sokkjo/User:Victar

User:Sokkjo/User:Victar has created dozens of new topic (i.e. "related to X") categories over the last month. I have been RFD'ing the ones I think are poorly conceived or redundant but I think we need (a) a temporary moratorium on any new categories created by this user, (b) a more general discussion on what topic categories are needed. Most recently, for example, Victar just created three categories 'Length', 'Width' and 'Distance' that IMO are poorly conceived; we need at most one of these, and probably none. 'Distance' contains only three Proto-West-Germanic terms, the equivalents of far, near and by. I don't think topic categories were meant to be vague groupings of terms like this; they should be specific and should have clear criteria for what goes in them. He also created Category:Sitting, again barely populated, and a higher-level category Category:Rest containing Category:Sitting and Category:Sleep (IMO a poor grouping that makes little sense). These higher-level categories are particularly problematic because they result in dozens or hundreds of language-specific versions getting created that are populated in this case only by one subcategory. Another example is Category:Slaves, redundant to Category:Slavery. I'm sure there are a lot more similar cases; these are just the first few I've looked at. Benwing2 (talk) 22:29, 29 September 2023 (UTC)

Agree with Benwing; also I think editors should hold a discussion to get community consensus first before setting about creating new topics. ·~ dictátor·mundꟾ 14:43, 30 September 2023 (UTC)
I don’t know about that one, editors create new topics when they think they will reliably be filled across multiple languages with two-digit numbers of entries. Categories for body positions make sense, just as a category for exercises does. The tree should also make sense of course. Fay Freak (talk) 14:53, 30 September 2023 (UTC)
Sokkjo has edited a lot of the modules lately, but generally, his edits seem sensible, so I don't think this behavior really requires a change in policy, possibly just a gentle admonishment to think twice about vague categories and ensure that there are a solid dozen to 20 terms that can plausibly fit in a category or that a larger category really needs to be broken up because it has 100/200+ entries. —Justin (koavf)TCM 15:03, 30 September 2023 (UTC)
For those reading:
I'm bound to make some mistakes here and there, so any oversight is more than welcome. -- Sokkjō 01:05, 1 October 2023 (UTC)

Dealing with user-competency categories for invalid languages

I am going through and cleaning up user-competency categories such as Category:User fr-4. One issue is that some categories refer to nonexistent language codes, such as 'bxr' (ISO 639-3 code for Russia Buryat; we have a single Buryat language 'bua'); 'kv' (Komi; we have separate codes for different Komi lects); 'eml' (Emigliano-Romagnolo; we have separate codes for Emilian and Romagnol); 'hmn' (Hmong; we have this as a family code); and most problematically, 'hr' (Croatian) and 'sr' (Serbian) (these latter two are specified using {{movecat}}, which says that they "should be empty", but they're not). What should be done? I have been deleting such categories when they're empty, but many are not. I can think of various solutions:

  1. Add support to the category-handling code for such codes, with a manually-specified language name, and a prominent warning issued stating that this is not a valid language code.
  2. Delete the categories even when populated, and leave them deleted ({{auto cat}} won't work on them since it won't recognize the code).
  3. Ask the users in question to change their Babel boxes. (IMO this is unlikely to work; many or most of the users in question are inactive and some are blocked.)
  4. Go ahead and change the users' Babel boxes to contain the right code, when we can determine what it is. (Problematic for various reasons.)

Thoughts? Benwing2 (talk) 01:38, 30 September 2023 (UTC)

FWIW, if we finally remove inactive users (per vote) from the categories, that + asking the remaining, active users to change should solve #3. For cases like 'bxr' and 'hr' (maybe even for 'eml') where the user is using a code that's valid in ISO and maybe on other wikis (where they may be transcluding their user page from) and just not accepted here, my preference would be to make {{Babel}} auto-shift the users into the right category (if possible), or allow the categories as subcategories ('bxr' as a subcat of 'bua', maybe 'eml' as a subcat of both 'egl' and 'rgn', etc), especially if we allow etymology-only codes. - -sche (discuss) 02:57, 30 September 2023 (UTC)
@-sche I think changing {{Babel}} to shift categories or simply not categorize in certain cases is a great idea. However, one big hiccup is that many user pages use the #babel parser function (see mw:Extension:Babel) rather than the {{Babel}} template, which we have no direct control over. I think the correct solution is to file a bug report to disable this extension (if possible), and do a bot run to convert all uses of #babel to {{Babel}}. There are some real weirdnesses in #babel; e.g. it supports codes 'mis' ("unsupported language"), 'zxx' ("no linguistic content"), and 'und' ("undetermined language"), and one user with < 20 edits claims to be fluent in all three. Benwing2 (talk) 03:40, 30 September 2023 (UTC)
And the text in the babel parser function comes from translatewiki.net, which means we can't edit it, which is highly unusual for a WMF project. Soap 00:14, 1 October 2023 (UTC)
@Soap Yup, another issue. If we make our own, I can easily write a script to convert the existing localization data (which appears to be here: ) to a local module that we can then edit. Benwing2 (talk) 00:20, 1 October 2023 (UTC)
And that data is sometimes wrong, e.g. it uses Danish text for the Greenlandic Babel (such as kl-1, whereas we have Greenlandic text), an issue noticed as early as 2015 but unfixable because even the users of translatewiki are (or were, according to that 2015 discussion) unable to edit pages on the wiki. Nonetheless, I suspect the developers might not be happy to turn off something they spent time developing. No harm in asking, but it might be easier for us to just use an edit filter to prevent people from adding it, advising them in the warning message to use {{Babel}} instead, and switching existing uses. - -sche (discuss) 02:52, 1 October 2023 (UTC)
@-sche Another good idea :) ... I'll wait for further comments before implementing. Benwing2 (talk) 02:59, 1 October 2023 (UTC)
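
The two ideas floated in this thread (shifting users into the right category, or simply not categorizing for certain codes) could be sketched roughly as below. This is only an illustrative Python sketch, not how {{Babel}} is actually implemented; the names `CODE_REMAP`, `SKIP_CODES` and `competency_categories` are hypothetical, the remappings are limited to those mentioned above ('bxr' to 'bua', 'eml' to both 'egl' and 'rgn'), and treating 'mis', 'zxx' and 'und' as uncategorized is an assumption made for the example.

```python
# Hypothetical table: codes that are valid elsewhere but not accepted here,
# mapped to the code(s) whose categories the user should land in.
CODE_REMAP = {
    "bxr": ["bua"],         # Russia Buryat -> the single Buryat language
    "eml": ["egl", "rgn"],  # Emiliano-Romagnolo -> Emilian and Romagnol
}

# Assumption for illustration: these codes produce no category at all.
SKIP_CODES = {"mis", "zxx", "und"}


def competency_categories(code: str, level: str) -> list[str]:
    """Return the user-competency categories for one Babel entry."""
    if code in SKIP_CODES:
        return []
    # Codes without a remapping are passed through unchanged.
    targets = CODE_REMAP.get(code, [code])
    return [f"Category:User {target}-{level}" for target in targets]
```

With such a table, a Babel entry for 'bxr' at level 2 would be placed in Category:User bua-2, while an entry for 'fr' at level 4 would stay in Category:User fr-4. Codes such as 'hr' and 'sr' are deliberately left out of the table, since the thread does not settle what should happen to them.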

Merging the middle and mediopassive labels

This is ridiculous. Both terms mean the exact same thing, so they should also print the exact same thing. Thadh (talk) 23:31, 30 September 2023 (UTC)

Which term do you recommend using? —Justin (koavf)TCM 23:38, 30 September 2023 (UTC)
I honestly don't care, so if people have a preference, I'll be happy to choose either one. Thadh (talk) 23:43, 30 September 2023 (UTC)
@Thadh You are presumably referring to the form-of tags 'middle' and 'mediopassive' in Module:form of/data? I'm not sure how they got in this state and I don't have a strong opinion about this but they are arguably different, in that mediopassive is the merger of middle and passive rather than the same as middle. E.g. Ancient Greek has both a middle and passive voice, whereas Modern Greek has a mediopassive that subsumes both functions. Benwing2 (talk) 00:27, 1 October 2023 (UTC)
@Benwing2: The same problem is also present at Module:labels/data.
I don't think that this distinction is one many linguists - if any at all - follow consistently, because the function of the middle is not defined, and is rather a catch-all term for "voice that is neither active nor passive but something in between". There is also absolutely no language having a contrast between the two, and indeed Ancient Greek grammars call its mood "mediopassive" just as frequently as "middle". Thadh (talk) 01:33, 1 October 2023 (UTC)
@Thadh You may be right but "absolutely no language" seems a strong statement. Modern grammars of Ancient Greek, for example, say there are three voices in at least some tense/aspect combinations. Cf. Wikipedia Mediopassive voice:
Ancient Greek also had a mediopassive in the present, imperfect, perfect, and pluperfect tenses, but in the aorist and future tenses the mediopassive voice was replaced by two voices, one middle and one passive.
English similarly has a three-way syntactic distinction between "The pilot landed the plane", "The plane landed", and "The plane was landed by the pilot". Benwing2 (talk) 01:51, 1 October 2023 (UTC)
@Benwing2: I think you misunderstood, I referred to the contrast between middle and mediopassive. What you call "mediopassive" is simply the middle voice used for functions that would usually be called passive, but it is pretty common for any grammatical category to take over the meaning of another grammatical category in specific situations. Thadh (talk) 19:37, 1 October 2023 (UTC)
@Thadh: I see, you are saying no single language has a contrast between middle and mediopassive voice. But nonetheless the meanings seem different, and in general there is a current preference towards including tags for the appropriate language-specific terminology. Cf. this comment in Module:form of/data:
NOTE: In some cases below, multiple tags point to the same wikidata, because Wikipedia considers them synonyms. Examples are indirect case vs. objective case vs. oblique case, and inferential mood vs. renarrative mood. We do this because (a) we want to allow users to choose their own terminology; (b) we want to be able to use the terminology most common for the language in question; (c) terms considered synonyms may or may not actually be synonyms, as different languages may use the terms differently. For example, although the Wikipedia page on w:inferential mood claims that inferential and renarrative moods are the same, the page on w:Bulgarian_verbs#Evidentials claims that Bulgarian has both, and that they are not the same.
Benwing2 (talk) 20:11, 1 October 2023 (UTC)
@Benwing2: Hm, I guess that actually makes sense. But can we still at least link them to the same article? It seems to be causing confusion for readers that one links to "mediopassive voice" while the other just to "voice", with no hint at the difference (or rather agreement) between the two. Thadh (talk) 22:48, 1 October 2023 (UTC)
Note how the merging would not only affect {{lb}}, but also {{infl of}}, which to me seems to be a greater problem.
I don't know the Greek situation, but Albanian is usually analysed to have three non-active voices: middle, passive and reflexive. These are just functions, and are all conjugated the same way in the "mediopassive conjugation". Albanian sources mainly just call the conjugation "passive", pësor, while grammars in other languages refer to it as "mediopassive", although recent Anglophone papers seem to be gradually trying to shift towards "non-active".
Albanian mediopassive verbs say {{infl of|sq|X||mp}} whenever an active counterpart exists. If we make "mediopassive" an alias of "middle", that would be a problem since saying that a verb is the "middle of X" would imply the other two functions of the mediopassive conjugation, passive and reflexive, would not be used for the verb.
In any case, my understanding of the middle voice is, as Thadh provided, "voice that is neither active nor passive", hence middle does not encompass passive by its very definition. In Albanian, examples of the middle voice (function) would be verbs like "jump", "rotate", "struggle", provided the verb is in the mediopassive conjugation; if it's in the active conjugation they're just called intransitive. This is clearly distinct from the passive function. Of course I stress again I'm strictly referring to Albanian terminology. Catonif (talk) 12:24, 1 October 2023 (UTC)
At the very least, if we were to keep a separate "middle" label, it should be displayed as "middle voice", and I'd be inclined to do the same for "active" and "passive". All three are generic words in English, so they might be confusing to the vast majority of our users, who are not linguists. "middle voice"/"active voice" is less ambiguous, because those are established terms in linguistics, and looking them up on e.g. Google would likely surface grammar-related results near the top. "mediopassive" doesn't need that treatment, because - as far as I'm aware - this word isn't used outside linguistics. Chernorizets (talk) 20:42, 1 October 2023 (UTC)
@Chernorizets: I am not keen on emphasising 'voice'. In early Middle Indic, such as Pali and Buddhist Hybrid Sanskrit, the Old Indic 3-way contrast of active, middle and passive became a 4-way formal contrast of active v. passive sense, each with active and middle forms. @Dragonoid76. The semantic difference was bleached between the active and middle forms, and grammars don't agree on the assignment of some forms between active and middle. (The middle forms are generally infrequent.) The passive active seems to be a Middle Indic innovation.
@Benwing2: Can we customise the labels of {{inflection of}} by language or would we have to fork the base module? I wouldn't want Latin's 'present indicative active' to become 'present tense indicative mood active voice'. --RichardW57m (talk) 09:15, 2 October 2023 (UTC)
@Chernorizets: I see that the glossary, linked to from the labels, has been updated to reduce the need to do crude searches. --RichardW57m (talk) 09:27, 2 October 2023 (UTC)
@RichardW57m I only updated the labels used in {{lb}}, not the tags used in {{infl of}}, for the reasons you mention. Benwing2 (talk) 09:37, 2 October 2023 (UTC)
@Benwing2: Replying to the wrong paragraph is very confusing. The key point is that the user's first port of call should be the hyperlinks on the displayed labels, and the text linked to has now been improved. --RichardW57m (talk) 09:54, 2 October 2023 (UTC)
Sometimes the reply tool pushes a response downward from where it was intended if a subsequent paragraph is indented in a certain way. I haven't figured out the pattern yet, but I noticed somewhere up above in a different thread it looked like someone had replied to me but their comment was clearly intended to be placed on the next highest thread. Sometimes I use the manual editing mode to make sure my reply goes where I want it, but not always, because it can be inconvenient for very long threads. Soap 20:23, 2 October 2023 (UTC)
It might be my idiosyncratic way of looking at it, but I don't see middle and mediopassive as exactly the same thing in Ancient Greek. In the strictest sense, mediopassive forms are present, imperfect, perfect, and pluperfect forms that are used for passive meanings and non-passive meanings depending on the verb. Middle forms are aorist and future forms that are mostly used for non-passive meanings, and that contrast with aorist or future passive forms that are mostly used for passive meanings. I say mostly because meaning doesn't always match form: in some verbs aorist middle forms have a passive meaning and in some verbs aorist passive forms have a non-passive meaning. I think this is especially common in more archaic Greek, such as epic poetry; for instance the weird aorist passive form of φαίνω, φᾰ́νην (phánēn, I appeared), identified as such by its ending.
"Middle" (search) seems more common in labels in Ancient Greek entries than "mediopassive" (search). I don't know how the two labels are used by other people and I haven't edited Ancient Greek entries in a while, but I believe I used "middle" rather than "mediopassive" as a label for senses that apply to present, imperfect, perfect, or pluperfect mediopassive forms as well as aorist or future middle forms. To me the "middle" label clearly includes mediopassive, but "mediopassive" leaves me wondering whether the aorist or future middle are included, or the aorist or future passive, or both, or neither, so I avoided it. — Eru·tuon 04:28, 2 October 2023 (UTC)
When I use {{infl of}} in Ancient Greek verb-form entries, I use |mp for forms in the present, imperfect, and perfect that could be either middle or passive depending on context, and "middle" for forms in the future and aorist that can only be middle, not passive. If we decide to get rid of the term "mediopassive" (which is a term I never heard in all the years I studied Ancient Greek), I think it should be replaced by |mid//pasv rather than by |mid. —Mahāgaja · talk 09:39, 2 October 2023 (UTC)
@Mahagaja: I think my main point of disagreement with both you and Erutuon is that while I view this issue from a purely morphological point of view (middle being the -μαι/-μην inflections, passive being the -σομαι/-θην inflections), you seem to make the distinction on the syntactical plane (middle being the "plane landed" voice, passive being the "plane was landed" voice).
I personally prefer the morphological distinction being held up in the form-of template at the very least. For the labels, I'm not quite sure which one is preferable. I do agree with @Benwing2 that we should have some way to account for language-specific terminology, but I also think that we shouldn't confuse people with thoughts of a morphologically distinct middle and mediopassive. Thadh (talk) 10:28, 2 October 2023 (UTC)
I don't think that "mediopassive" is a distinct voice (or any kind of morphological category); for me it's just a cover term for forms that can be either middle or passive. I really only use it as a shorthand for |mid//pasv. —Mahāgaja · talk 10:42, 2 October 2023 (UTC)
Exactly, that was my point, too, that the mediopassive is only a syntactic merger of the middle and the passive, and afaik always morphologically using the middle forms. Thadh (talk) 10:48, 2 October 2023 (UTC)
I think we should keep things the way they are. The mediopassive isn't the same thing as the middle or the passive ... it's a single grammatical voice that covers both semantic functions. The term exists because it's a needed distinction. Soap 20:23, 2 October 2023 (UTC)