Requesting a template rf-lemma or rf-entry, meaning "creation of this lemma is wanted (urgently)". Why?
Because parts of etymologies that refer to different entries get moved to a new, empty page or to a new L2 section on the same page (following the principle "no repetitions"). E.g. I have moved part of the etymology from Modern Greek ακανθόχοιρος (akanthóchoiros) to a new page for the red link to Koine ἀκανθόχοιρος (akanthókhoiros).
@Sarri.greek: I believe {{rfdef|<lang>}} is the template you're looking for. However, there is not much point to just moving the etymology if the new word (ἀκανθόχοιρος in this case) doesn't even have a definition. Yes, we should avoid "repetitions", but if the new word doesn't have a definition, it's not "repetition" to simply not create the word in the first place.
Also, please be careful when moving information, as you moved the "displaced ..." part, which applies to the Greek lemma and not the Ancient Greek lemma. I also could not find the word in Hesychius.
Thank you @kc_kennylau; sorry, I did not get a ping from my browser for your {reply}. Oh! And sorry for my mistake while moving material (I make more mistakes as, alas, I am getting older...). About a {wanted} or {requested} lemma: it is especially needed when the 'other' language is on the same page; that happens a lot in Greek. There is no way (orange links and the like do not help) to make an urgent call to create a lemma. Pages like Wiktionary:Requested entries (Ancient Greek) are wishlists. But an urgent need means there is a gap, a need for a lemma. Such a template would ask editors who might be interested in creating lemmata in this language to begin first and mainly with the {wanted} calls. Thank you. ‑‑Sarri.greek♫I03:30, 4 April 2024 (UTC)
Etymology tree testing
As @AG202 and others pointed out, it is very important to test out major changes before implementing them in mainspace. The problem is that I don't know everyone's use cases. Therefore, I invite the community to suggest terms to create an etymology tree for. If the output is undesirable in some way, I can tweak the template to give a better result. Before commenting, please note the following: 1. No test trees will be created in mainspace. 2. The template requires each ancestor to have an entry. If your entry says "from English redlink1, French redlink2, from Latin redlink3", I can't really do much with that. 3. If there's a problem with the output, please give constructive criticism so I can fix it.
To start, here's an example of a hypothetical Swedish term which is borrowed from English father and calqued from Old English fæder.
This would be generated by: {{etymon|sv|id=whatever|bor|en>father>male parent|calque|ang>fæder>father|tree=1}}
It seems like the Persian character is tall enough to slightly overflow the box. I'm not sure what I can do about this, since the font size is actually being pumped up by the {{m+}} template which is used within the box. Ioaxxere (talk) 14:33, 2 April 2024 (UTC)
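For readers unfamiliar with the syntax, the {{etymon}} call quoted above can be picked apart mechanically. The following is only a rough Python sketch of how its visible parts might be read, not the template's actual implementation: the function name parse_etymon_call is hypothetical, and treating the third ">"-separated piece as a "label" is an assumption, since the thread does not say what that piece means.

```python
from typing import Dict, List, Optional


def parse_etymon_call(wikitext: str) -> Dict[str, object]:
    """Split an {{etymon|...}} call into its visible parts (illustration only)."""
    text = wikitext.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("expected a {{template|...}} call")
    name, *args = text[2:-2].split("|")

    named: Dict[str, str] = {}
    positional: List[str] = []
    for arg in args:
        if "=" in arg:
            # Named parameters such as id=whatever or tree=1
            key, value = arg.split("=", 1)
            named[key] = value
        else:
            positional.append(arg)
    if not positional:
        raise ValueError("missing language code")

    lang, rest = positional[0], positional[1:]
    etymons: List[Dict[str, Optional[str]]] = []
    relation: Optional[str] = None
    for item in rest:
        if ">" in item:
            # Assumed shape: source language > term > label
            pieces = item.split(">")
            etymons.append({
                "relation": relation,
                "lang": pieces[0],
                "term": pieces[1],
                "label": pieces[2] if len(pieces) > 2 else None,
            })
        else:
            # A bare keyword (bor, calque) applies to the etymons that follow it
            relation = item
    return {"template": name, "lang": lang, "named": named, "etymons": etymons}


call = "{{etymon|sv|id=whatever|bor|en>father>male parent|calque|ang>fæder>father|tree=1}}"
parsed = parse_etymon_call(call)
```

Under these assumptions, the example yields one "bor" etymon (English father) and one "calque" etymon (Old English fæder), with id and tree kept as named parameters.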
While I agree with the reservations expressed by others in the earlier March discussion on this same proposal, I have to say this looks pretty good already. If the infrastructure behind it were robust enough, it could become a pretty neat addition to the project. — Mnemosientje (t · c) 13:47, 2 April 2024 (UTC)
100% agree that this is visually interesting, more intelligible, and handy. I totally support this and have only minor cosmetic tweaks to suggest. —Justin (koavf)❤T☮C☺M☯16:49, 3 April 2024 (UTC)
@Ioaxxere A bit belated, but I like the idea. My only (minor) criticism is to ask why the text inside the boxes is so large. Is there any particular reason for that? Could we reduce it to the same size as the other ordinary text found in the rest of a given entry? (Or is it just that way on my display, for whatever reason? Not sure if it shows up normal-sized for other people.) — Vorziblix (talk · contribs) 19:46, 10 April 2024 (UTC)
@Vorziblix: Yes, the font size was (by default) slightly larger than regular text (16px versus 14px). I've gone ahead and made it smaller (although I'm not totally sure whether it's better this way...). Ioaxxere (talk) 20:13, 10 April 2024 (UTC)
This looks pretty neat! I want to suggest using a tabular format (left- or right-alignment, like two columns) instead of center-alignment for the boxes. It is much easier to read and compare similar data when they are aligned with each other, rather than arbitrarily aligned due to varying text width. (I'm not sure how it would best be implemented – maybe a consistent width for language name and etymon – but anyway should be tested on mobile screen sizes.) I also find the small "?" icon not immediately intuitive (despite the tooltip) and not emphatic enough for representing anything that is iffy (i.e. too easy to miss), and would prefer spelling out the word "uncertain" instead, perhaps with a change in box color contrast as well. It would be nice if the collapsible box header said "Etymology tree for <word>" instead of "Etymology tree". Hftf (talk) 22:10, 17 April 2024 (UTC)
@Hftf: Thank you for the feedback! 1) Using a tabular format would mean packing everything into a rectangular grid, which I don't think would look good. 2) My ideal solution for uncertainty would be to have a dashed line, but unfortunately this doesn't seem to be technically possible. I'm not sure I like the appearance of uncertain spelled in full — but I'm fine with doing that if other people agree. Another idea would be to change the colour of the box itself, so every uncertain etymon might be pink rather than beige. I don't want to make any major design changes at this point though, since I'm preparing to start a vote on the template in the next few days (see #Etymology tree vote). 3) The idea of the template is to be used on the entry page itself, and of course there's no need to remind the reader what entry they're on. However, if we started adding trees to other pages you would be absolutely right. Ioaxxere (talk) 17:16, 18 April 2024 (UTC)
Thanks for the response! I don't mean packing everything into a rectangular grid per se – but would it be possible to experiment with using fixed widths (couple hundred pixels, but overrideable) and a left (or right) alignment for the two primary pieces of information (language and etymon) at least? The dynamic non-space-constrained medium of the screen allows us to not need to rely heavily on abbreviations that often plague most printed dictionaries, while balancing screen size constraints, sleek aesthetics, information density, colorblindness accessibility, and other factors. I still think there is net value in a "redundant" header – taking a screenshot of just the tree widget to embed somewhere else, for instance, would benefit from it, and that space is being used for nothing anyway. Hftf (talk) 00:00, 19 April 2024 (UTC)
As has been discussed a number of times, including Wiktionary:Beer_parlour/2016/August#Suggestion_for_sense_tags_on_antonyms (which I stumbled upon while trying to find a different discussion where I suggested the same thing), it perennially confuses lots of people that we write things like "(to start work): clock in, clock on, punch in, go on the clock" ... so I've gone ahead and made the T:antsense template Benwing proposed in the 2016 discussion; you can see it in use now, displaying "(antonym(s) of "to end work"): clock in". Please feel free and encouraged to make the template better, find a better name, whatever... once it's in a state everyone's happy with, maybe we can bot-deploy it to entries and finally end this enduring source of confusion... - -sche(discuss)02:26, 2 April 2024 (UTC)
@-sche: I just now found out about this template, and honestly, it confused me even more, and in the same way you pointed out above ("big" does not have a sense that is translated as "antonym of big"...; and furthermore in some language it might!). In my opinion, the only way to solve this so that it is clear what is meant, is by doing something like suur#Antonyms (bar the new template), where the meaning of the antonym is given within the link template. Thadh (talk) 21:53, 11 April 2024 (UTC)
Hmm... so you interpret
Antonyms
(antonym(s) of "begin working time"): clock out
as saying antonym(s) of "begin working time" is one of the definitions of clock in? (or that it is the definition of clock out?) If anyone else interprets it that way, I hope they'll chime in; to me, the parenthetical (s) and the fact that part of the text is set off in quotation marks and quotes the definition further up the page makes it very unlike anything that appears in our definitions of words — I don't think any language has a single word that we would translate as "antonyms, plural, of quote "big" unquote", we would definitely format it differently, without quotation marks, and "antonym" would be exclusively singular — so I would not have expected anyone to interpret it as a definition. But your comment shows we could benefit from being even more explicit! What if we change the wording to (the following word(s) is/are antonym(s) of "begin working time"): clock out? (The beauty of this being a template is that we can change the wording in one place and it will propagate out to all entries.) - -sche(discuss)22:15, 11 April 2024 (UTC)
We don't always use {{sense}} to give the gloss already used in the entry, sometimes we make it a little more specific. If an entry has a meaning "not X" (which happens quite often, English doesn't always have a good equivalent for these), then it would actually make sense for me to say "antonym of X" as a sense. The parenthesised "(s)" indeed does make it less likely, but it's also pretty easily missed.
It's possible that I interpret it that way because I am used to antonym sections using the sense as in the entry, rather than describing the following term, but on the other hand, so are our readers, right? Thadh (talk) 23:55, 11 April 2024 (UTC)
@-sche: Found a use case that is even more confusing with the new template: oikia. There is no sense "side", but there is a sense that describes a side (namely sense 1). Especially in the case of many definitions with similar translations into English, it's useful to define either a label or a description in {{sense}}. Thadh (talk) 11:02, 30 April 2024 (UTC)
I agree that adding some clarificatory description of which sense of right is meant is helpful... and I'd go as far as to say it'd be helpful in the definition, too, given how polysemous bare "right" is. I would clarify the definition from "right" → "right (side)", and then copy that down into the T:antsense, and if there were any synonyms, then also into the T:sense next to them, because I don't think it'd be any more helpful for the definition and T:sense gloss there to intentionally mismatch, either. Frequently, when a {{sense}} or e.g. {{trans-top}} gloss doesn't correspond to a definition, e.g. one being "right" and the other being "side", it's because someone changed, RFV-failed, or otherwise removed a definition without updating the corresponding synonyms and antonyms, translations, etc, so I would prefer to avoid having an entry intentionally mismatch, when it'd be straightforward to just have "right (side)" in both the definition and T:antsense (and T:sense if there were synonyms). - -sche(discuss)14:42, 30 April 2024 (UTC)
@-sche: I would put "side" both in the sense of the synonym and in the sense of the antonym, if that's your concern in the first part of your message. For "right" I guess giving a clarificatory "(side)" in the entry might be useful, but consider the following situation:
You have a translation of kiwi glossed in the entry as "bird of the genus Apteryx", "plant of the species Actinidia deliciosa" and "fruit of the species Actinidia deliciosa" by sense, and then a synonym section and an antonym section. Now, to me it seems easiest to just provide {{sense|bird}}, {{sense|plant}} and {{sense|fruit}} as possible senses, the whole {{sense|bird of the genus Apteryx}} etc. seem rather long.
However, the phrase "antonym(s) of “fruit”" would give the impression that some sense is missing, denoting just any fruit, since it's in quotation marks and everything. Thadh (talk) 15:52, 30 April 2024 (UTC)
Hmm... personally, I would not consider using just "fruit" in T:antsense to be any more confusing than using it in T:sense, or in between the quotation marks of the gloss parameters of {{m}}, {{bor}}, etc, where we also provide just a short adequate summary of the relevant sense (in quotation marks and everything) and don't repeat the whole long definition. If anyone else finds it confusing, I hope they'll chime in, though. (To respond to some things you've said here and about this in the T:tcl section below,) I know you feel I act unjustly like your opinion is a minority opinion, and I'm not sure what to say to that other than: I hope that if other people agree with you, they'll speak up, as that would be very helpful. I trust and appreciate that you advocate what you think is best for readers. I also know that in particular cases, including when you argue it's worse to make readers read an extra two words here vs. to expect them to "figure out that antonym sections' sense template shows senses of the entry (rather than the antonym)", the evidence seems to be to the contrary. Over more than a decade, I (and other users) watched hundreds of users diametrically misunderstand the old way antonyms were presented and be motivated enough to (mistakenly) edit the entries in an attempt to fix the apparent error, changing the glosses via edits like this which I and others perennially had to undo. (And it seems reasonable to me to suspect that the total number of readers who misunderstood the old way antonyms were presented, or who misunderstand anything, is greater than the number who are motivated enough to edit the entries or comment on discussion pages.) Rua observed so many readers/editors doing this so much that she too wanted to change the template's wording several years ago... and people at that time too argued not to add extra wording, and argued people would figure T:sense out in time... 
then Benwing, having observed that people continued to misunderstand the old way antonyms were presented, proposed T:antsense... and people again held off back then ... but even when given eight more years, for a total of eighteen years, to "figure out that antonym sections' sense template shows senses of the entry", readers did not in fact do so. The idea that they would do so was demonstrated to be mistaken; readers were still making edits based on misunderstanding the old system right up until the week we introduced T:antsense. Since then, I haven't seen any readers be confused by T:antsense. It's still young, and maybe in time we will start to see readers respond to it in ways that suggest further tweaks are needed... but for now, it seems to have solved the widespread, long-running problem the old system caused. - -sche(discuss)18:29, 30 April 2024 (UTC)
Many of our entries, like king#Etymology_1, include long lists of cognates. Would anyone be interested in a template that could be used to generate these kinds of lists automatically? The template would work by adding terms into a category which would be accessed to get a list of cognates. Ioaxxere (talk) 19:24, 3 April 2024 (UTC)
@Ioaxxere: That's really a job that is done by Proto-Germanic *kuningaz. The only benefit of this sort of template would be that if I had a Proto-Tai descendant on my watchlist, I wouldn't get alerted every time someone added the form from yet another Zhuang dialect. --RichardW57 (talk) 20:54, 3 April 2024 (UTC)
Like Richard said, I don't think this is really necessary, and I am personally of the opinion that these long lists of cognates do more harm than good, so we should rather remove them than make it easier to generate more of them. Thadh (talk) 20:58, 3 April 2024 (UTC)
@RichardW57, Thadh, Nicodene: Would each of you mind confirming or correcting my impression that your objection to Ioaxxere's idea is really a more general objection to long lists of cognates (such as the one in king#Etymology 1, thrice (!) the length of the rest of the etymology) and their facilitation, please? Ioaxxere currently has a vote running to introduce an “etymology tree” template, which I think would be particularly good for displaying cognate relations in an elucidating way; imagine, for example, a display like that in the collapsible box entitled Etymology tree of English puny in Wiktionary:Beer parlour/2024/March#New design, but with inverted branching. What do you three think of that idea? 0DF (talk) 10:33, 29 April 2024 (UTC)
That is why I object, yes, and I'm not sure how a tree model is feasible considering the sheer number of descendants that one word can have, as well as the fact that descendants can and often will split from multiple different nodes. Nicodene (talk) 10:44, 29 April 2024 (UTC)
@Nicodene: I agree that it won't be feasible in every case, such as in the case of Proto-Germanic *kuningaz. It would need to be for a “trimmed-down” selection of cognate terms, I expect. To spitball, perhaps that would be achieved using a |cogs= parameter, with languages specified in the manner |cogs=en,fy,pdt,nl,de,is and individual terms specified in the manner |cogs=en:king,fy,pdt,nl,de,is:konungur. 0DF (talk) 11:05, 29 April 2024 (UTC)
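To make the spitballed |cogs= format concrete, here is a minimal Python sketch of how such a value could be split into language codes with optional explicit terms. No such parameter exists yet; the function name parse_cogs and the choice to represent "no term given" as None are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple


def parse_cogs(value: str) -> List[Tuple[str, Optional[str]]]:
    """Parse a hypothetical |cogs= value such as 'en:king,fy,is:konungur'.

    Each comma-separated item is either a bare language code or
    'code:term'. A bare code is returned with None as its term, leaving
    the choice of term to whatever lookup the template would perform.
    """
    result: List[Tuple[str, Optional[str]]] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        lang, sep, term = item.partition(":")
        result.append((lang, term if sep else None))
    return result


cogs = parse_cogs("en:king,fy,pdt,nl,de,is:konungur")
```

For the second example in the comment above, this gives explicit terms only for en and is, and None for fy, pdt, nl and de.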
This could be an interesting idea. I had a similar idea once for a separate cognate template - I usually list some number of Lechitic cognates and something like this could be convenient. Vininn126 (talk) 11:08, 29 April 2024 (UTC)
@Vininn126: Yes, I am hoping such a thing will be possible for the cognates of German Reichsrat, all of which derive from Proto-Germanic *rīkiją (“ric, realm”) + *rēdaz (“rede, counsel”). In contrast to monomorphemic terms like English king (for which one may simply link to Proto-Germanic *kuningaz), there's no way to point to all those parallel formations other than by indicating cognates, AFAICT. 0DF (talk) 15:38, 29 April 2024 (UTC)
@0DF: Yes, connecting parallel formations to German Reichsrat is possible to do automatically even if we can't reconstruct Proto-Germanic *rīkijąrēdaz. However, I notice that the etymologies of the other Germanic terms don't include a cognate of German -s- (although maybe they should) so we can't be certain that it's really a parallel formation unless we account for this.
When I created this discussion I did specifically have a text-based list in mind rather than a tree. However, generating descendant trees is definitely something that I've planned on working on if the current vote passes. It will probably have to look different to the etymology trees due to how many descendants a single entry can have (see e.g. User:-sche/sugar ).
I'm intrigued by the ideas being discussed here! How did you choose the languages to specify in your |cogs example — is it just by number of speakers? Ioaxxere (talk) 18:37, 29 April 2024 (UTC)
@Ioaxxere: All Reichsrat’s cognates have a joining s and some, like Danish rigsråd and West Frisian Ryksrie, seem as if they could only have been formed with an -s- interfix, whereas others, like German Reichsrat and Icelandic ríkisráð, could have been formed by compounding with the first element in the genitive. Either way, I imagine that's a detail that can be omitted so as not to overcrowd the information presented by the “tree”. That tree of Proto-Indo-European *ḱorkeh₂’s descendants is intimidating, ha-ha! I'm guessing elu-prk should be pra-hel (Helu Prakrit), inc-mgd should be pra-mag (Magadhi Prakrit), pmh should be pra-mah (Maharashtri Prakrit), and psu should be inc-ash (Ashokan Prakrit). Should I correct them for him/her? For my |cogs= list, I chose inherited terms, that have entries, from what I assume to be the most familiarly-named non-historical languages (to native Anglophones) in each of Proto-West Germanic's immediate branches; so English from Old English, West Frisian from Old Frisian, Plautdietsch from Old Saxon, Dutch from Old Dutch, German from Old High German, and Icelandic from Old Norse. Whilst a language-name's familiarity is often proportional to its number of speakers, there are certainly more Danish and Swedish speakers than there are Icelandic speakers; I would be hard-pressed to estimate which of Danish, Icelandic, and Swedish has the most familiar name to native Anglophones, so I opted for Icelandic, largely because I find it a more interesting language than either Danish or Swedish. I wouldn't say that either West Frisian or Plautdietsch is really at all familiar to native Anglophones, but, then again, neither is any of the other languages in the branches that contain those languages. The rationale for choosing one language from each branch was that it would give a broad and representative selection of cognates without the number of cognates being overwhelmingly large. I hope that makes sense and seems fair. 
0DF (talk) 01:28, 30 April 2024 (UTC)
Go for it :) I forgot that page still existed when the Prakrit codes were finally standardized a while ago. I suspect it may be out of date by now, too, if anyone has added more obscure languages that inherited or borrowed descendants in the years(!) since I set it up. I was just trying to put everything that existed at that time in one place long enough to count it all, since our actual Proto-Indo-European *ḱorkeh₂ page, as you can see, requires people to click through multiple pages to piece together the whole list. (Whereas the Chinese entry just foists ~233 descendants on you at once.) - -sche(discuss)01:55, 30 April 2024 (UTC)
@-sche: Done! On the topic of “more obscure languages”, I added Khasa Prakrit as an intermediate stage between Ashokan Prakrit and Nepali सखर (sakhar) because Gujarati સાકર (sākar) and Hindi सकर, सक्कर (sakar, sakkar) also descend from Ashokan Prakrit; however, I did not add Kamarupi Prakrit → Early Assamese → Middle Assamese as intermediate stages between Magadhi Prakrit and Assamese শিকৰ (xikor), because there's no “branching” involved there. What Chinese entry are you referring to? I ask because the only Chinese term I can see in User:-sche/sugar is 扯克兒/扯克儿 (čeker, Beilu Yiyu), which has not yet been created. IMO, Proto-Indo-European *ḱorkeh₂ is the perfect place to put that extensive list of descendants, since it'll never have quotations or notes on currency, and is otherwise quite a stubby entry; AFAICT, reconstruction entries only exist for etymological interest, so why not load them up with etymological details? 0DF (talk) 16:08, 30 April 2024 (UTC)
@Ioaxxere: Or perhaps, rather, to Chinese 茶 (chá)? The massive descendants list is fine in Proto-Sino-Tibetan *s-la, but it might be excessive in the Chinese entry; perhaps a collapsible box is warranted in the latter case, or at least the use of {{see desc}} vel sim. for the Dutch and Persian branches (inter alia). There are probably better presentation schemes than the current one. 0DF (talk) 16:54, 30 April 2024 (UTC)
Yes, sorry, by "the Chinese entry" I meant Reconstruction:Proto-Sino-Tibetan/s-la (laxness on my part to call it that). I don't mind listing alllll the descendants on reconstructions pages like *s-la does (indeed, it could be useful), I just observe that in practice there seems to be a tendency (among at least some people) to move "chunks" of descendants to other pages and just append "see there for further descendants", see e.g. Reconstruction:Proto-Indo-European/dʰeh₁-. Of course, (as on *s-la) T:desctree could allow us to "chunk" them up onto different pages and still have them all be visible on Reconstruction:Proto-Indo-European/dʰeh₁-, *ḱorkeh₂, etc. How many people would find that useful vs would be unhappy with such massive lists, I don't know. Regarding how to make *s-la et al. more compact: we have a lot of excess whitespace (in general, in almost every aspect of the layout of our site), so right off the bat—especially because a lot of descendants on the *s-la page are very short—one idea would be to consider ways having multiple columns. If we simply divided the entire list all at once into two (or three) columns, it would make it hard to see the "level of indentation" of the different entries—although frankly, it's already hard for me to track what e.g. Cantonese is indented at the same level as by the time I've scrolled through the intervening massive Mandarin descendants list—so another idea is to only columnize certain things, e.g. "Russian" stays where it is, and everything descended from Russian is in two columns (and likewise for other long lists of mostly same-level descendants). Because we're using indentation to convey what descends from what, even that would probably make it harder to track whether the items in the second column were also descended from Russian, so maybe we could add lines so that the visual result was like this (but ideally with the lines being added by CSS or something). - -sche(discuss)17:48, 30 April 2024 (UTC)
@-sche: I like the mockup! I definitely think we're going to need multiple columns. My screen is wide enough to comfortably fit at least four columns of text, so there's huge potential here in terms of improving readability. The only way I can tell what Cantonese is indented under is by putting my mouse just to the left and scrolling up until I hit something. I and a couple others on the Discord were discussing creating a tree à la {{etymon}}, but horizontal (so descendants are directly to the right of ancestors). Ioaxxere (talk) 04:38, 1 May 2024 (UTC)
Yes, there is an aesthetic objection to long lists of cognates, which general sentiment is against anyway. I also don't like the way they're maintained at present - the pages they're on keep changing. --RichardW57 (talk) 18:01, 29 April 2024 (UTC)
@RichardW57: I share your and the general dislike of long lists of cognates. I’m not familiar with how they’re maintained, with the way “the pages they’re on keep changing”, however; could you elaborate and exemplify, please? 0DF (talk) 00:28, 30 April 2024 (UTC)
An example is Thai น้ำ (náam, “water”). As a new cognate is properly (but usually based on nothing more than a dictionary or vocabulary list) added to Wiktionary on its own page, it gets seemingly manually added to the source of the page of each cognate already listed, with no more than the automatic change comment saying the etymology section has been changed. (And of course, that type of automatic change comment doesn't distinguish changing a section from adding an adjacent section.) RichardW57 (talk) 05:59, 30 April 2024 (UTC)
@RichardW57: Do you mean changes like these four? That seems more like an objection to non-specific edit summaries to me. And if you dislike long lists of cognates, why do you object to their being placed in a collapsible box, thereby reducing the visual noise they present to the reader? 0DF (talk) 16:42, 30 April 2024 (UTC)
I hadn't expressed an objection to a collapsible box, but that has its own disadvantages. To review the change, I have to expand the box or somehow have it expanded by default, and I still get notified whenever the list is expanded. --RichardW57 (talk) 00:11, 1 May 2024 (UTC)
Thus the name appears with the combining character U+033A being displayed on the dotted circle for demonstration purposes.
U+0949 is a combining character as well, and indeed if you try to highlight the character in the first category name, you are forced to select the space as well. However, visually the dotted circle is also rendered, at least on my browser.
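The claim that both code points are combining characters can be checked against the Unicode Character Database. Below is a small Python sketch using the standard unicodedata module; the helper names is_combining_mark and display_form are hypothetical. Note that it tests the general category (any "M" category) rather than unicodedata.combining(), because spacing marks such as U+0949 have a canonical combining class of 0.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # conventional placeholder base for showing a mark alone


def is_combining_mark(ch: str) -> bool:
    """True if the character's general category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def display_form(ch: str) -> str:
    """Prefix a combining mark with a dotted circle; leave other characters alone."""
    return DOTTED_CIRCLE + ch if is_combining_mark(ch) else ch


for code_point in (0x033A, 0x0949):
    ch = chr(code_point)
    print(f"U+{code_point:04X}", unicodedata.category(ch), display_form(ch))
```

How the resulting dotted circle plus mark sequence is actually drawn still depends on the renderer, which is the point raised in the replies.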
It's not that simple. Whether dotted circle plus mark renders properly is, or has been, renderer and script dependent. Sometimes the result is two dotted circles! --RichardW57 (talk) 04:42, 4 April 2024 (UTC)
It's not "script dependent" under any conventional definition of script, nor should we let the fact that some misbehaving renderers render two circles there (resulting in a minor visual glitch) take priority over correctness/consistency. Lunabunn (talk) 21:17, 23 April 2024 (UTC)
Requiring attribution when moving from one Wiktionary page to another
The edit summary looks sufficient to me. It would have been better, though, to just make a minor edit and say "the previous edit copied from hemoglobin"- although it was obvious enough what they were doing. Chuck Entz (talk) 14:46, 4 April 2024 (UTC)
That's correct, though the attribution can be as simple as saying where you got it from in the edit summary. Of course, not every change is distinctive enough to require attribution- we're talking copyvio or plagiarism, not dotting i's and crossing t's. Chuck Entz (talk) 14:25, 4 April 2024 (UTC)
@A westman: There's a policy WT:FORMS about handling alt forms and soft redirects. Probably you shouldn't have duplicated the information in two entries, because they may easily get out of sync and it's a maintenance burden. Also was your deletion of the Welsh entry actually intended? --Ssvb (talk) 15:45, 4 April 2024 (UTC)
One way to look at it is to treat the collective of Wiktionary editors as a single entity from the copyright standpoint. But if individual editors want to be always personally credited when moving every tiny bit of text from one entry to another, then this looks more like the case of malicious compliance. For example, in this diff I copied the texts of glosses from their English entries. Did I have to explicitly mention these entries in the edit's summary? Did I have to track the actual handles of the editors, who contributed these pieces of text in the first place?
In principle, the wiki engine could do the identification of copied text fragments automatically, reducing the need for manual labour. Yes, this would be very resource-intensive and probably won't be implemented any time soon. But theoretically this can be done. --Ssvb (talk) 15:14, 4 April 2024 (UTC)
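As a toy illustration of the idea of detecting copied fragments automatically, the Python sketch below compares two page texts word by word with the standard difflib module and reports shared runs above a length threshold. This is not a MediaWiki feature; the function name copied_fragments and the five-word threshold are assumptions, and a real implementation would have to compare against every revision of every page, which is where the cost mentioned above comes from.

```python
from difflib import SequenceMatcher
from typing import List


def copied_fragments(source: str, target: str, min_words: int = 5) -> List[str]:
    """Return runs of at least min_words consecutive words shared by both texts."""
    src_words = source.split()
    tgt_words = target.split()
    # autojunk=False so that frequent words are not silently ignored
    matcher = SequenceMatcher(None, src_words, tgt_words, autojunk=False)
    return [
        " ".join(src_words[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    ]


source_page = "the red protein in blood that carries oxygen around the body"
target_page = "noun : the red protein in blood that carries oxygen ; see also blood"
fragments = copied_fragments(source_page, target_page)
```

Each reported fragment would still need to be traced back to the revision that introduced it before any credit could be assigned.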
Wikipedia:Copying within Wikipedia may be relevant. Whatever the formal rules may be, mentioning in the edit summary that content is copied from a certain page is the decent thing to do. We spend a lot of time, energy and ability to write a good Wiktionary entry. We deserve some credit. Vahag (talk) 18:17, 4 April 2024 (UTC)
Some thought has been given to this issue on Wikipedia, and the result was w:Template:copied. The problem is that each individual's (surviving) contribution has to be acknowledged for as long as they retain copyright. Now, it can be done by implicitly referencing the change history of the source, but that only works while the change history remains accessible. That template attempts to protect this history, but I don't know how well that works. In practical terms, the history usually dies with the page. As {{copied}} on Wiktionary was deleted, I suspect the idea went down like a lead balloon on Wiktionary and the collective decision was to live dangerously.
The selection of which cognates to display is sufficiently original to be protected by copyright - the question is then whether the 'fair use'-like exemptions from copyright apply. Wikimedia seems only to worry about US law, but many of us might fall foul of local laws. --RichardW57 (talk) 17:59, 7 April 2024 (UTC)
Separate from the copyright question: A Westman, why were you moving the content from hemoglobin to haemoglobin in the first place? In the case of US/UK spelling differences, the thing we've been doing to be neutral (not inherently favouring one national variety or the other) is centralizing the content on the older entry, which in this case is hemoglobin (which is also, as an aside, the more common spelling). - -sche(discuss)16:58, 4 April 2024 (UTC)
Also just using the older spellings is not "neutral" because most of them are in US English (inherently not neutral). Which makes sense because most Wikimedians are American afaik. ✵A Westmantalkstalk18:53, 4 April 2024 (UTC)
@A westman Favoring UK/Commonwealth spellings is no more "neutral" than favoring US spellings. But in fact your AFAIK about most Wikimedians being American isn't even true; they are scattered across the world, and I actually get the sense more current Wiktionarians are British than American. (From what I've seen, there is a fairly random mixture of pages where the main version is hosted using the British spelling vs. the American spelling.) Benwing2 (talk) 19:40, 4 April 2024 (UTC)
For the last one from what I've seen there really isn't. And I never said UK spellings are neutral, what I did say is that there is no way to be neutral. US spellings are localized to the US and to an extent Canada and the Philippines but everywhere else CW English prevails. ✵A Westmantalkstalk19:54, 4 April 2024 (UTC)
Yeah, those who prefer UK spellings point to the number of countries that theoretically consider those standard (although how many people there actually use them, when English is not the main native language, is another matter), and those who prefer US spellings point to the number of greater uses of US vs UK spellings (greater number of works using them)... sometimes people try to work out how many native English users there are of one or the other... to avoid pages being moved back and forth because you come along in April and think your preference is the rational one, and then someone else comes along in May and thinks their own preference is the rational one, just use the older entry. However, when a spelling is not merely "alternative" but a national "standard", we could be using T:standard spelling of like this—that template postdates a lot of entries, so many don't use it yet. - -sche(discuss)20:30, 4 April 2024 (UTC)
No part of the page reaches the threshold of originality, in the jurisdiction I know. Luckily, the definition is written by late SemperBlotto in 2005, vouchsafing verisimilitude of lacking creativity. The collective of Wiktionary editors is no natural person and thus cannot be the author of a work, which is the requirement for copyright to arise, nor can there be co-authors if there is neither will nor imaginative power to create a joint work under a joint conception; instead contributions are driven by intrinsic logics of the subject matters and rarely collaborative. Fay Freak (talk) 17:17, 4 April 2024 (UTC)
@Fay Freak: The WMF:ToU#7._Licensing_of_Content mentions "the Wikimedia community" in its text, maybe not as the copyright holder, but it's still mentioned. The Terms of Use also explains that the outsiders only have to provide the article URL to comply with the attribution requirements of the license and that's sufficient. Not having this particular clause would render the content of Wikipedia unusable to be republished anywhere under the CC BY-SA license, because some individual Wikipedia editors would start pestering the re-publishers and demand to be given credit personally. Now back to Wiktionary. Reusing and copying parts of text from the corresponding English entries or from the cognates of the same word in other languages is rather common in Wiktionary. I have witnessed this myself many times. Can we possibly agree that mentioning this is generally not necessary in the edit summary? Because such copying either does not reach the threshold of originality or because the original text is still easily available just one wikilink away. I mean, via cognate links in the Etymology section or via links to the English words in the list of the foreign word senses. --Ssvb (talk) 01:09, 5 April 2024 (UTC)
Of course, either the threshold of originality is not reached (with list-like content, which is almost everything in Wiktionary, sometimes formulating weighing and illustrating aspects, as said in homoeopathic doses for what Blotto wrote there, who described chemistry like one is five), or anyone interested should ask himself and guess the provenience or attribution, or because authors with the licence agreed to collective attribution, even by implication of participation if not expressis verbis: I mean such a site works with a low entry barrier, we can’t rebase like in git commits. Even if a source got deleted one can just ask about the edit history if explaining a legitimate interest, this can play a role for establishing a copyright violation or at least proceeding, otherwise main character syndrome. Fay Freak (talk) 03:28, 5 April 2024 (UTC)
Some authors seem to have agreed to collective attribution, but I don't know an easy way of finding out who has. --RichardW57 (talk) 17:25, 7 April 2024 (UTC)
Limburgish nominal inflection
So this is apparently a meme in certain online communities. Looking at the table, even without any knowledge of Limburgish, it is clear that dative *berem cannot be correct and that the ‘locatives’ are all extremely suspect. Nor is it obvious what the ‘consonant mutation’ refers to (progressive assimilation?). The Limburgish page looks a lot more sane by contrast. 109.184.88.22019:16, 5 April 2024 (UTC)
@Benwing2 I've gone around and simply removed those, as I added most of them when I was transforming the original pages' bare wikitables into templates. I based this template on whoever originally added the inflection tables, but they seem either very specific to one specific dialect or just to have been wrong in describing nominal inflection in Limburgish. These questionable non-template inflection tables are still present in some cases, like geo, glee, hieër, hoes, kindj, krieëk, meule, wien, water, and hit. I haven't removed those yet as I didn't originally add them and you might want to ask the original contributor (if they are even active anymore) whether these are correct or can be removed. Though to my judgement they do seem problematic as well, so I'd be fine with removing those as well.
The "locative" case for example seems to have been someone who was too eager with generalising a rule over a few actually existing examples (specifically hieër(“lord”) → hieëves(“to the lord”) & heim(“home”) → heives(“to (the) home”)), instead of being a locative case it may just as well be a regular suffix -ves (which may be related to suffixes like English -ward and Dutch -waarts).
As far as I can tell the dative is identical in most dialects to the nominative, though they were different at some point I don't believe an -m has ever been present. The only dative marking that ever occurred seems to have been a simple -e. Similarly, the "consonant mutation" seems to just be progressive assimilation, which most spelling systems would not spell out anyway so I also don't see why that was added initially. BartGerardsSodermans (talk) 06:37, 6 April 2024 (UTC)
@BartGerardsSodermans Thank you! The original contributor is User:Ooswesthoesbes, who is sporadically active; but given what you've said, I am inclined to simply remove all the problematic tables regardless. I have a hard time believing, for example, there is a separate locative case in Limburgish. Benwing2 (talk) 06:45, 6 April 2024 (UTC)
The inflections given are based on the "High Limburgish" standardisation. After a community consultation on the Limburgish Wiktionary, we decided to drop it altogether and instead use no standardisation. As BartGerardsSodermans indicates, the assimilation is generally left out in spelling, and as such can be dropped as well.
While some dialects do differentiate in genitive, it is mainly tonal, f.e. daa~g vs. daa\g.
My advice would therefore be to remove the templates or replace them with the ones similar to those on the Limburgish Wiktionary. --Ooswesthoesbes (talk) 06:50, 6 April 2024 (UTC)
Last December, WingerBot went through hundreds of pages replacing definitions of countries with this template, which automatically transcludes the English definition on the various language-specific pages. The only reason I noticed this is because this broke some quotation templates' display, but even if we ignore this issue (which should be fixed as soon as possible), in my opinion using this template across languages is a terrible idea:
Languages are very different with regards to how they perceive the world, and proper nouns and even place names are not different. When an English speaker sees "largest country in the world", a Russian speaker sees "home country", whereas a Ukrainian speaker... Well, you get it.
This becomes ever so relevant when you go down in size. A speaker of Votic will see the concept of Canada as fundamentally different from a speaker of Dogrib. While this is difficult to put into words, slight changes in the definition do matter! This becomes evident when you go to smaller-sized place names - The fact that Den Hoorn is located in Midden-Delfland is only really part of the worldview for the Dutch speakers in the region, for a Limburgish speaker it will already only be useful to know that it is located in South Holland, while for a speaker in Singapore this will just be a village in the Netherlands (or perhaps even in Europe!).
In any case I think the indiscriminate introduction of the template across languages by a bot should be undone, but I also suspect that it might be better to not use the {{tcl}} template at all: Definitions are ever changing, both in the language and in English, and while at one point the English definition might match the one in a given language, it is possible that this English definition will be optimised in the future, leaving the definition in the other language incorrect. Thadh (talk) 11:48, 6 April 2024 (UTC)
Strongly agree that any mass adoption of {{tcl}} should have been discussed properly, and if I were to decide, I'd burn that template with fire. Not only for technical reasons, but with the valid points raised here. — SURJECTION/ T / C / L /11:50, 6 April 2024 (UTC)
The technical issue aside, I strongly am of the opinion that there are absolutely terms that are the same semantically within languages, thus removing it wholesale is not necessary. The argument comes across as ideological to me. trójkąt along with tons of other scientific terminology come to mind as being semantic matches. Vininn126 (talk) 11:55, 6 April 2024 (UTC)
What on earth is gained by using {{tcl}} to include a definition of Russia or triangle on an entry instead of just using a link and writing a short gloss? The tcl approach 1. is subject to whims by editors of the English entry, 2. wastes technical resources, 3. introduces risks of breakage with existing templates and gadgets (like the quotation issue that prompted this discussion), 4. makes it harder for editors to see where the definition comes from, 5. makes it harder for people who want to parse the Wiktionary entry data to get senses, etc. It is worse in literally every single way I can think of. — SURJECTION/ T / C / L /11:58, 6 April 2024 (UTC)
1) It provides more precise linking between terms. To your 1) this can be an upside or a downside for particular terms. To your 2) On 99% of potential matches, the difference is insignificant. 3) There's argument for every template/module for "potential breaking", let's not have any in that case. IPA modules can be changed, too, and break. 4) How? 5) For some people. It's also based on Wikidata, which I'm sure people can parse. Vininn126 (talk) 12:02, 6 April 2024 (UTC)
1 is meaningless. What is "more precise linking" between terms? 2 is never insignificant; this line of thinking is why we had memory errors for years. 4 is obvious; if you try to edit the page to fix a typo in the sense and just see a {{tcl}}, how is an editor not familiar with this template supposed to do anything? Your response to 5 I think just shows you're not familiar with this topic; forcing people to use the Wikidata API too just to display Wiktionary definitions makes no sense. If all you care about is the Wikidata lexeme ID, you can add that with {{senseid}}. {{tcl}} adds no value whatsoever. — SURJECTION/ T / C / L /12:08, 6 April 2024 (UTC)
As to memory issues - we have tons of templates that can cause these problems and it's always on shared pages. This does not stop us from including templates on pages that are not shared and are unlikely to be shared. We've had to come up with other ways to deal with memory issues for specific pages in the past. I don't see why we can't just limit its use instead of banning it entirely for this reason. Vininn126 (talk) 12:12, 6 April 2024 (UTC)
@Surjection Unless you can point to something concrete, complaints about use of resources are not helpful here. I see nothing about this template which uses anything significant, and vague hand-wringing about it is not productive. Theknightwho (talk) 12:13, 6 April 2024 (UTC)
@Surjection Correct, but it’s also a negligible cost outside of pages which do it to hundreds/thousands of other pages, and I cannot think of any examples where that would happen with this template. Theknightwho (talk) 12:17, 6 April 2024 (UTC)
We currently have 3,888 Latin-script languages. I bet you 99% of them would write the name of Samoa in the same way as English does. This is an issue in the long run (although to me the technical side is much less important than the lexicographical one). Thadh (talk) 12:22, 6 April 2024 (UTC)
@Thadh Which is a hypothetical future problem for when we have thousands of L2s on a single page.
Plus, by far the biggest cost involved is in grabbing the content of the page (which is something we can’t do anything about, since it calls back into PHP), whereas the actual parsing is relatively cheap. I know this, because I spent ages doing time profiles to see what the problem was. In your example, if they’re all calling back into PHP to get content from the same page, then that time cost won’t apply, since PHP uses a cache. Where it is a significant issue is in scraped transliterations, since every link grabs the content of new page(s). Theknightwho (talk) 12:27, 6 April 2024 (UTC)
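The caching point can be illustrated with a small Python sketch. It is only an analogy: the real mechanism is a Lua module calling back into PHP, and the class `PageStore`, its methods, and the page titles below are hypothetical. It shows why a thousand transclusions of one page cost a single expensive fetch, while links to many distinct pages each pay for their own.

```python
class PageStore:
    """Toy model of a page-content cache in front of an expensive backend."""

    def __init__(self):
        self.backend_fetches = 0  # how many times the expensive path ran
        self._cache = {}

    def _fetch_from_backend(self, title):
        # Stand-in for the costly callback that loads page content.
        self.backend_fetches += 1
        return f"<content of {title}>"

    def get_content(self, title):
        # Only the first request for a given title reaches the backend.
        if title not in self._cache:
            self._cache[title] = self._fetch_from_backend(title)
        return self._cache[title]


if __name__ == "__main__":
    store = PageStore()
    for _ in range(1000):          # many entries transcluding one page
        store.get_content("Samoa")
    print(store.backend_fetches)   # one backend fetch so far

    for i in range(5):             # each distinct page is a new fetch
        store.get_content(f"Page {i}")
    print(store.backend_fetches)
```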
One consequence that hasn't been considered: this makes the master entries, in effect, templates that can't be given template-editor protection. Any bad edits to these entries will be propagated to all of the other entries, which will make them targets once the vandals catch on. Because they're entries with cross-linguistic significance they need lots of translations, and many of the translations are added by contributors who would be kept out if the page is protected. Page protection would also affect unrelated homographs. It might be possible to create an abuse filter that protects just the sense, with the level dictated by something added to the main entry (maybe a parameter in the senseid?)- but that definitely complicates things. At the very least, {{tcl}} raises the stakes for any decisions made regarding the senses being transcluded. Chuck Entz (talk) 21:34, 6 April 2024 (UTC)
I find the point being made here rather confusing. This dictionary is written in English and maintains a neutral point of view, so it does not refer to any country as a "home country" for instance. It's true that Votic and Dogrib speakers will think differently of Canada, but if the denotations of the words both refer to the UN member state, then the definitions, as written in English, should surely be identical. Further historical or cultural nuances are presented as distinct senses or usage notes. This, that and the other (talk) 05:58, 7 April 2024 (UTC)
The reason I did this was to avoid tons of duplication of definitions. It may seem "obvious" to define Canada as a country in North America, but there are lots of cases where it's far from obvious especially if you want the categorization to work out correctly, and it was very painful to try and keep manually synchronized the definitions of a hundred different terms for e.g. Vatican City. For any country in the Middle East, for example, there are issues such as what are the limits of the Middle East and Western Asia, etc. and how do we indicate that countries are part of both? Even for a relatively innocuous area like Europe, the boundaries of "Western Europe", "Eastern Europe", etc. aren't necessarily obvious and it can be problematic if you get it wrong. For many geographic terms (e.g. Palestine, Jerusalem, Crimea, Artsakh/Nagorno-Karabakh, Macedonia, ...) just coming up with an NPOV definition is hard, and furthermore the NPOV definitions may change over time, leading again to a massive synchronization effort if {{tcl}} isn't used. The technical objections made above seem largely theoretical and speculative to me since I haven't actually seen major issues arising, and I don't at all buy User:Thadh's claim that we need to phrase the definition of a given geographic term differently depending on the language in question. In fact I would say doing so can be quite problematic from an NPOV perspective. Benwing2 (talk) 06:37, 7 April 2024 (UTC)
NPOV has absolutely nothing to do with lexicography. A certain word has a certain meaning in a certain culture, distinct from other cultures, be it a noun, adjective, or a name. That meaning is what we record here, and it is impossible to do lexicography well by only documenting the English definition of the referent, rather than the definition through the speaker's pov. This includes things like omitting specific information which is not primary to the speakers: For instance, Mäkkylä would best be defined as "A village in the Leningrad Oblast" in Russian, and as "A village in Russia" in English. That has nothing to do with NPOV, it has to do with speaker experience. Thadh (talk) 09:08, 7 April 2024 (UTC)
I am perplexed that you want the definitions of Ingrian words to be contextualised for Ingrian speakers, even though these definitions are written in English. It's likely that in the Ingrian edition of Wiktionary, definitions would be written in the way you wish. However, English is a global language. I do not agree with the idea that different languages' entries on English Wiktionary should assume different levels of geographic foreknowledge on the part of the reader. Perhaps not what we usually mean by NPOV, but it's a similar principle. This, that and the other (talk) 11:39, 8 April 2024 (UTC)
@This, that and the other: If you want our dictionary to only be used by English speakers, then I am afraid you'll see the majority of the editors leave pretty quickly. The target audience of the Ingrian entries are Russian, Finnish and Estonian speakers, which is unsurprising, as I doubt you'll ever encounter even one English speaker who has even heard of the word "Ingrian" in his life. So forgive me for not wanting to adapt my definitions to people that will absolutely never use the entries, and to have information there that absolutely no reader will ever want. Thadh (talk) 11:50, 8 April 2024 (UTC)
@Thadh When you say If you want our dictionary to only be used by English speakers, then I am afraid you'll see the majority of the editors leave pretty quickly., do you not see how this implies the precise opposite of what you're saying? If it's not solely used by speakers of one language, then it's not safe to assume the reader's knowledge or interest on that basis. Theknightwho (talk) 17:03, 8 April 2024 (UTC)
Which is why I don't do that, but rather state that we should adapt the definition on the basis of what the speakers would denote. Thadh (talk) 17:54, 8 April 2024 (UTC)
@Thadh's argument doesn't make any sense to me. Why would English speakers not care that Den Hoorn is in a certain part of the Netherlands? Do you speak for all of us? As an English speaker, I strongly oppose removing this kind of geographical information on any entry. @Surjection's argument is also irrelevant, given that anyone trying to parse Wiktionary will use the HTML output, which isn't affected, rather than the wikitext. Therefore I support applying {{tcl}} whenever possible. Ioaxxere (talk) 20:59, 7 April 2024 (UTC)
@Ioaxxere: If you want more detailed information on Den Hoorn, there is a very good website for you called Wikipedia. Luckily for you, pretty much all our English entries have a handy link (and often more than one) to related articles. We also don't include information on past inhabitants of the village and how many schools there are. There is a reason for that: We are not an encyclopedia. Thadh (talk) 21:05, 7 April 2024 (UTC)
@Thadh Forcing users to go elsewhere because you’ve assumed people don’t care about information just adds inconvenience. What you’re essentially doing is applying the Sapir-Whorf hypothesis to place names, instead of the far more obvious explanation that it’s down to geographic proximity, which changes based on the speaker’s personal circumstances.
I agree that a speaker’s conceptions will change based on the language they’re speaking, but I do not agree that that applies to place names in general (edit for clarity: I’m not talking about poetic terms like Albion, which obviously are affected). Theknightwho (talk) 21:37, 7 April 2024 (UTC)
I'm sympathetic to the idea that e.g. "(an island and city-state in Southeast Asia, located off the southernmost tip of the Malay Peninsula; a former British crown colony)" does not necessarily need to be present on every single language's definition of Singapore (e.g. சிங்கப்பூர்) ... but only inasmuch as I think just defining it as "Singapore", pointing to the English entry, and letting the English entry do the heavy lifting (including noting any associations the place has in any particular cultures) could be enough. (And since {{tcl}} syncs/transcludes this content, rather than duplicating it in a way that would fall out of sync, I think it's fine.) I don't think we can assume that the only people looking up a word in a given language are speakers from the main culture associated with the language, who live wherever the language is most commonly spoken; (native) Chinese speakers living outside China might only have as little need to know what specific sub-area of China Haikou is in as the average English-speaking American (but conversely, members of either group might want to know what specific region it was in); at the same time, they might have correspondingly more interest in exactly what part of Malaysia or the US, where they live, a nearby city (whether named in Chinese, Malaysian or English) is in. And IMO, information like (say) Mount Paektu being important to Koreans and Manchus should be noted in the English entry, not just the Korean entry. Only if a single place truly has salient/definitional cultural significance in a large number of languages would I consider not having at least a copy of all the info in the English entry, if it would balloon the English definition up too much. 
Regarding the idea that these are "unprotectable templates": if every language's word for "Singapore" just linked to the English entry with no extra details, and someone vandalized it, then people coming from all different entries would still see the vandalism if they clicked through to the English entry... and if they didn't click through, then the vandalism would go unnoticed by them, whereas if the English definition is transcluded across many pages and gets vandalized, more people are in a position to see the issue and bring it to our attention... so IMO it seems like a wash on that front. - -sche(discuss)22:22, 7 April 2024 (UTC)
@-sche: "I don't think we can assume that the only people looking up a word in a given language are speakers from the main culture associated with the language, who live wherever the language is most commonly spoken" - I assume nothing of the sort, but I am absolutely certain that the term denoted is the one that is used by "speakers from the main culture associated with the language, who live wherever the language is most commonly spoken". There is simply a difference between describing a referent and describing a term - when I as a Russian speaker say "Дортмунд(Dortmund)" I denote something different from what a German speaker would denote. The referent is the same, that is true, but the communicated information is not. Thadh (talk) 16:24, 8 April 2024 (UTC)
@Theknightwho: Good luck finding three quotes in Russian proving that the speaker encoded the knowledge that Dortmund is located in North Rhine-Westphalia into their used word, and the best of luck proving that for smaller languages. Thadh (talk) 17:57, 8 April 2024 (UTC)
Okay, great. {{tcl}} can stay for Russian. Please remove it from Ingrian, Votic, Veps, Karelian, , and while you're at it .
I don't see why this would ever be a site-wide decision anyway. Regardless of what English editors may think, why should anyone choose whether or not to use this template for languages other than that language's editors? Thadh (talk) 18:10, 8 April 2024 (UTC)
@Thadh I have no idea why (or on what basis) you think the totality of speakers of these languages are ignorant of information about major cities in Germany such as what region they're in. It's completely baffling. Theknightwho (talk) 10:58, 9 April 2024 (UTC)
@Thadh When using value-neutral terms for places, I encode as much information as the listener is able to infer from their knowledge of that place. Nothing more or less. Theknightwho (talk) 12:21, 9 April 2024 (UTC)
Also, maybe you haven't noticed, but there are currently two elderly speakers of Votic, and just some thirty of Ingrian. Would not be surprised if indeed the totality of these do not know where Dortmund is located. Thadh (talk) 12:21, 9 April 2024 (UTC)
Thadh being illogical again (→ belief perseverance). The correct perspective was outlined by the accusation of perplexity by This, that and the other. If you are a Russian, Finnish and Estonian speaker and use en.wiktionary.org with success then you are an English speaker to some degree, and assume an English speaker’s perspective. Theory of mind. He portrays the issue without a sense of proportion. The Wikipedia stuff doesn’t work either in the way Thadh suggests. I already had the problem of Malaysian place-names cited in 1978 by the Encyclopedia of Islam being unidentifiable to me, the suggestions on كَلَة(kala) being like a third of the mentioned suggestions. It is also psychiatrically interesting that Thadh illustratively expands upon the application of the Sapir-Whorf hypothesis after being called out for it, which is at this point willingly fallacious, stubborn. I mean I don’t say it is a disorder, giving lack of significance pervasiveness, allism must have even maladaptive interaction typically, we are all learning, but I must warn against such subjectivity, this is an unconstructive and dangerous personality trait. Fay Freak (talk) 18:15, 8 April 2024 (UTC)
I don't even know where to start, but I speak English and I still don't know what state any given American city is located in, even if I know English, let alone any city in any other of the hundreds of English-speaking countries and territories. Speaking a language to the degree of understanding glosses does not entail knowing anything about the culture at all. Furthermore, in this day of internet, you don't even have to know English to use our dictionary. Thadh (talk) 18:23, 8 April 2024 (UTC)
That’s why we take extra care about the place-name glosses being comprehensible to the naivest denominator. They are made exact to various granularity within the same gloss and at the same time robot-readable. I still freestyle within these limits and give entries a human touch: Ahlat (a town in Bitlis Province, Turkey; at Lake Van, 40 km northeast from Tatvan along the coastline). You don’t regularly succeed to be more intelligent than that either way. Fay Freak (talk) 18:35, 8 April 2024 (UTC)
@Thadh: The types of glossing aren’t mutually exclusive. English entries like Aksaray have to be improved, they are inexact because we didn’t know formatting, even using images as a replacement for coordinates. And {{transclude}} can have parameters for extra text like {{place}} has. Or if the link in {{place}} has an |id= then we don’t need to use {{tcl}} at all. It’s hardly the hundreds of pages WingerBot caught semi-automatically; I too was doomscrolling my watchlist in December and did not notice the theoretical offence. Fay Freak (talk) 19:52, 8 April 2024 (UTC)
I have a very specific objection to a use of {{tcl}} which I had forgotten. When I looked at Ancient Greek Ἰνδῐ́ᾱ(Indíā), I found that the definition was, "(chiefly historical, proscribed in modern use) India (a region of South Asia, traditionally delimited by the Himalayas and the Indus river; the Indian subcontinent)". What? A Classical Greek word proscribed in modern use? I then looked at the wikicode - {{tcl|grc|India|id=region}}. The problem is that the gloss is picking up {{lb|en|chiefly|historical|proscribed|_|in modern use}} from the English entry. We need some way to stop such labels being picked up.
A problem with correcting that entry is that I don't know what the Greek conception of 'India' was. But.. - while English glosses are supposed to be definitions, glosses for other languages are supposed to be translations. Perhaps I should just change the gloss of the Greek word to 'India'. --RichardW57 (talk) 19:56, 8 April 2024 (UTC)
Or Ἀλαζώνιος(Alazṓnios), anachronistically using a country name in the definition which was invented two thousand years after the attestation. On the other hand, compare Κῦρος(Kûros) using a historically appropriate gloss. Vahag (talk) 20:12, 8 April 2024 (UTC)
That still works, you can’t circumscribe everything comprehensibly in historical terms, which become more ambiguous the more you go back, less the earliest Armenian historic times, which you do not well map year by year though unlike Europe’s maps of the 1900s, but another two millennia before, you see in what I wrote in the etymology section تبریز(tabrīz). And the historical sensibility or diachronic coherence gets lost by people’s manual actions, sigh, compare my original definition of Cyprus, thus circumspectly formulated because I added the Phoenician translation for the island of Cyprus (no country nor political unity existed). There is an inherent bias towards the present natural to language itself, its preservation and transmission up to the imagined readers of our working language, without anyone being Whorfian here. All techniques have intelligibility advantages. Fay Freak (talk) 20:48, 8 April 2024 (UTC)
There is a principal difference between countries (top-level states) and geographical or cultural regions corresponding to them sometimes (through the concept of a nation, such as tied to a demonym). I again applaud Wiktionary and whoever wrote the English dictionary entry Germany which has been afforded meticulousness in this respect. If we think about it like a programmer then they are different objects, which we could fetch. Wikipedians can only complain about it not being made out in the references though, which need to be interpreted as referring to one or the other or both. The concept would simplify historic accuracy, though even ethnicities only form in distinct times (e.g. Bosnians split off Serbians in the 13th century), but then again only most intelligent people like Vahagn, Richard, and Ben would even get the idea of what we try to achieve there, which is not usually expressed well in language – something for a later generation.
Psychology fact: Tasks become easier by sequentialization and big questions like “What is Russia?” can be projected: At another occasion I point out that even in the legal realm there is the diplomatic/international law answer, the civil law one, which is the de-facto state one (a term invented by Wikipedia), the internal administrative one, then here we have the cultural one (surely went somewhere through Belarus and Ukraine until everyone became sick of being the Russian world as they bombed culture away), which is equal to the regions typically settled by Russians at the Eastern frontiers as opposed to Turks (ethnopluralism, which we define incorrectly btw, is rare, most people distinguish other people(s) by acculturation). Fay Freak (talk) 21:13, 8 April 2024 (UTC)
@RichardW57 There is in fact support for this in {{tcl}}; |nolb=+ or |nolb=1 makes it not pick up any labels, or you can give a semicolon-separated list of labels not to pick up. This needs to be documented. Benwing2 (talk) 21:17, 8 April 2024 (UTC)
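The |nolb= behaviour described above can be sketched in Python to make the three cases explicit. This is only an illustration of the semantics as stated in the comment; the function name `labels_to_transclude` is hypothetical, and the real template is implemented in a Lua module whose internals are not shown here.

```python
def labels_to_transclude(source_labels, nolb=None):
    """Return the labels a transcluding entry would pick up.

    nolb mirrors the parameter described in the thread:
      - None or "": pick up every label from the source definition
      - "+" or "1": pick up no labels at all
      - otherwise: a semicolon-separated list of labels to leave out
    (Hypothetical helper; not the actual module code.)
    """
    if nolb is None or nolb == "":
        return list(source_labels)
    if nolb in ("+", "1"):
        return []
    excluded = {name.strip() for name in nolb.split(";") if name.strip()}
    return [label for label in source_labels if label not in excluded]


if __name__ == "__main__":
    labels = ["historical", "US"]
    print(labels_to_transclude(labels))              # all labels
    print(labels_to_transclude(labels, nolb="+"))    # no labels
    print(labels_to_transclude(labels, nolb="US"))   # only "historical"
```

Making "pick up nothing" the default, as proposed later in the thread, would amount to inverting the first branch and requiring an explicit opt-in.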
I wonder how often labels should vs. should not be transcluded; I wonder if |nolb=+ should be the default. If it's not too much work, maybe someone could make a list of what labels are used on the various definitions around Wiktionary that {{tcl}} transcludes, and how often, so we could get a sense of whether most labels can or can't be expected to apply across languages (e.g., if there are any foreign places for which the US and UK or NZ, etc, use different names, any "US" label would never carry over to Czech — maybe the template even already realizes this, to not carry over English-specific labels). I'm not saying this is a good idea, if it would increase how much memory or other resources Module:languages uses for relatively small gain, but there is also the possibility of adding some kind of "long dead?" / "ceased to be spoken by native speakers in ancient times?" field (precise scope and name to be workshopped) to Module:languages or to some other module, which might need only be present only when the value is "true", from which not only could {{tcl}} know to suppress labels like "historical" or "obsolete" when a language is long-dead (since such labels are likely to be less accurate, and might be better added manually if accurate), but also "long-dead language term borrowed from modern language term" in {{der}}/{{bor}} like "Coptic terms derived from Greek" (permalink) could categorize into a "this is probably wrong, check me" category (in that case, Coptic borrowed from Ancient Greek, not modern Greek). (Again, not sure this is worth the cost, but mentioning it.) - -sche(discuss)13:33, 9 April 2024 (UTC)
@Benwing: This discussion is delving into the theoretical of whether speakers of different languages denote different things, but meanwhile the technical issue with the quotations is not resolved yet. Thadh (talk) 12:50, 9 April 2024 (UTC)
After a discussion with @Theknightwho, it seems the main problem we share with the template is the fact that cultural baggage denoted by the term is now both transcluded (it shouldn't be, as English is a separate language and has separate connotations), and afaict excluded on the individual entries. By "cultural baggage" I mean for instance things like the difference between Burma and Myanmar.
We still disagree on the amount of geographical information that should be added to the entries (I personally am of the opinion that this should be minimal), but in any case things like "Largest country in the world" does not seem to be something that should be transcluded. Similarly, labels should not be transcluded by default, I don't see how that is desirable. @Benwing. Thadh (talk) 13:25, 9 April 2024 (UTC)
Coincidentally, my earlier comment here was prompted by finding a number of instances of the invalid category Category:Burma that were only there because someone put |cat=Burma in a {{sensid}} template instead of |cat=Myanmar. While it's nice that all these entries are in synch, the fact that a Tagalog entry has all the mistakes of its English counterpart isn't. IMO we shouldn't use this on entries that are contentious or vandalism-prone in any way. Chuck Entz (talk) 14:04, 9 April 2024 (UTC)
@Thadh I think you are right about not transcluding labels by default. This is behavior inherited from the original implementation by User:Fytcha, but it violates the principle of least surprise because often the labels won't be relevant or accurate and it's not obvious to the transcluder what the labels are. I'll change this so they are only transcluded if you use |lb=1 or |lb=+. BTW you should ping me as Benwing2; I don't log into my admin account much so I won't generally see pings to that account. Benwing2 (talk) 05:17, 10 April 2024 (UTC)
@0DF: In my opinion, the fact that Bilohorivka was founded in 1720 is not dictionary material. Furthermore, the fact that Bilohorivka is located in a certain hromada in a certain raion is not in my opinion useful information for the English entry, and definitely not for most other languages; it is probably useful for the Ukrainian entry though, which is why I think such information should not be transcluded. Thadh (talk) 17:33, 21 April 2024 (UTC)
I think this illustrates why the suggestion to categorically drastically reduce how much information is present is not workable. If "the fact that Bilohorivka is located in a certain hromada in a certain raion is not in my opinion useful information for the English entry", will you have "A village in Donetsk Oblast, Ukraine." twice, and "A village in Kharkiv Oblast, Ukraine." as nine different senses of Mykhailivka alongside six "A village in Luhansk Oblast, Ukraine."s, etc? Earlier, you even suggested reducing many definitions to just "a city in " or "a city in "; if enough Chinese or Telugu or Nepali (etc) news reports have mentioned different Mykhailivkas, would you put "# A village in Ukraine." dozens of times in a row? Lots of places have this issue: Alexandrovka, Russia, Centerville, etc. The assumption that "a city in " or even "a city in " will sufficiently identify which place is meant is clearly wrong in many cases, and is not a safe assumption in general IMO, not only because multiple places may exist with the same English name, but because those places may not have the same name in other languages, given how prone languages are to having e.g. a more nativized name for one place than another place, or having a name they got from one language for one place and from another language for another place. Hence, when we know factual details of where a place like a particular Mykhailivka is, I think it's reasonable to include them. (Yes, in some cases, it will be unclear which one was meant, but that's a general problem with all words: it's also not clear which species of buttercup many uses of buttercup are about). - -sche(discuss)20:09, 21 April 2024 (UTC)
@Thadh: I just created Andriivka to demonstrate many of the points -sche made in the meantime. You'll see from that entry that there are six Andriivky in Donetsk Oblast alone, and not only are there seven Andriivky in Poltava Oblast, five of those are in Poltava Raion, meaning hromady must be specified to distinguish those five villages as separate senses. In all likelihood, every one of those settlements (I'm excluding the hromada) would have the French translation Andriïvka, the Ukrainian translation Андрі́ївка(Andríjivka), the Russian translation Андре́евка(Andréjevka), and so on, but it would be pretty unsurprising if some of the Crimean Tatar or Mariupol Greek translations varied by sense; assuming such variation, it becomes increasingly impractical to deal with them all in a single translation table. Re the occasional postmodifiers, founding and disestablishment/abandonment/destruction dates give termini inter quos for literary references to those places (as living settlements, at least) and statements about de facto control or occupation clarify by implication that the main definition is de jure; those qualifiers are universal (translingual) in their relevance. 0DF (talk) 21:29, 21 April 2024 (UTC)
@0DF: But we don't distinguish the various meanings a term can have in other languages, we distinguish the meanings a term has in the language that the entry is in. We can deal with various terms for various Andriivky using qualifiers in the translation table, as we do with countless examples in other places, because English simply doesn't make the same distinctions as some other languages do. Compare for instance buzz#Translations_2, where Finnish makes a distinction between three different subsenses which are translated differently in Finnish, but are not distinguished in English. Splitting a translation table just for that is what's impractical.
A translation of the kind "One of various villages in the Poltava Oblast of Ukraine" would be enough for an English entry. Thadh (talk) 21:38, 21 April 2024 (UTC)
@Thadh: When people with no knowledge of chemistry talk about water, are they not all referring to a substance whose molecular structure is H₂O? 0DF (talk) 21:56, 21 April 2024 (UTC)
@0DF: No, they're not. And if there were a substance with a different molecular structure that would quench thirst and be wet, colourless and tasteless, it would also be called water. In fact, anything that doesn't fail any of these parameters to the speaker's knowledge would be water. Thadh (talk) 21:59, 21 April 2024 (UTC)
@Thadh: I think we're going to have to agree to disagree on this qua principle regulating how we should compose definitions. 0DF (talk) 22:53, 21 April 2024 (UTC)
@0DF: The issue with the {{tcl}} template being applied on the entire website means we cannot "agree to disagree"! I would love to just edit the languages I edit without getting constant interference from people who don't know these languages and don't want to, but recently I find myself constantly having to deal with these issues that have never arisen before. Why should an Ingrian entry for New York include the information that it's the "largest city in the state of New York and the largest city in the United States, a metropolis extending into neighboring New Jersey" instead of just stating it's "A city in the United States"?
And this applies to other topics discussed on the Beer Parlour in recent times as well: Why should an Ingrian etymology section link to the word "Borrowed from" Russian and show the term "Inherited from" Proto-Finnic? Why would an Ingrian entry with two or three definitions have "antonyms of sense" in the antonym section as a qualifier, rather than just the sense? I understand that you personally haven't participated in all these discussions, but this is getting more and more frustrating, having to deal with these ideas which are obviously centered at English (and perhaps a couple of other large languages), but are absolutely worthless for the majority of languages. We are currently the most usable, most complete dictionary of Ingrian in the world, I think our readers can live with having to figure out that antonym sections' sense template shows senses of the entry (rather than the antonym), they have to figure out how the dictionary works anyway! Having a bunch of text to "clarify" it just makes it all much more complex and offputting for any reader who doesn't enjoy wasting time. Thadh (talk) 23:10, 21 April 2024 (UTC)
@Thadh: These are side-issues, but why don't you use {{bor}} and {{inh}} instead of {{bor+}} and {{inh+}} if you don't like the latter's preambles? And why don't you use {{ant}} instead of Antonyms sections if you don't like the wordiness of "antonyms of sense"? 0DF (talk) 00:27, 22 April 2024 (UTC)
@0DF: I do use the simple templates, but I had to fight like hell for those to even be allowed in entries. As for the inline-antonyms: I have up to three quotes per sense, all inline, and sometimes up to four nym-sections. Inline nyms are not a good idea. And I don't understand why something that has worked well for the past ten years should suddenly change. Thadh (talk) 07:11, 22 April 2024 (UTC)
{{ant}} was created in 2017, hard to call inline nyms a "sudden change". They also are generally preferred by people I ask - they say it's easier to understand what's a nym in relation to what sense. The other way works but is messier. Vininn126 (talk) 07:13, 22 April 2024 (UTC)
@Thadh: I also occasionally grumble to myself about certain changes with the thought of “did this really have to be made more difficult to understand or modify?”, but it's a very ephemeral thing when I do. On the whole, I just trust that most people are trying to improve things. Some people seem misguided in those attempts, but on the technical side of things, the main editors I'm aware of (Benwing2 and This, that and the other) are competent and thoughtful, so I just trust, when I don't understand exactly what they're doing, that their changes are sensible. I can only recommend picking your battles wisely and trying to be adaptable otherwise; resentment won't do you any good. And since I neither edit Ingrian nor enforce the use of {{tcl}}, we indeed can agree to disagree on this. 0DF (talk) 19:53, 22 April 2024 (UTC)
IMO if the consensus is that definitions of similar items in different languages should read similarly across the site (which is certainly what I believe), you should follow this e.g. for Ingrian even if you disagree with it. This particular issue doesn't seem to me like something that should be up to an individual language's editing community. Benwing2 (talk) 23:15, 21 April 2024 (UTC)
Also, maybe you haven't seen it in a while but our Chinese entries are the complete opposite of this "reading similarly", and I haven't seen anyone complain about that as much as I've seen these countless optimisations and regularisations being pushed onto smaller languages. Why you would ever think many of these changes would benefit anyone, editor and reader alike, is truly beyond me. Thadh (talk) 23:25, 21 April 2024 (UTC)
@Thadh Quite a lot of people have expressed that opinion in this thread, and I think you know very well that they edit a wide array of languages. I appreciate that you believe each language's community should have broad control over how their entries should look, but I also note that you only tend to invoke that when you're also claiming that you're the sole editor of a particular language, and that isn't really how Wiktionary consensus works. Theknightwho (talk) 00:46, 22 April 2024 (UTC)
@Theknightwho: That's not true, I invoke it all the time, I just happen to be the sole active editor of most languages I edit. But for Kashubian for instance, where I am one of many editors, I have the same ideas, yet follow the consensus of other Kashubian editors. Thadh (talk) 07:07, 22 April 2024 (UTC)
I sympathize that you don't like seeing more information than you are interested in, and re "a bunch of text to "clarify" ", I've now encased the two extra words in a <span> so you can add .antonym-clarification { display: none; } to your CSS like this and {{antsense}} will display the same way as {{sense}}. I guess someone could modify the placename template to similarly encase raions (etc) in a CSS class that users could opt out of seeing, but I think this discussion demonstrates that just because you are not interested in the information doesn't mean that no-one is interested in it. I mentioned migrant communities (and others mentioned tourists, and news media) as calling into question the idea that speakers of a language are not aware of or interested in the details of places far from where the language is most commonly spoken; you asserted that nonetheless "I am absolutely certain that the term denoted is the one that is used by 'speakers from the main culture associated with the language, who live wherever the language is most commonly spoken'", but ... it seems to me like your own idea that people in Ingria or Vietnam or wherever else are uninterested in the finer details of the location of a city somewhere far off (say, the US) — the idea that "a city in America" or even "a city in the Americas" is all they care about knowing — suggests that the few Vietnamese (etc) speakers most likely to speak about such a city, the people creating most of the uses of the Vietnamese word for that city, may be the Vietnamese immigrants who live near it and know and "encode" exactly where it is. More generally, any time we have an Ingrian or Chinese, Vietnamese, etc word for New Orleans or New York because an Ingrian or Chinese or Vietnamese author used the word for it in a text, and then someone else comes here and specifically looks that up,* I just ... don't share the belief that neither of those people will have been interested in where the place is beyond "it's the name of various places in the Americas". (*And it seems relevant to me that that's what's happening; we aren't stopping people in Asia on the street and randomly accosting them "did you know New York is the largest city in the state of New York?!": someone is coming here specifically to look for an English-language definition of what the Vietnamese (Ingrian, etc... or even English) word for New York (Bilohorivka, etc) is. In that context, I don't share the idea that they'll be so uninterested in the details of where it is that they'll be put off if we spend a few extra words offering that information to them.) - -sche (discuss) 04:47, 22 April 2024 (UTC)
@-sche: That's simply not good enough. What you're doing is acting as if everything I'm speaking of is my personal preference and an extreme minority opinion among the readers. Personalised CSS is only possible for logged-in users. The readers who will be put off from the enormous wall of text rendering their target language unreadable will not stick around long enough to create an account, find this discussion, and change their CSS - they'll simply stop using us.
As for the topic at hand: If we suddenly get an Ingrian speaker who lives in the US, but doesn't know New York and wants to know that it's the largest city of the state New York, then they can simply click the link to the English entry. Once there, if they want to know more still, they can click on the giant Wikipedia box on the right and read an encyclopedic article on the city. But the fact there may be someone who may have to go through those two clicks doesn't excuse a wall of text on the original entry.
Same thing for Andriivka: If you're interested in which Andriivky specifically are meant, you click on the giant floating box and see for yourself.
And I repeat, this is not me who has personal preferences and wants everyone to follow them. It is me advocating for the readers that I bring in. This can't be solved by personalised CSS or preferences, this is a major issue and readers will leave because of it. And when readers leave, editors do, too. Thadh (talk) 07:39, 22 April 2024 (UTC)
@Thadh: And some languages use different words for water depending on what it's fit for, such as 'drinking water'. And the usual Thai word for water, น้ำ (náam), refers to fluids in general, such as oil, and in some compounds, to the solid residue left after driving H₂O off, e.g. น้ำตาล (nám-dtaan, “sugar”). --RichardW57 (talk) 23:09, 21 April 2024 (UTC)
Copying rhyme syllable counts from existing categories
The {{rhymes}} template takes |s=, which specifies the number of syllables in the relevant pronunciation of the term (not in the rhyme itself). This allows the template to categorize the term in e.g. Category:Rhymes:English/iː/1 syllable. Many English entries lack these syllable counts but are in categories like Category:English 1-syllable words. I've written a Python script to find English terms that are in exactly one of these syllable categories and that already have a rhyme with no syllable count specified, and add the syllable count specified by the category. I'm seeking consensus to run it (under my bot account) over all "English N-syllable words" categories. As usual I will start off slow to allow me to catch bugs early. — excarnateSojourner (ta·co)01:46, 7 April 2024 (UTC)
@CitationsFreak This seems fairly innocuous to me. I think User:Surjection has already done similar runs for certain languages. Ideally this wouldn't be necessary and we'd have a pronunciation module that would automatically generate the pronunciation from a respelling along with the rhymes, but we're a long way off from that for English. The only issues I can really think of are cases where there are multiple possible pronunciations with different numbers of syllables, where the different pronunciations are tied to a specific dialect (e.g. secretary, normally 4 syllables in the US but 3 syllables in the UK), and for which the rhymes are different per dialect and thus the syllable counts need to be synchronized to the rhymes. Whether this actually happens I don't know, but you might want to generate a list of all the pages that have both multiple syllable counts and multiple rhymes, and manually review them to see if there are any of this nature (or just tell your bot not to touch them, and do them by hand). Benwing2 (talk) 07:31, 7 April 2024 (UTC)
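For readers who want to picture the proposed bot run, here is a minimal Python sketch of the core text transformation, including the caution raised above about pages with several rhymes. It is not excarnateSojourner's actual script: the function names and regular expressions are assumptions, it works on plain wikitext strings rather than through a bot framework, and it conservatively skips any page that has more than one {{rhymes}} template, more than one rhyme in the template, or an existing |s= parameter.

```python
import re

# Assumed shapes of the template call and category name; hypothetical patterns.
RHYMES_RE = re.compile(r"\{\{rhymes\|en\|([^{}]*)\}\}")
SYLLABLE_CAT_RE = re.compile(r"(?:Category:)?English (\d+)-syllable words")


def syllable_count_from_categories(categories):
    """Return N if the page is in exactly one 'English N-syllable words'
    category, otherwise None."""
    counts = set()
    for cat in categories:
        m = SYLLABLE_CAT_RE.fullmatch(cat)
        if m:
            counts.add(int(m.group(1)))
    return counts.pop() if len(counts) == 1 else None


def add_syllable_count(wikitext, categories):
    """Add |s=N to a lone {{rhymes|en|...}} call that has no syllable count.

    Returns the wikitext unchanged whenever the case is ambiguous, so that
    such pages can be reviewed by hand.
    """
    n = syllable_count_from_categories(categories)
    if n is None:
        return wikitext
    templates = list(RHYMES_RE.finditer(wikitext))
    if len(templates) != 1:
        return wikitext
    match = templates[0]
    params = match.group(1).split("|")
    positional = [p for p in params if "=" not in p]
    named = [p for p in params if "=" in p]
    if len(positional) != 1:
        return wikitext  # several rhymes: counts may differ per dialect
    if any(re.match(r"s\d*=", p) for p in named):
        return wikitext  # a syllable count is already specified
    new_call = "{{rhymes|en|" + "|".join(params + [f"s={n}"]) + "}}"
    return wikitext[:match.start()] + new_call + wikitext[match.end():]


if __name__ == "__main__":
    print(add_syllable_count("* {{rhymes|en|iː}}", ["English 1-syllable words"]))
```

Returning the input untouched in every doubtful case keeps the automated run limited to pages where the category and the rhyme can only refer to the same pronunciation.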
Pinging @-sche because you often have thoughts about things like this. Currently we have two labels in Module:labels/data/regional, UK (alias United Kingdom) and Britain (aliases British, Great Britain, Brit), which display differently but categorize identically into British Foo where Foo is currently one of the five languages Bengali, English, Urdu, Vietnamese and Chinese. Yes, I know the UK and Great Britain aren't the same, but is there really enough of a linguistic distinction to merit two labels? I am 90% sure these labels are used promiscuously, with editors more or less randomly choosing one or the other. Given this, should we merge them? Benwing2 (talk) 07:00, 7 April 2024 (UTC)
Aren’t they merged in their categorization already? Because that’s only what I could propose. Otherwise I let editors write what they want to write. To avoid frustration that is.
I have added Irish English often enough, and with some likelihood “UK” means a term relevant due to the Kingdom’s politics or federal (yikes) legal system, whereas “Britain” means that I think a term is said in the ends in Scotland too since I know it from Northern England whereas Northern Ireland I would reserve to check. Your mileage differs no doubt.
Doesn’t really matter that editors are inconsistent in theology, for, as outlined, the editing process is affective rather than based on thorough linguistic field study, lucky googling + “I feel like it” (impeccable for experienced editors, who know what the dictionary profits from, don’t get it twisted). Fay Freak (talk) 08:40, 7 April 2024 (UTC)
To me "UK" seems less ambiguous, and so preferable (except for the fact that it might sound less pleasant to the ear?). I don't live there, so maybe I'm missing something.--Urszag (talk) 08:45, 7 April 2024 (UTC)
Technically, "Britain" excludes Northern Ireland while "UK" does not, but (a) I don't think that's a very meaningful distinction, as Scotland and Northern Ireland form a far more coherent linguistical unit than Great Britain to the exclusion of Northern Ireland, and (b) "British", the adjective, doesn't exclude Northern Ireland, so if we're categorising terms as "British X" then it makes more sense to use the "UK" label. Theknightwho (talk) 13:28, 7 April 2024 (UTC)
IIRC, a main reason these exist separately is that sometimes people wanted a noun (that didn't require typing "the" in front of it every time) and sometimes people wanted a word that fit in an adjectival/attributive slot. This is partly so that in {{label}}s people can write "dated in Britain" vs "UK dialects" (instead of wrong-sounding "Britain dialects") — I dimly recall "British" may have displayed as such at some point too, instead of being aliased to "Britain" — and partly because {{label}} isn't the only place these are used, there's also e.g. UK form of foobar vs wrong-sounding British form of foobar. (This issue of needing {{standard spelling of}} et al. to display something different is also why we hackily have "British spelling" and "British form" as different labels, btw; see this June 2020 TR discussion and the other discussions linked there.) Offhand, it seems like uses of "Britain" could be folded into "UK" or, in a very few cases in labels, "the|_|UK", but I'd want us to first make sure these aren't also used in other places we haven't thought of where that wouldn't work. I am inclined to agree with the goal of merging them, because they technically refer to different terms as TKW says, but I doubt even 10% of uses intend to be conveying different things, so having the difference is basically creating inaccuracy (not to mention that they're not distinguished in categorization). 
I am reluctant to even mention this, because I fear some people will use it as a reason to keep separate labels, but: another difference they could theoretically convey if there was any chance — which there clearly isn't, looking at how they're used at the moment — of people ever maintaining this difference, is that "Britain" existed at times when the "UK" did not, so theoretically some entry for a word that went obsolete centuries ago might technically be more accurately labelled "Britain" than "UK", but such a case could (and probably better should) be tagged with the relevant more specific labels like "England, Scotland, Wales" instead. - -sche (discuss) 15:38, 7 April 2024 (UTC)
@-sche So ... I actually introduced the capability in labels of having a language-specific postprocessing function, which is currently used in Chinese so that e.g. {{lb|zh|Jilu Mandarin|Jiaoliao Mandarin|and|Jianghuai Mandarin}} displays as (Jilu, Jiaoliao and Jianghuai Mandarin) and {{lb|zh|Zhangzhou}} displays as (Zhangzhou Hokkien) but {{lb|zh|Zhangzhou|_|Hokkien}} also displays as (Zhangzhou Hokkien) rather than as (Zhangzhou Hokkien Hokkien). Such an approach, or some modification of it, maybe with label-specific settings, could potentially be used to correct the display issues you've mentioned above without actually needing separate labels. I'd just need a full description of what should be displayed in which circumstances. Benwing2 (talk) 18:22, 7 April 2024 (UTC)
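To illustrate the kind of postprocessing being described, the following Python sketch collapses a trailing word shared by every label in a list, which reproduces the "Jilu, Jiaoliao and Jianghuai Mandarin" display from the comment. It is a simplified guess at the idea only: the real label code is a Lua module, the names `collapse_shared_suffix` and `join_labels` are hypothetical, and the separate rule that avoids "Hokkien Hokkien" is not modelled.

```python
CONNECTORS = {"and", "or"}  # assumed set of joining words


def collapse_shared_suffix(labels):
    """If every label ends in the same final word, keep that word only on
    the last label (e.g. 'Jilu Mandarin', 'Jiaoliao Mandarin' -> 'Jilu',
    'Jiaoliao Mandarin')."""
    name_positions = [i for i, label in enumerate(labels) if label not in CONNECTORS]
    names = [labels[i] for i in name_positions]
    result = list(labels)
    if len(names) > 1 and all(" " in name for name in names):
        suffixes = {name.rsplit(" ", 1)[1] for name in names}
        if len(suffixes) == 1:
            for i in name_positions[:-1]:
                result[i] = labels[i].rsplit(" ", 1)[0]
    return result


def join_labels(labels):
    """Join labels with ', ' except around connectors, which take spaces."""
    out = ""
    for i, label in enumerate(labels):
        if i == 0:
            out = label
        elif label in CONNECTORS or labels[i - 1] in CONNECTORS:
            out += " " + label
        else:
            out += ", " + label
    return out


if __name__ == "__main__":
    raw = ["Jilu Mandarin", "Jiaoliao Mandarin", "and", "Jianghuai Mandarin"]
    print("(" + join_labels(collapse_shared_suffix(raw)) + ")")
```

Labels that do not all share the same final word are left untouched, so mixed lists still display in full.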
For "Britain" vs "UK", I (at least) wouldn't bother trying to have the template/module guess which one to display, because the odds of us being able to set it up to always display "Britain" in only those miscellaneous situations where that is desirable (without it accidentally displaying in situations where it is undesirable), and vice versa / mutatis mutandis for "UK", seem slim to me, and the benefits even if we do it right seem small; it seems easier to convert everything to "UK" and maybe manually add "_|the|_|" in the hopefully few places where "the UK" would be more euphonious than "UK". For "British form" vs "British spelling", though... if the "spelling of" and "form of" templates like "alternative form of", "standard spelling of" etc could know to delete the word "spelling" from the display form of "British spelling" — so that {{altspell|en|foobar|from=British spelling}} displayed "British spelling of foobar" instead of "British spelling spelling of foobar" — then I think the labels "British form", "Canadian form", etc could be reduced to aliases of the "British spelling" labels, although we should check first whether those labels ("British form", "Canadian form" etc) have come to be used anywhere else in the years since I set them up... I do spy one use in from A to Zed which needs to be changed (probably to just say {{lb|en|UK}}) before we alias "British form" to "British spelling". - -sche(discuss)19:15, 7 April 2024 (UTC)
@-sche It occurs to me there's a very simple solution to this issue, which is to provide a way of indicating that the label should display as written, e.g. {{lb|dated|in|!Britain}} which means that Britain should display as written rather than canonicalized to "UK" or whatever. It could easily be argued that this should actually be the default, but I don't know the ramifications of that. Benwing2 (talk) 04:08, 8 April 2024 (UTC)
Hmm. That might be useful in some situations which are currently handled by having different labels (with the same categorization or whatever), but the downside, especially if we allow that for all labels and their aliases, is that then (as people start to use it even in cases where it's the only label, not part of a "dated in..." or the like), lots of entries will start displaying different things, suggesting there is a difference. If some entries say "Canada" and others say "Canadian", I suspect anyone who actually notices the difference may wonder what it's trying to convey (is Canada the topic label, and Canadian the dialect label?), and if the answer is "we're not trying to convey a difference", then why do we have a difference? I should clarify that my initial comment in this discussion was not in support of these being separate, just answering what the reason people kept them separate was; I actually feel the same way about "Britain" vs "UK" as about "Canada" vs "Canadian": that anyone who actually notices the difference is liable to wonder if we're actually trying to convey that the words are used in different sub-areas of the British Isles. I don't think adding "|_|the|_" to labels that need to display "dated in..." is onerous. But especially given recent cases of pushback to template changes, maybe we should ping some more British editors to make sure they're onboard. - -sche(discuss)22:32, 11 April 2024 (UTC)
@-sche I have thought instead of making this opt-in only for certain labels, e.g. the labels data can mark that Britain should stay as such even when it's an alias of UK. There are cases where non-equivalent labels were being aliased (e.g. South Midlands as an alias of Midlands), which is problematic when the display gets changed. I have solved this so far by separating the labels entirely but this is a bit annoying to implement. Benwing2 (talk) 00:13, 12 April 2024 (UTC)
@-sche I implemented ! preceding a label to indicate that the label should be displayed as-is instead of converted to its canonical form. This is useful e.g. for yallah, which is labeled as Arab|_|!Australian so it displays as Arab Australian; otherwise it would show up as Arab Australia, which sounds wrong. Benwing2 (talk) 23:13, 16 April 2024 (UTC)
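The effect of the ! prefix can be pictured with a small Python sketch. The alias table below is hypothetical illustration data (the thread only confirms that "Australian" canonicalizes to "Australia"; treating "Britain" as an alias of "UK" reflects the proposed merge, not the current state), and the assumption that a !-prefixed label still resolves to its canonical form for categorization is mine rather than something stated in the comment.

```python
# Hypothetical alias data for illustration only.
ALIASES = {
    "Australian": "Australia",
    "Britain": "UK",  # assumes the merge discussed in this thread
}


def resolve_label(raw):
    """Return (display_text, canonical_label) for one label argument.

    A leading '!' means: show the label exactly as written, while still
    resolving it to its canonical label (assumed to drive categorization).
    """
    keep_as_written = raw.startswith("!")
    name = raw[1:] if keep_as_written else raw
    canonical = ALIASES.get(name, name)
    display = name if keep_as_written else canonical
    return display, canonical


if __name__ == "__main__":
    for arg in ["Australian", "!Australian", "!Britain"]:
        print(arg, "->", resolve_label(arg))
```

With this behaviour, a label written as Arab|_|!Australian would display "Arab Australian", whereas without the prefix the second part would be shown in its canonical form.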
@-sche Specifically referring to your point about Britain existing before the UK, I think that any terms which fall within that period (1707-1800) are better labelled as "18th century". If the country is absolutely necessary for whatever reason, it would be best to use the term "Great Britain", which is the period-equivalent to "UK" (which is what it became at the start of 1801). Theknightwho (talk) 23:32, 7 April 2024 (UTC)
Ok, so first of all, I have listed "Old Lombard" as a dialect of Lombard. However, Old Lombard was spoken in the 13th-14th centuries, a fact I got from Lombard Wikipedia, meaning it was spoken in the same time as Old Spanish and Old French, etc. So I guess we could just add a code like roa-lmoa, (the final a for antich). That Northern Irish Historian (talk) 14:40, 10 April 2024 (UTC)
Definition of a neologism for loanwords
After the Spanish occupation, Tagalists in the early 20th century (Tagalog enthusiasts and promoters, who eventually became or influenced the Filipino language committee members) were promoting the use of Tagalog for abstract academic terms, such as in science and the arts, because the terms used back then depended heavily on Spanish. They started coining new terms and intentionally borrowed (not by natural contact) terms from other Philippine languages (and also Malay), such as Cebuano batas for Spanish ley (“law”), Cebuano katarongan for Spanish justicia (“justice”), Malay bangsa for Spanish nación, and Malay guru for Spanish maestro (“teacher”). All of these made it into common use up to the current period as Tagalog batas, katarungan, bansa, and guro. Since they were initially borrowed but made it into "mainstream" use, some people would never think of these as neologisms anymore.
Now, Tagalog has the native words araw (“sun; day”) and buwan (“moon; month”), and the words can mean both the celestial object and the period of time associated with it, and are perfectly understandable with context. In contrast, Spanish has separate words for "sun" and "day", which are Spanish sol (“sun”) and Spanish día (“day”) respectively. Likewise, Spanish also has separate words for "moon" and "month", which are Spanish luna (“moon”) and Spanish mes (“month”).
With the sun/day, moon/month distinction of Spanish and with the goal of being "inclusive" to create a true "Philippine" language, there was a proposal back then as well to borrow Cebuano adlaw (“sun; day”) and Cebuano bulan (“moon; month”), but the Cebuano terms would only refer to the celestial objects sun and moon, and araw and buwan would be used for the time periods, day and month. The separation of words to refer to the time period and the celestial object did not make it into mainstream use, unlike the terms in the first paragraph, and Tagalog still uses the native terms today.
Currently, in Wiktionary, Tagalog adlaw (“sun”) and Tagalog bulan (“moon”) are listed as neologisms, since they were not used in practice or added to common dictionaries, but are still listed in some books introducing neologisms, such as Maugnaying talasalitaang pang-agham Ingles-Pilipino (literally, “Relational Scientific Vocabulary English-Filipino”), and are still discussed in enough papers that they can actually satisfy the Criteria for Inclusion.
A user thought that the neologism label for these words should be removed due to the definitions provided at Wiktionary:Neologisms which are the following:
A more precise sense of neologism which has gained some support on Wiktionary is a word that
a) is new and perceived as new, although there is no precise age cutoff for newness;
b) has not yet been recognized as part of the standard language (often being written with scare quotes);
c) is not slang, colloquial, very informal, or technical, and;
d) is not merely derived from an already-existing term with no unexpected change in meaning, such as in the case of clippings, loanwords, and abbreviations.
A neologism that becomes part of the standard language should have the "neologism" label removed. A neologism that fails to become part of the standard language after an extended period of time but is not a protologism should be labeled nonstandard.
The user has said that they can just be interpreted as regular loanwords and not as neologisms.
However, I think that by this definition adlaw and bulan are:
1. new, in the sense that the terms araw/buwan already existed but the sun/day and moon/month distinction was being introduced; they were coined intentionally, so they could arguably be protologisms;
2. not part of standard Tagalog: you are more likely to understand bulan and adlaw if you speak a regional language such as Cebuano, and the native terms are still used without the separation;
3. neither slang nor colloquial in origin, since they were introduced by intellectuals.
However, they fail the fourth criterion, (d), because they are loanwords, taken from existing words with the same definition. I would still argue that the words were not borrowed naturally (e.g. through an influx of Cebuano speakers, through interaction, or through Cebuano gaining enough political power for its language to be used by Tagalogs).
I still think we should count these as neologisms - they are loanwords, but they're not normal loanwords. If they're not neologisms, they are surely something else, since these have not come about naturally, but through a "top-down" approach in language development. — SURJECTION/ T / C / L /19:03, 10 April 2024 (UTC)
Yet some learned borrowings are actually commonly used in their target languages; this word isn't. If it weren't borrowed from another language, everyone would agree to call it a puristic neologism. — SURJECTION/ T / C / L /19:24, 10 April 2024 (UTC)
In my quest to reduce the number of modules enumerating language-specific varieties I discovered yet another one, which is Module:accent qualifier/data. This is a real mess as labels from multiple languages are all jumbled together. Since there is no language code currently associated with {{a}}, clashes are a real problem and are solved in all sorts of ad-hoc ways, e.g. inexplicably, Lahore displays as Lahori Urdu but Lahori displays as Lahori Punjabi. I think the only way to make this clean is to add a language code to {{a}}. This would make it possible to separate the per-language uses and ultimately eliminate Module:accent qualifier/data entirely in favor of the label data. It would also make it possible to have some labels categorize, if we wanted that. Thoughts/support/opposition/etc.? Benwing2 (talk) 03:01, 11 April 2024 (UTC)
@-sche I wrote a script to analyze existing uses of accent qualifiers overall and by language. See the results in User:Benwing2/analyze-accent-qualifier-20240420-dump. The good news is most accent qualifiers are used only by one or occasionally two languages, so disentangling them shouldn't be too hard. In the {{a}}-vs-{{lq}} topic I've been thinking it would be better to add a lang code to {{a}} and repurpose it as a general "non-categorizing labeler". I'm thinking it could stand for something like ancillary label or auxiliary label: "label" in that it works with the same labels as {{lb}} does, and "ancillary" in that it adds extra info to an existing something (pronunciation, synonym, derived term, etc.) that isn't the term itself (hence it doesn't categorize the term). Benwing2 (talk) 04:44, 23 April 2024 (UTC)
BTW only 265 of 4,328 distinct labels occur with more than one language; often with only two languages where the second language uses the label only once. Of these 265, 147 of the labels begin with a lowercase letter, meaning they are typically things like informal, nonstandard or misspelling rather than lects. Benwing2 (talk) 06:34, 23 April 2024 (UTC)
I mean, for my part I'm down with this, but I suspect it'll be enough of a change in people's habits that it'd be good to give people ample time to notice this and complain (😅). (Even splitting by language won't solve some of the possible sources of confusion, like "NY" being "New York" but "CA" being Canada not California, and "GA" being General American not Georgia.) - -sche(discuss)16:27, 23 April 2024 (UTC)
@-sche How much time do you think is enough? It's been 12 days so far; I'm thinking a month should be enough. Not really sure how else to ensure everyone gets their say. BTW longer-term I think we should eliminate confusing labels like CA and GA in favor of slightly longer but unambiguous ones; but that can come after adding a language code and unifying accent qualifiers with labels. Benwing2 (talk) 23:49, 23 April 2024 (UTC)
Oh, a month should be enough. Regarding replacing "GA", I'm ambivalent, I do appreciate that it matches the other main labels (RP, UK, US) which are all two letters, and I suspect people are awful used to being able to type just two letters here, too, so maybe it's fine as-is. In theory, someone who meant for it to display "Georgia" should notice that it doesn't, and it would be logical for their next step to be to try writing "Georgia" (and that would work). - -sche(discuss)22:26, 27 April 2024 (UTC)
@-sche Yeah I see your point, although in my experience a lot of people don't manage to notice when things are broken and just "assume" such-and-such-abbreviation will work. In any case any change of this sort would come some time after adding the lang code. Benwing2 (talk) 23:01, 27 April 2024 (UTC)
There appears to be consensus for this, so I'm going to move ahead with it, following these steps:
1. Fix up internal callers of Module:accent qualifier (e.g. Module:es-pronunc) to pass in a language code. AFAIK, this will not be hard, since I think all internal calls occur in lang-specific code.
2. Modify Module:accent qualifier to accept but not require a language code in |1=. If |1= looks like a language code and |2= exists, the template is assumed to have a language code. There are only 29 instances in the May 1 dump (out of 100,000+) where this goes wrong; an example is {{a|pa|pan}} in وکھانا(vikhāṇā), where pa here is a code for Standard Punjabi (and pan is a code for Indian Punjabi, which displays as India). Here "going wrong" just means the first qualifier won't display, until the bot has a chance to catch up. Also modify Module:accent qualifier to add a tracking category to all occurrences of {{a}} that don't use a language code.
3. Modify those 29 or so instances mentioned in (2) to have a lang code. I will do this shortly after the May 20 dump comes out (which is only in a couple of days), so it will likely catch all the instances of this nature.
4. Run a bot script to modify the remaining instances to have a lang code, inferred from the section it's in, or the page name in the case of Rhymes:... pages, or based on a manually curated list for Appendix pages and such. A special case is Westrobothnian cleanup pages, which have lots of instances of {{a}} on them; I'll use und for these, as the code for Westrobothnian has been deleted. The script will skip cases that would be identified as already having a lang code using the algorithm in (2), so that in case anyone modifies an occurrence of {{a}} to have a lang code while the bot is running, the code won't get double-added. This also simplifies doing multiple runs in parallel, and given the number of pages involved (around 76,384 as of the May 1 dump), I'll do several runs in parallel.
5. Rerun the bot script on pages still in the tracking category that was created in (2), to catch cases where someone added an occurrence of {{a}} without lang code while the previous script was running.
After that, it will be possible to have lang-specific accent qualifiers, and I'll be able to begin the process of merging accent qualifiers into labels.
Note that the only case where the above steps go wrong is if someone adds an instance like {{a|pa|pan}} while step (4) is running, which seems rather unlikely. Benwing2 (talk) 04:21, 19 May 2024 (UTC)
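To make the detection rule in the plan above concrete ("If |1= looks like a language code and |2= exists"), here is a minimal Python sketch. It is only an illustration under assumptions: the regular expression for what "looks like a language code" is a guess, and `split_lang` and `add_lang` are hypothetical names, not functions from Module:accent qualifier or the bot script.

```python
import re

# Assumption: a language code is 2-3 lowercase letters, optionally followed
# by one or two hyphenated lowercase parts (e.g. "en", "pa", "roa-lmo").
LANG_CODE_RE = re.compile(r"[a-z]{2,3}(-[a-z]{2,4}){0,2}")


def split_lang(args):
    """Return (lang, qualifiers); lang is None when no code is detected."""
    if len(args) >= 2 and LANG_CODE_RE.fullmatch(args[0]):
        return args[0], list(args[1:])
    return None, list(args)


def add_lang(args, section_lang):
    """Prepend section_lang unless the call already appears to have a code."""
    lang, _ = split_lang(args)
    if lang is not None:
        return list(args)  # skip: keeps repeated or parallel bot runs safe
    return [section_lang] + list(args)


print(split_lang(["en", "UK"]))   # detected as having a language code
print(split_lang(["pa", "pan"]))  # the known false positive from the plan
print(add_lang(["US"], "en"))
```

Because `add_lang` leaves calls alone once a code is detected, running it twice on the same arguments gives the same result, which is the property the plan relies on to avoid double-adding a code.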
Update: Step #1 is done. I also added support for overall |a=, |aa=, |q= and |qq=, and term-specific |aN=, |aaN=, |qN= and |qqN=, to {{IPA}}. Multiple accent qualifiers specified using |a=, |aa= or the term-specific variants can be specified by separating them with commas, without a space after the comma. The requirement for no space is so that embedded commas can occur in accent qualifiers, which they sometimes do. (There were only three existing accent qualifiers containing a comma not followed by a space in the May 1 dump, and in all of them the comma needs to be interpreted as a qualifier separator, not as an embedded comma.) On average the use of |a= saves 4 characters in a no-lang-code-for-{{a}} world, and 7-8 in a with-lang-code world; the corresponding figures for |aa= are 3 and 6-7. Out of 76,500+ pages using {{a}} or {{accent}}, in 70,500+ pages the existing {{a}}/{{accent}} could be incorporated into {{IPA}} (around 141,200 instances, i.e. a little more than 2 instances per page), and only 8,713 pages needed a lang code added to {{a}}/{{accent}}. The reasons for being able to incorporate {{a}} into {{IPA}} are varied:
use of {{enPR}} before {{IPA|en}}, which should eventually go away when {{en-IPA}} is written;
{{a}} before {{ca-IPA}}, {{gu-IPA}} or the like; conceivably, |a= could be added to these;
use of {{a}} next to {{homophone}}, {{hyphenation}} or {{rhyme}} (either these already accept |a= and |aa= like {{IPA}} now does, or should be made to);
use of {{a}} inside of the gloss for {{audio}} (a param for this should be added to {{audio}}; it's already supported in {{es-pr}}, for example);
use of {{a}} in foreign-language definitions, e.g. something like ] {{a|UK}}, ] {{a|US}}.
For the latter, I prefer writing just ]/] or similar, which I think is more concise and plenty clear. But I'm thinking of making it possible to leave the lang code blank in {{a}} and have it handle only lang-independent labels in that case (which should include English-language usage labels like UK, US, RP, GA etc. for precisely this reason).
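The comma rule for |a= and |aa= described in the update above (split on a comma only when no space follows it) can be sketched in Python as below. This is an illustration of the rule as stated, not the module's actual code; `split_qualifiers` is a hypothetical name and the sample values are made up.

```python
import re


def split_qualifiers(value: str) -> list[str]:
    """Split on commas that are not followed by a space.

    A comma followed by a space is treated as part of the qualifier text,
    so embedded commas survive.
    """
    return re.split(r",(?! )", value)


print(split_qualifiers("UK,US"))
print(split_qualifiers("Standard, formal,Lahore"))
```

The negative lookahead keeps "Standard, formal" together as one qualifier while still separating it from "Lahore".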
@RichardW57m The issue with Crimea is definitely an edge case, and there isn't even a CAT:Crimea or CAT:Kherson. We can leave disputed cases like this without any country in them; this is not a problem. The vast majority of states and provinces are not disputed, however, and as the issue with Punjab shows, there's a real problem with ambiguity when the country is not mentioned. Do you still oppose if we leave any problematic cases without an attached country? Benwing2 (talk) 22:20, 11 April 2024 (UTC)
Can't we do it the other way around? Make a list of duplicates and make it obligatory there? I'm personally fine with both approaches, but I do think in some cases explicitly stating the country may be overkill and/or potentially problematic. Thadh (talk) 22:26, 11 April 2024 (UTC)
@Thadh That is possible; personally I like stating the country because e.g. I've never heard of Gunma Prefecture or Lampung, and I think it helps users if we give more context. Can you give examples where it's problematic to state the country (other than the already-identified cases like Crimea, and a few others that come to mind, such as Abkhazia, South Ossetia and Gaza)? Note also that these problematic cases have to be handled with special-purpose code in any case because their category text identifies the country they're part of. Benwing2 (talk) 22:37, 11 April 2024 (UTC)
I was thinking of regions with strong nationalistic movements, but no majority support for independence yet, or alternatively no international support for it. Things like calling Catalonia a part of Spain might strike a nerve with some people.
Think also of regions in Myanmar that are currently not controlled by the government. These are not disputed between multiple countries, they're disputed between one recognised country and a rebel group, which makes it pretty complex, and also very susceptible to rapid changes. Thadh (talk) 00:12, 12 April 2024 (UTC)
My general impression is that Wikimedia Commons and English Wikipedia encounter similar naming issues when there are two places with the same name, and those websites deal with these naming issues in a more or less haphazard fashion, as Wiktionary does. A correct solution would include an extensive review of the policies on those websites so Wiktionary could make an intelligently conceived standard policy. What that solution would be, I cannot say. --Geographyinitiative (talk) 22:45, 11 April 2024 (UTC)
Когда они вышли, карета Вронскихъ уже отъѣхала. Входившіе люди все еще переговаривались о томъ, что случилось.
Kogdá oní výšli, karéta Vronskix užé otʺjéxala. Vxodívšije ljúdi vsjo ješčó peregovárivalisʹ o tom, što slučílosʹ.
When they went out the Vronskys' carriage had already driven away. People coming in were still talking of what happened.
The differences are basically:
1. |text= the original text from a modern paper book edition |t= English translation
2. |text= the text from a modern paper book edition, modified to add stress accents and "ё" letters |t= English translation
3. |text= the text from a modern paper book edition, modified to add stress accents, "ё" letters and wikilinks for all words |t= English translation
4. |text= the original text from a modern paper book edition |tr= romanized transcription with the added stress accents and "jo" where appropriate |t= English translation
5. |text= the original text from a modern paper book edition |norm= normalization of the Cyrillic text with the added accents and "ё" letters |t= English translation
6. |text= the original text from a pre-reform paper book edition |tr= romanized transcription with the added accents and "jo" where appropriate |t= English translation
7. |text= the original text from a pre-reform paper book edition |norm= normalization of the Cyrillic text to modern orthography with the added accents and "ё" letters |t= English translation
Right now variant 3 is used in Wiktionary, with some assistance from what is presumably @Benwing2's bot correcting the quotations (e.g. this diff or this diff). But the conversion to modern orthography and the addition of stress accents and ё letters could alternatively be done automatically by a Lua module, which would allow implementing variants 4, 5, 6 or 7, with the extra benefit of instant feedback to the human editing a Wiktionary article. See the technical discussion here and a working demo of such automatic conversion at Module:User:Ssvb/ru-autoaccent/testcases. The automatic conversion can also be amended via |subst= overrides for the parts of the text that the automatic converter can't handle on its own due to ambiguity, such as уже́ (užé) vs. у́же (úže) or все (vse) vs. всё (vsjo). It's also possible to override the whole sentence via the |norm= parameter.
PS. I also noticed that the 1903 edition of "Анна Каренина" used the word "входившіе" ("coming inside") and the 1970 edition changed it to "выходившие" ("coming outside"). This just shows that we can't always fully trust modern editions of books, so preserving the original pre-reform orthography from old editions may be useful in quotations.
@Ssvb Hi, I've meant to respond earlier. In terms of the above variants, I would definitely be opposed to variants 4 and 6 where the stress is included only in the transliteration. If we take the approach of including the original unaccented, un-ё'd, unlinked text in the |text= param, we should use the |norm= param to include accents, ё's and links. Also I've been thinking there are ways in |subst= of avoiding having to repeat the unaltered text in most circumstances; e.g. if there's only one уже in the text, writing уже́ by itself instead of уже/уже́ should be enough. I've already taken this approach elsewhere in the Czech, Portuguese and Catalan pronunciation modules. I should also add, there are several edge cases your auto-accenting code really needs to handle properly; I had hoped by linking to my offline script you would glean those edge cases from the script, but you mostly dismissed them as bells and whistles (some are, some aren't). For example, in cases like до́ смерти(dó smerti), there's a multisyllabic word that must remain unstressed because of the preceding stressed preposition, whereas a naive approach would stress it. Benwing2 (talk) 07:01, 11 April 2024 (UTC)
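The |subst= shorthand suggested above (writing only the accented form when the unaccented word occurs once) could work roughly like this Python sketch. Everything here is an assumption for illustration: the comma-separated `from/to` format, the `strip_marks` and `apply_subst` names, and the decision to raise an error on ambiguity are not taken from any existing module.

```python
import re

ACUTE = "\u0301"  # combining acute accent used as the stress mark


def strip_marks(word: str) -> str:
    """Remove stress marks and turn ё/Ё back into е/Е."""
    return (word.replace(ACUTE, "")
                .replace("\u0451", "\u0435")
                .replace("\u0401", "\u0415"))


def apply_subst(text: str, subst: str) -> str:
    """Apply comma-separated overrides of the form 'from/to' or just 'to'."""
    for item in subst.split(","):
        if "/" in item:
            src, dst = item.split("/", 1)
        else:
            # Shorthand: derive the unaccented source from the accented form.
            src, dst = strip_marks(item), item
        pattern = re.compile(r"(?<!\w)" + re.escape(src) + r"(?!\w)")
        if len(pattern.findall(text)) != 1:
            raise ValueError(f"{src!r} must occur exactly once in the text")
        text = pattern.sub(lambda m: dst, text)
    return text


print(apply_subst("карета уже отъехала", "уже\u0301"))
```

Requiring exactly one whole-word match is what makes the shorthand safe: if the word appeared twice, the editor would have to fall back to a more explicit override.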
@Benwing2: I haven't dismissed your offline script. I only mentioned that some of its functionality is clearly out of scope of my module. For example, the creation of wikilinks for the lemma forms of words. I don't think that doing lemmatization is practically feasible inside of a Lua module due to a much higher resources usage required for that. Additionally, I initially intended to implement auto-accenting for Belarusian quotations. And the code for processing the Russian "ё" from your offline script wasn't applicable to my use case (the dots above "ё" are mandatory in Belarusian texts and can't be omitted). I decided to put aside my Belarusian module plans and started implementing the auto-accenting code for the Russian language precisely because of your feedback. I believe that it would be easier to reach consensus and avoid friction when we are on the same page.
I like your idea about making the usage of |subst= simpler. Also thanks for your до́ смерти example. I have added it to the list of testcases and will update the code to handle it properly. BTW, my auto-accent code doesn't edit wiki pages, so it has no potential to do long lasting difficult to reverse damage. Of course, problems in it preferably should be fixed now, but there's no harm in initially deploying it even with a few minor bugs. The growth of the Wiktionary backup dump is the only thing that worries me, because the module data size is ~6MB right now. And each tiny adjustment of the dictionary regenerates it all for now, but it's possible to come up with an incremental updates scheme.
Do the curators of the Russian section of English Wiktionary have an opinion about making |text= faithful to the orthography of the original source and moving the accent markup to |norm=? --Ssvb (talk) 11:09, 11 April 2024 (UTC)
@Benwing2, @Atitarev: Could you please comment on this? I believe that, feature wise, I have finished the conversion functionality and I'm not aware of any remaining bugs in the algorithm or in the approach in general. But it performs a dictionary assisted conversion, so any errors or omissions in the existing Wiktionary entries snapshotted at https://dumps.wikimedia.org/enwiktionary/20240401/ will show up in the generated output. For example, the two testcases at the top of Module:User:Ssvb/ru-autoaccent/testcases rely on the existence of "антидилювиа́льный" and "за́ руку" entries. If somebody creates these entries right now, then the dictionary can be regenerated after the 20th of April upon the arrival of the next Wikimedia's backup dump. I can still tidy up the code to make it better commented, cleaner and faster, but now it's necessary to figure out what are the requirements and roadmap for integrating it into mainspace. --Ssvb (talk) 07:15, 13 April 2024 (UTC)
BTW, I wonder if the converter should keep the original spelling "антидилювіальныя" with "і" in its output or somehow visually highlight the word in other ways? My understanding is that it's the responsibility of a Wiktionary editor to provide the necessary |subst= crutches when adding a quotation. Automatic conversion can successfully and accurately do the bulk of the work, but still a human has to review the result and be ready to step in to add the necessary corrections. --Ssvb (talk) 07:41, 13 April 2024 (UTC)
@Ssvb Apologies for the delay in responding and apologies also for my slightly snippy comment about your earlier response. I need to look over your code in more detail, which I can probably do tomorrow (it's bed time for me now), but it feels to me like we need to resolve the issue of how to format quotations, transliterations and the like. I don't like the idea of having *only* the transliteration contain the accents; this is not how it's normally done here. Either the original or normalized version should contain the accents, too. This might need to differ between usexes (where it's probably OK to auto-accent the original) and quotes (where it's still to be resolved how to proceed). Also, just a note, any auto accenting you implement should err *STRONGLY* on the side of not putting in an accent if there's any doubt; better to have no accent than a wrong one. This goes especially for something that happens on the fly, because any review that the author does could become invalid due to a subsequent updating of the underlying data. Benwing2 (talk) 08:00, 13 April 2024 (UTC)
@Benwing2: I think that the Wiktionary quotation entries can have pretty detailed information in them and that the original unmodified spelling with all its archaic words and typos is a valuable piece of information, so it should be one of the template data fields. But the way how the information is presented to the end user in the browser is another matter. Having the original text, its automatic or manual normalization, the romanized transliteration and the English translation already makes it four lines of text in the browser instead of the current three lines. And things may look even more awkward if it's a multi-line poetry quotation. However, at the end of the day, the format of the presentation can be probably configurable on the frontend side and based on the end user's preferences. I mean, the end user may prefer to only see the modern normalized Cyrillic text and hide the original unaccented pre-1918 orthography to reduce the on-screen clutter, but this doesn't mean that we have to erase the original non-tampered quotations from the Template:quote-book instances themselves in the Russian entries.
As for erring *STRONGLY* on the side of not putting in an accent if there's any doubt, I think that with this approach none of the words can be safely accented. Because we can't rule out the possibility that the same word with a different stress position may be eventually added to the dictionary in the future. It's only possible to safeguard against this by not adding any accents at all. And the same goes for the prepositions "из", "за", "до", "у", "от", "со", "без", "по", "на" and many others. If we end up deciding not to accent any words adjacent to these prepositions at all, because of the risk of them being potentially a part of предложно-именные сочетания, then we would be doing a disservice to the users. I think that we should instead prioritize adding the missing dictionary entries to provide a good coverage for all of these cases. But maybe @Atitarev has an opinion on this?
As for the generation on the fly and future updates of the underlying data. I think that the module can probably automatically categorize quotations if it has troubles accenting something. I mean, let's suppose that Wiktionary initially had no entry for "за́мок" and the auto-accent module happily annotated this word as "замо́к" in one of the quotations. Now suppose that "за́мок" got eventually added to Wiktionary and the auto-accent module detected this inconsistency in a quotation. What's the best course of action? I think that just dropping the accent mark for "замок" and adding the page to some sort of a "need review" category would be a reasonable thing to do. --Ssvb (talk) 09:32, 13 April 2024 (UTC)
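The behaviour proposed here (accent a word only when the dictionary snapshot knows exactly one stressed form, otherwise leave it bare and flag the quotation for review) can be sketched in Python as follows. This is a toy model, not the code of Module:User:Ssvb/ru-autoaccent: the `FORMS` table stands in for the generated dictionary, `auto_accent` is a hypothetical name, and capitalization is ignored.

```python
import re

# Hypothetical dictionary snapshot: unaccented word -> known accented forms.
FORMS = {
    "карета": {"каре\u0301та"},
    "замок": {"за\u0301мок", "замо\u0301к"},
}


def auto_accent(text, forms):
    """Return (accented_text, needs_review)."""
    needs_review = False

    def repl(match):
        nonlocal needs_review
        word = match.group(0)
        candidates = forms.get(word, set())
        if len(candidates) == 1:
            return next(iter(candidates))  # unambiguous: add the accent
        if len(candidates) > 1:
            needs_review = True  # ambiguous: leave bare and flag for review
        return word

    return re.sub(r"\w+", repl, text), needs_review


accented, review = auto_accent("карета замок", FORMS)
print(accented, review)
```

If a second stressed form is later added to the snapshot, the same quotation simply loses its accent on that word and gains the review flag, rather than keeping a possibly wrong accent.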
@Ssvb Apologies, I missed this. There are lots of online Russian dictionaries but I don't know of any where you can download all the data. Doesn't mean it doesn't exist though. You might ping User:Cinemantique on ruwikt, they seem to be still somewhat active there and may well know. Benwing2 (talk) 03:11, 19 April 2024 (UTC)
BTW, instead of my kinda lame and seemingly artificial "за́мок" / "замо́к" example, here's something more practical: the current module testcases include the word Ока́(Oká), which is accented because it isn't ambiguous. But what if somebody adds О́ко Сауро́на(Óko Sauróna) and its genitive form О́ка Сауро́на(Óka Sauróna) to Wiktionary in the future? Using something like "То есть по крайней мере, помимо облика Ока, Саурон имел и физический облик человека" as a quotation? This would invalidate the automatically generated on-the-fly accenting in the old quotations of the word "Ока". And we need to be prepared for the situations like this. --Ssvb (talk) 11:54, 13 April 2024 (UTC)
@Ssvb: Of the modern text versions (1 to 5), I like 1 (text, transliteration in the strict sense, and translation) best, then 5 (accented Cyrillic as normalisation, which is then transliterated), but can stomach 4, where the transliteration includes the stress. 6 and 7 are a different text - the difference from the first five matters if one is demonstrating spelling, but as a quotation for the emboldened word they are equivalent. I think Wiktionary policy would actually call for 6 or 7, as being the earliest issue available to the editor. I don't think it makes any difference for establishing the latest date of use of the emboldened word. --RichardW57m (talk) 11:54, 11 April 2024 (UTC)
Aside: I think the examples have abused |newversion=, and the presentation of the dates seems dodgy. I wonder if I may do that when millennia have elapsed between composition and the printed version quoted, with the 2nd version's fields referring to a translation not eligible for supporting the words used in it. I don't like the style of the wikitext - it's hard to extract pieces of data from it. --RichardW57m (talk) 12:08, 11 April 2024 (UTC)
@RichardW57m: The |newversion abuse was suggested here by @Sgconlaw. The usefulness of it is at least twofold:
We get an English translation, written by a native English speaker, who was also a professional translator. And by contrast, if I translate the Russian text into English myself, then I risk accidentally constructing something ungrammatical. Currently Template:quote-book doesn't support something like a |t-check parameter.
It attests that Constance Garnett considered that particular English word to be an appropriate translation for that particular Russian word back in 1919.
As for the dodgy dates, WT:QUOTE says "The year should be that of the earliest edition known to use the word. Where feasible, the page number should be taken from the first edition, but if a later edition is used (e.g. paperback version, or digitised by Google Books), then the publication date should be added in parentheses after the publisher’s name."
My understanding is that "Anna Karenina" novel was initially published from 1875 to 1877 in the literary journal "The Russian Messenger". And then there were multiple book editions printed after that. Russian Wikisource uses the book from 1903 for its pre-reform orthography text and the book from 1970 for the modern orthography text. Now which years should be referenced in the quote-book template? I wish there were a simple non-ambiguous and easy to follow guideline related to adding Russian quotations in WT:ARU. People would have a lot less headache. --Ssvb (talk) 15:33, 11 April 2024 (UTC)
@Ssvb: I don't dispute that the abuse is very useful. I've asked the same question myself, and had not got a satisfying answer. I've had the same problem using translations that are sometimes CC-BY-SA, so it's a legal obligation as well as courtesy to acknowledge the translator, which latter applies if the translation has been released into the public domain by the translator. (Sometimes I've felt obliged to use a more literal translation.) I'd been resorting to putting the acknowledgement in the |footnote= field. I've felt particularly cribbed when the spelling of an ancient text seems very much 20th century or later and I'm having to quote it from a scan published in yet another document. So I looked at the template invocations to see if I could learn a new trick. --RichardW57m (talk) 16:04, 11 April 2024 (UTC)
@Ssvb: Hi. My preference is #3 with accents in the source language. Some people dislike linking each word. I think it's helpful for learners or someone who wants to analyse the text. The number of links could be reduced to e.g. difficult or rare words, and/or links could be removed for proper nouns, company/product names or words which are NOT supposed to be linked (i.e. to have entries).
IMO, providing accents on unaccented words is incorrect, e.g. #4. It was the original old practice, which has been eradicated over time. Let's not reintroduce it. :)
We already have an established practice to accentuate each Russian word and supply the letter "ё" wherever a word appears. I belong to the group who wants to keep it that way, including quotes (and extend the good practice to other languages where appropriate). I go out of my way to provide accents for Cyrillic-based Slavic languages, vocalisations for Arabic, and more recently Persian and Urdu, and nuqta spellings for Hindi terms.
Errors like you mentioned re "входившіе" are uncommon. Editors may prefer to quote one or the other or both.
These changes sometimes deal with not just the spelling but the grammar. онѣ́(oně́) -> оне́(oné) -> они́(oní). The entry for оне́(oné) shows where the original Pushkin pronunciation was preserved for the sake of rhyming. Anatoli T.(обсудить/вклад)03:04, 15 April 2024 (UTC)
I'm not a fan of any of these. It's better to cite the first edition if possible, and in this case the first edition is available online. The quotation should be faithful to the source—Tolstoy wrote in the 1800s, not in 2024, and the quotation should reflect that. See Middle English pyteuous for an extreme example of this. That said, the normalization should be included if the original quotation is hard to understand (which might be the case here). Also, Anna Karenina was written 1873–1877 and published 1875–1877 as well as 1878 (per Wikipedia), so I think that you having |year=1877 is inaccurate. I propose:
1873–1877, Л Н Толстой, Анна Каренина, first volume, Москва: Типографія Т. Рисъ, published 1878, page 103; English translation from Constance Garnett, transl., Anna Karenina: A Novel, Philadelphia, P.A.: George W. Jacobs & Company, 1919, page 86:
Когда они вышли, карета Вронскихъ уже отъѣхала. Входившіе люди все еще переговаривались о томъ, что случилось.
Kogdá oní výšli, karéta Vronskix užé otʺjéxala. Vxodívšije ljúdi vsjo ješčó peregovárivalisʹ o tom, što slučílosʹ.
When they went out the Vronskys' carriage had already driven away. People coming in were still talking of what happened.
@Ioaxxere: I wrote my previous comment right before I noticed that there was your reply already added. Thanks for all this additional information. I'm primarily interested in integrating the automatic text normalization auto-accenting Lua module, so focusing on the boilerplate with dates unfortunately may sidetrack the discussion a bit. That said, you raised a good question about whether the original quotation in the pre-reform orthography is easy or hard to understand for the intended Wiktionary users. I hope that somebody can answer it. --Ssvb (talk) 15:56, 11 April 2024 (UTC)
New label: ephemeral
There are many cases where a term briefly becomes extremely popular, and then fades into obscurity. Labelling these (dated) or (obsolete) doesn't feel right. Therefore, I propose the label (ephemeral) for this purpose along with an associated category. Some ephemeral terms in English:
Note that "ephemeral" only refers to time, not usage, so "ephemeral" terms can be slang, informal, or formal. Ioaxxere (talk) 01:06, 12 April 2024 (UTC)
At first glance, the concept seems questionable. How would one characterize the difference between 'dated' and 'ephemeral' in such a way that users would care? How brief a period of 'popularity' (itself problematic) would we require for a definition to be deemed 'ephemeral'? Less than a decade? Would we need to define a class of curves of usage frequency over time? Moreover, as a practical matter, even with satisfactory definitions and criteria, systematic application based on objective facts seems very unlikely. DCDuring (talk) 14:18, 12 April 2024 (UTC)
What happens to such words after twenty more years of successful curation here on Wiktionary? After a certain period of time, the use of any word no longer regularly used should just be considered "dated", or even "obsolete". Therefore, if you are already making the judgment that a word is "ephemeral", it has presumably already lost its regular usage and should probably also just be considered "dated" from the point at which the judgment has been made, for the foreseeable future of Wiktionary, or until such time that the word were to come back into vogue.
If a word is inextricably linked to a particular event whence its ephemerality derives, then that event and its ephemeral real nature should probably just be mentioned in its definition, rather than creating a new label that really just means to say "no longer used". It would also be difficult to determine what the requisite maximum "lifespan" of a word should be to be given this label.
It's worth noting that a word can be both "dated" and "ephemeral", however. I'm reminded of both inkhorn terms from 16th and 17th century English: Latinate neologisms brought into English and used for a short period of time—as well as several reactionary neologisms of more Anglo-Saxon stock, created or resuscitated from the depths of time in response, such as inwit and gleeman.
@Hermes Thrice Great: I don't like to use "dated" or "obsolete" to describe ephemeral terms, since those labels generally describe words which were in common use for a long time, maybe even centuries, before gradually falling out of use. Ioaxxere (talk) 17:20, 12 April 2024 (UTC)
Oh? I wasn't aware. I've put "dated" on old computer terminology that only existed for a decade or two. Equinox◑17:24, 12 April 2024 (UTC)
@Ioaxxere: For me, "dated" mostly refers to terms associated with a previous generation- using them "dates" you. Such terms were usually only popular during that generation. We've also used it in the sense you're referring to, but that's not the primary meaning. Chuck Entz (talk) 17:48, 12 April 2024 (UTC)
@Equinox, Chuck Entz: I think you two are agreeing with me. I consider "a decade or two", or a generation (~20 years), a relatively long time in comparison with the "ephemeral" terms listed above. Ioaxxere (talk) 18:09, 12 April 2024 (UTC)
I'm trying to think if I could use this. There was a city in Hubei named Jingsha for a little over two years between 1994 and 1996, that might be ephemeral. Also, I'm thinking that some Cultural Revolution-connected geographical terms may be ephemeral in this sense, but reach three durably archived cites. --Geographyinitiative (talk) 15:32, 13 April 2024 (UTC)
Popular for a relatively brief period of time (about a decade or less) before rapidly fading into obscurity. Thus, there is no period of time where an ephemeral term ever existed in the standard language. In general, a term might be identified as ephemeral if its usage trend graph is shaped like a spike. If a term is labelled as ephemeral, the entry should state the approximate time range in which the term was in use.
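The "spike" criterion in the proposed definition above could be checked roughly as follows. This is only a sketch under stated assumptions: the 90% share threshold, the function names and the yearly counts are all invented for illustration, and nothing like this exists on Wiktionary as far as this thread shows.

```python
from typing import Dict, Optional, Tuple


def shortest_window(counts: Dict[int, int], share: float = 0.9) -> Optional[Tuple[int, int]]:
    """Return the shortest (start_year, end_year) span holding `share` of all uses.

    `counts` maps a year to the number of attested uses in that year.
    The 0.9 default is an assumed threshold, not an agreed criterion.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    years = sorted(counts)
    best: Optional[Tuple[int, int]] = None
    for i, start in enumerate(years):
        running = 0
        for end in years[i:]:
            running += counts[end]
            if running >= share * total:
                # Keep this window only if it is strictly shorter than the best so far.
                if best is None or end - start < best[1] - best[0]:
                    best = (start, end)
                break
    return best


def looks_ephemeral(counts: Dict[int, int], max_span_years: int = 10, share: float = 0.9) -> bool:
    """True if most uses fall inside a window of about a decade or less."""
    window = shortest_window(counts, share)
    return window is not None and window[1] - window[0] + 1 <= max_span_years


# Invented data: a term that spikes and fades, and one in steady use for 50 years.
spike = {2019: 1, 2020: 500, 2021: 300, 2022: 20, 2023: 2}
steady = {year: 100 for year in range(1950, 2000)}
```

A check like this also yields the approximate time range that the definition asks entries to state, since `shortest_window` returns the years in which the term was mainly in use. It cannot tell a faded term from a neologism that is still rising, which is the concern raised about covidiot and bookshelf wealth.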
Still not sure this is useful. Of the three terms given above, one (bookshelf wealth) is a neologism still in use, and covidiot is too new to say for sure that it is passé (what if COVID keeps recurring?). I'd need to see several more words that clearly fit the category in order for me to judge it useful. My general concern is that if we keep adding usage terms with different definitions, people will become hopelessly confused as to which one to use and all of them will get diluted (similarly to having uncommon, rare, very rare, etc.). Benwing2 (talk) 20:42, 25 April 2024 (UTC)
We have 637 terms in Category:African-American Vernacular English and 13 in Category:African-American English. However, I can attest (being married to a Black woman who doesn't speak AAVE) that the vast majority of these terms are not restricted to AAVE. About the only ones I can think of that come to mind as really AAVE-only are terms like aks/axe/ax "ask" and fitna (= fixing to "going to") that are proscribed in Standard English, along with pronunciation spellings like smoove, mouf, foun' and skraight that represent AAVE-specific pronunciation features. (I should also note that we have uncharacterized entries for weird pronunciation spellings like debbil that sound to me like something out of Uncle Remus; these need to be marked as archaic or obsolete or something.) This suggests we either need to move the large majority of these terms under Category:African-American English or just merge the two (since I'm doubtful the average Wiktionarian will be able to keep them correctly categorized). Benwing2 (talk) 04:53, 12 April 2024 (UTC)
Well yeah, to the category African-American English while still making the distinction in the entry if an editor intends to, so I had merged nouchi with Ivorian French from the beginning, because the antinomy only arises if you look upon all terms collectively from bird's-eye view, via the category system.
Many of these terms raise our attention as slang—as in the Ivorian example, where you would think the language is the official language of a nation and hence there would be an even greater share of non-vernacular terms—, so I doubt non-proscribed ones are the vast majority, from what we even have documented. For either comprehensibility or markedness, most terms may be proscribed in use with the general American population, you probably proscribe less if it is your wife—it depends which social circle you take as a point of reference? If it’s a white conservative one then it is every single one in “Standard English”? The other terms are often just intransparent as to their origin from African-American speech (dig) or particular meaning (Talk:tired).
But that may be not what you are trying to say, you think about Standard English as from an Afro-American perspective (which you can very well assume), if that makes sense, but as you realize the average Wiktionarian cannot consistently make sense of it, though he has enough charity to get your idea, never setting foot in America but consuming her media, like you have not been to Abidjan to accurately judge registers (this is just a statistical probability). Fay Freak (talk) 08:45, 12 April 2024 (UTC)
@Benwing2: What does "AAVE-only" mean? "Fitna" and "skraight" are the only ones that would be out of place in non-standard English dialogue out of the mouth of an Englishman of native origin. And as /skr/ is mostly of Danish origin in English, I wouldn't be surprised to find that the scream-stream merger came from the English West Country. --RichardW57m (talk) 09:55, 12 April 2024 (UTC)
I think typing shortcuts could help with future labelling and categorization. At present:
The situation is, has been, and probably will be dynamic. Labels are likely to change and their application even more so. Empirical support for the application of the labels is likely to be scarce and soon (one or two decades) become dated or even deemed insulting. I don't think we can count on many contributors to make systematic revisions of fine categories under these circumstances. So labels that are not controversial and of broad application are probably best. "African-American English" would seem to fit the bill. Also, we do have register labels to add finer descriptions and for searches using Cirrus Search. DCDuring (talk) 13:04, 12 April 2024 (UTC)
Absolutely agree this is an issue. It's come up before, e.g. 2020 discussion here, linking to earlier discussions. Not only on Wiktionary but in the world people sometimes refer to the vocabulary of Black Americans in general, or "Black rappers' slang", "Black Twitter" etc as "AAVE", maybe due to not knowing what else to call it, and even in linguistics literature many uses of "African American English" use it as a synonym of "AAVE", so even that label doesn't unambiguously distinguish anything. Many uses of "MLE" on Wiktionary seem to similarly just mean "words Black rappers in Britain use" rather than specifically MLE. Sometimes white people have labelled entries "MLE, AAVE", just meaning Black rappers on both sides of the Atlantic use it, not that it manages to belong and be limited to the specific lect of AAVE and also the specific lect of MLE. I do feel some reluctance to do away with (merge into broader labels) specific AAVE or MLE labels, because there are a few things in those categories which belong to those specific lects, but it'll continue to be a constant, effort-intensive (and if the current state of the categories is anything to go on, losing) task to keep checking the categories' contents for wrong entries, because it's clear people both on Wiktionary and in the world at large don't distinguish "AAVE" from "words Black Americans use". Yes, Remus-esque stuff like debbil or ebery needs its own category... as I said in 2020, I wouldn't even put them in whatever "African-American English" category we put the rest in (even with an "obsolete" label), because it's not clear they were mainly or ever used by Black people, as opposed to just by white people caricaturing Black people. Maybe create a label for use in {{pronunciation spelling of|en|foo|from=bar}}, to display something like "19th century white caricatures of Black speech" or something... - -sche(discuss)14:49, 12 April 2024 (UTC)
"African-American Vernacular English" and "African-American English" are synonyms, so the categories should be merged. Ioaxxere (talk) 16:22, 12 April 2024 (UTC)
They're sometimes synonyms. But they're also sometimes not synonyms, sometimes African-American English includes AAVE but also other African-American sociolects of English. Most of what we currently have labelled "AAVE" is actually not AAVE, just AAE, as Benwing points out. But given the occasional synonymy and the widespread inability to consistently distinguish them, maybe we are better off moving all of this to a new label like "Black American English". - -sche(discuss)17:51, 12 April 2024 (UTC)
Because any difference to Canadian Black English is also a black box? Has a parallel advantage if we introduce “Black British English”, which may or may not include Ireland, but we already have Category:Multicultural Toronto English, which would be another Black American English, and that like MLE only in the last three decades, so behind it is a substratum of Black English like in the US, not a recent “multiethnolect”, a substratum Britain did not have. Strange that the term Canadian Black English is a hapax. There is some truth to even linguistic literature being at a loss. Fay Freak (talk) 21:31, 12 April 2024 (UTC)
@Ioaxxere: Isn’t that what I was saying? (Though I have doubts they will stay distinct later into the century.) It would be strange if we rename MLE to British Black English but keep “MTE” modelled after the MLE term, and also if we take advantage of the term “Black American English” but not “Black British English”, we are in a pickle.
My understanding is that the demographic surge of emancipated slaves and their descendants leaving the US South combined with barriers set up to keep them out of the mainstream created a distinctive type of Southern-based speech that almost swamped out the other varieties. Those other varieties aren't really part of AAVE, nor is that of those who made it into the mainstream (not to mention more recent immigrants from the Caribbean and from the rest of the world). The problem is that this complexity is almost invisible to those of us from other parts of US society, so it's too easy to conflate everything else with AAVE. Chuck Entz (talk) 23:08, 12 April 2024 (UTC)
We don't have a good factual basis for maintaining relatively fine distinctions, so a broad label that is subject to criticism, but defensible, is probably better than narrow ones, which are also subject to criticism. Most criticism doesn't seem to be based on systematic facts, just anecdotes and idiolects. DCDuring (talk) 15:01, 13 April 2024 (UTC)
Regardless of what we do with AAE vs AAVE, I suggest adding a label (particularly for {{pronunciation spelling of}} et al.) for debbil, gibsmedat et al.: "white caricatures of Black speech"? "racist caricatures of Black speech"? Does anyone have a better name? For AAVE vs AAE, I continue to be of two minds. On one hand, it might be useful to categorize the few things which are AAVE as AAVE. OTOH, many people mislabel just any words which Black people use or are caricatured as using as "AAVE", when they're actually AAE or 19th century racist caricatures, so we have a perennial maintenance task on our hands. But... is it right to give up precision because certain people (ignorantly or intentionally) misuse and muddy terms? Those people will (and do: look at the edit history of gibsmedat) also mislabel caricatures as real speech, so we have to monitor and clean up the categories anyway... so maybe we should keep "AAVE" and "AAE" distinct and just clean up entries in AAVE that belong in AAE? OTOH, we don't seem to get that fine-grained with anything else, so maybe the easier solution (merging AAVE into AAE, or merging both into "Black American English") would also fit how we also don't overly-finely distinguish "Southern American English" dialectal forms from words that speakers from Southern states use : they both go in Category:Southern US English. (And likewise for other categories, e.g. "gay slang" is a mishmash of everything from bathhouse to modern slang, and "transgender slang" is a disparate mix of "words academics use", "words most trans people use", and "words a minority of incels on 4chan use", all in one category, good luck figuring out which are which if you don't already know, sucker reader!). - -sche(discuss)04:13, 22 May 2024 (UTC)
This was brought up in Category talk:African-American Vernacular English#RFM discussion: November–December 2021 as well, and my position is still the same now. I would not support a merger of the two, especially not under the title "Black American English". AAVE is the term most-used in literature and is the academic term that most people working in this field are used to. I do agree that we need a better distinction between AAE & AAVE, but that requires more people that work in the field to contribute. AG202 (talk) 12:16, 22 May 2024 (UTC)
Re needing more people to contribute, I think it's kinda the opposite: from Benwing's comments I get the impression that he could (and in any case, I could) sort our current entries into the correct categories; the problem is stopping people from coming behind us and switching entries (back) to incorrect labels (as in the edit histories of ahun and gibsmedat), and stopping people from adding new incorrect information—stopping people who (for various reasons) just add any and all Black American words as "AAVE" (and any and all Black British words as "MLE", etc). Maybe we could have a "don't change Ladin to Latin" style edit filter that prevents non-autopatrollers from changing "AAE" to "AAVE" (and likewise for the spelled-out forms), and another that tracks new additions of either so they can be monitored. (It might also be helpful to use one of the alternate names that are used for "AAE" that are more clearly distinct from "AAVE", like "Black American English".) - -sche(discuss)16:02, 22 May 2024 (UTC)
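The edit-filter idea in the comment above could be sketched like this. It is illustrative Python, not real AbuseFilter rule syntax, and it rests on assumptions: that the labels appear as arguments of `{{lb|en|...}}` or `{{label|en|...}}`, that the alias sets below are the relevant spellings, and that the three outcomes (`disallow`, `tag`, `allow`) are what the filter would return. All function names are hypothetical.

```python
import re
from collections import Counter

# Assumed label spellings; the real alias lists may differ.
AAVE_LABELS = {"AAVE", "African-American Vernacular English"}
AAE_LABELS = {"AAE", "African-American English"}

# Matches English label templates and captures their remaining arguments.
LABEL_TEMPLATE = re.compile(r"\{\{(?:lb|label)\|en\|([^{}]*)\}\}")


def label_counts(wikitext: str) -> Counter:
    """Count AAVE and AAE labels inside English label templates."""
    counts: Counter = Counter()
    for match in LABEL_TEMPLATE.finditer(wikitext):
        for arg in match.group(1).split("|"):
            arg = arg.strip()
            if arg in AAVE_LABELS:
                counts["AAVE"] += 1
            elif arg in AAE_LABELS:
                counts["AAE"] += 1
    return counts


def review_edit(old_text: str, new_text: str, is_autopatroller: bool) -> str:
    """Return 'disallow', 'tag' or 'allow' for a proposed edit."""
    before, after = label_counts(old_text), label_counts(new_text)
    # An AAE label went away and an AAVE label appeared: treat as a swap.
    swapped = after["AAVE"] > before["AAVE"] and after["AAE"] < before["AAE"]
    # Any newly added label of either kind gets tagged for monitoring.
    added = after["AAVE"] > before["AAVE"] or after["AAE"] > before["AAE"]
    if swapped and not is_autopatroller:
        return "disallow"
    if added:
        return "tag"
    return "allow"
```

For example, changing `{{lb|en|AAE}}` to `{{lb|en|AAVE}}` would be disallowed for a non-autopatroller but only tagged for an autopatroller, while a brand-new `{{lb|en|AAVE|slang}}` would be tagged for review either way.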
If that's what's happening, then yes I'd support a filter being made. I'll also mention that some terms will definitely have to be listed under both AAVE and AAE (likely a good chunk of terms under AAVE currently). AG202 (talk) 16:11, 22 May 2024 (UTC)
I assume any term under AAE can also go under AAVE but I don't see why we need to do that. If we put all the non-AAVE-specific terms under AAE, then since AAVE is a subcategory, I don't see why we need to duplicate everything under AAVE as well. As for a filter, yes I'd support that. Benwing2 (talk) 18:09, 22 May 2024 (UTC)
Right, things should not normally be categorized into both a parent and a child category; only things that are specific to AAVE should be in the AAVE category. If they're general AAE (not specific to AAVE), they should just go in the AAE category, in the same way that when a word is general British English we only put it in the British English category, we don't normally also list it under Bedfordshire English, Berkshire English, Essex English, Isle of Wight English, Kentish English, Oxfordshire English, Sussex English, etc. - -sche(discuss)19:25, 22 May 2024 (UTC)
@-sche: I guess the issue for me is that it'd be hard for me to see the vast majority of terms currently labeled with "AAVE" as just having the label "AAE," when "AAVE" is much more common in the public sphere when referring to marked Black American lexicon. Also, do Black Americans that don't speak AAVE or another marked variety like African-American Appalachian English even have their own unique universal lexicon? Speaking as a Black American who moved to the US very young and didn't learn/speak AAVE until later in life, I don't think I've really seen anything that would fall under that grouping, nor have I seen anything yet in scholarship. As a comparison, I'm sure that there are some Americans that use kitchen paper, but that doesn't make it any less of a British term. The page for African-American English also doesn't even mention any lexical differences for standard AAE: "This variety exhibits standard English vocabulary and grammar but often retains certain elements of the unique AAVE accent", which leads me to believe that there's actually no distinction in terms of lexicon, even if you consider AAE to be separate (which most sources do not, from what I've seen).
Hm, I guess I've changed my mind and would support some kind of merge, but I still don't support renaming it to "Black American English" as that term sees little currency. These are the options that I see currently:
1. Status quo: leaving it as is, trying to aim for a separation that likely does not exist in terms of lexicon.
2. Merging our current AAE terms into AAVE, following the term that's most known by readers, even if the "Vernacular" portion is a little bit misleading. This is the terminology that Wikipedia prefers as well. This would also involve leaving Category:African-American English as solely a parent category for AAVE, African-American Outer Banks English, etc., with no terms of its own.
3. Merging AAVE into AAE. There's been a push recently, led largely by Lisa Green to rename AAVE as AAE, as stated in her book, African American English. This term is also going to be used by the forthcoming Oxford Dictionary of African American English. This option runs into possibly alienating a large number of readers who are more familiar with "AAVE." She also mentions that not all Black Americans speak AAE to begin with, which runs into the same issue that this thread was started with (but that imho is a non-issue).
Today, while some researchers choose to use African American English, others African American Vernacular English (AAVE) and still others African American Language, they are all referring to the same variety – that which I have defined in the introductory statements in this book and will discuss throughout this work. Also, although I refer to African American and African American community, I do not intend to imply that this linguistic variety is associated with all African Americans – it is not – any more than I intend to suggest that all African Americans are a part of some large abstract community.
At this point, I'm personally oscillating between options 2 & 3, with a slight preference for 2 as readers would be more familiar with the term. AG202 (talk) 20:42, 22 May 2024 (UTC)
Re "do Black Americans that don't speak AAVE or another marked variety like African-American Appalachian English even have their own unique universal lexicon?": this is an interesting question. Is our current way of having categories (not just here) improper? Like: we have "gay slang" and "transgender slang" categories even though (as I mentioned) there is not one singular slang that all gay or trans people use. Likewise, Category:American English is not universally used by all Americans: I wonder if even half of Americans know what Adam and Eve on a raft is, and 0-dark-thirty is specifically US military slang (but instead of having a category for "US military slang", we put it in "American English" + one category for all nations' "military slang"). Few of our categories seem to posit that their contents are one unique lect or lexicon that all members of the named group universally use. Is that a problem? I don't know. What do you think? I suppose one option, both in this specific case and in general, would be to see if we could add more specific labels/categories, and see how many non-AAVE entries we could move out of the AAVE category into new categories (or possibly, in some cases, leave uncategorized but with usage notes). E.g., one of the terms that prompted previous discussions of this was Caucasity, because:
on one hand, some people wanted to indicate that it was originally/mainly used by Black people, even though of course not monolithically all Black people. I see some people in media or social media use it — e.g. Black Twitter and media like thegrio and BET, and Black podcasters — but I haven't heard e.g. mostly-offline old people use it. (What about you, what is your perception of it?)
and on the other hand, other people correctly objected that it's not AAVE, it's used by GenAm speakers who are Black more than by AAVE speakers (as far as anyone could tell in the previous discussions of it. Is your perception different?)
So maybe we could add a category like "Black social media slang"(?), and likewise add other specific categories, and in the general case, maybe we could separate e.g. incel/4chan-type trans slang from general trans slang, older e.g. bathhouse-era gay slang vs modern Gay Twitter slang, Grindr slang, etc, etc?... OTOH, such categories might be too narrow to contain much, and terms have a tendency to 'bleed' (e.g. what I would refer to as 4chan-type slang also gets used by 4channers when they're on twitter), so maybe the (current?) lumpy approach is ?easier/better?, putting all types of gay slang in "gay slang", all non-AAVE-or-other-subcategorized AAE in "AAE", etc, even if not all gay people universally and not all Black Americans universally use all the words in the "gay slang" or "AAE" categories...? - -sche(discuss)22:31, 22 May 2024 (UTC)
@AG202, Benwing2, unless you see issues / have objections (?), I intend to at least proceed with creating a separate label / category for the caricatures like "debbil", "ebery", "gibsmedat" (made up by white authors rather than, or at least much more so than, used by Black speakers AFAICT; can you discern any different?), so they're not in the "AA(V)E" category(s)... something like "caricatures of Black speech", or do you have any thoughts on a better name/category? (If we felt like being blunt we might say "racist caricatures of Black speech", but racists would definitely fuss.) IMO we should split or pare some other categories similarly, e.g. a lot of 4chan transphobes' slang is currently double-categorized into the "transgender slang" category (apparently because some minority of trans people who use 4chan have supposedly also used it), but it is so much more characteristically 4chan slang, and not characteristically trans slang, that it should only be categorized as 4chan slang (similar to how it'd be misleading to say favour was both a British spelling and an American spelling, just because a minority of Americans do use British spellings, a minority of trans people use 4chan, etc). I'll see about fixing that next. - -sche(discuss)15:20, 23 June 2024 (UTC)
Do we have a factual basis for entering and retaining such relatively fine categories and for establishing the state of mind of, say, 19th century white authors? Or is it a matter of current assertions about such states of mind or of current evaluative perceptions of such terms? Do we even have good evidence for those? I suppose having a separate template and forum for challenging such things would be a bad idea for the kind of controversy they would lead to, but I think we need to take care about both fine and evaluative distinctions in labels, templates, and categories. DCDuring (talk) 16:06, 23 June 2024 (UTC)
Ukrainian IPA transcriptions—in particular concerning the vowel И
I’ve noticed that many (the majority) of Ukrainian entries here on English Wikipedia transcribe the Cyrillic letter И as , when in fact it should—in most every case (but not all)—be transcribed as (see, for example w:Help:IPA/Ukrainian). This is especially obvious when comparing the IPA transcriptions for the same word on Ukrainian Wikipedia and English Wikipedia—here are a few examples:
It seems that most of the cases where the erroneous IPA transcriptions show up were added by the bot User:WingerBot, and the IPA transcriptions tend to be more correct with regards to this letter when a real user (who presumably knows Ukrainian) had added the transcription prior to the bot having done its work on the Ukrainian corpus here on English Wikipedia. For example, in the entry ринок, the transcription, provided by a real user, has been given correctly as .
I'm not sure if the bot could be reconfigured to go through these entries and fix them, but this would be preferable to having to go through all these entries manually, especially because not all Ukrainian entries are affected, and also there are indeed cases where ought to be the IPA transcription for И.
@Hermes Thrice Great I don't think this has anything to do real users vs bots, because all of these entries seem to use {{uk-IPA}}, including the one you've given as an example of not having the problem. Evidently a distinction is made in the underlying module somewhere (which looks to be down to whether the syllable is stressed or not). It would be a really bad idea to go through all of these manually, because that would decouple the entries from the template, meaning that it's much harder to ensure entries are kept in sync. Theknightwho (talk) 16:18, 12 April 2024 (UTC)
@Theknightwho Sorry, you’re right, I should have looked into the template code. Yikes, this is a real mess then.
/ɛ/ and /ɪ/ approach [e], which may be a shared allophone for the two phonemes.
This is equivocal as to whether these two sounds are actually the same; I imagine it depends on how formal the speaker is being. Maybe Anatoli can comment. Benwing2 (talk) 18:55, 12 April 2024 (UTC)
Yes, that’s why I said in some cases it makes sense to transliterate it that way. And yes, it may approach a shared allophone in some cases, but especially at the end of a word in the infinitive form, and especially where there are the two separate vowels Е and then final И in the word, like for example перекладати, the difference should be transcribed. In the example I just linked, you can hear very clearly in the audio the difference in the two vowels, yet the IPA transcription given transcribes them the same, as instead of .
@Benwing2, @Theknightwho, @Hermes Thrice Great: It’s not a mess. The module was based on one classical version of Ukrainian, which may not be common, known or accurate from the modern perspective. It was consistent. I don’t have any objection to change to for the unstressed «и». I’m sure there are cases where the pronunciation is not considered modern or common but there are too many opinions on this. I prefer the scheme to be sourced. Anatoli T.(обсудить/вклад)01:46, 13 April 2024 (UTC)
To my orthoepically untrained ear, "и" sounds closer to in stressed and unstressed syllables, but can tend towards when it occurs unstressed at the end of a word. In any event I share Atitarev's preference for a sourced transcription scheme. Voltaigne (talk) 11:19, 13 April 2024 (UTC)
Ukrainian here. In school, I was taught that е/и approach in prefixes, suffixes and word stems, but not in word endings. (for some reason -ти is an exception, it also shouldn’t be .) So, for example, ‹читати› using the Ukrainian pronunciation spelling system would be . And I guess in IPA it would’ve been . БудетЛучше (talk) 11:42, 24 June 2024 (UTC)
Mainspace Proto-West-Germanic?
ᚲᚨᛒᚨ and Proto-Germanic *kambaz are now in CAT:E because the former has been created in mainspace as a Proto-West-Germanic entry and the latter links to it. It's true that ᚲᚨᛒᚨ is known from writing on a comb from the 3rd century that was found near Erfurt or Frienstedt in Germany. It apparently isn't Old Norse, which is allowed in mainspace, or East Germanic. That leaves either Proto-Germanic or Proto-West-Germanic. I don't know enough about either to say what should be done, but we can't leave things the way they are. Chuck Entz (talk) 23:50, 12 April 2024 (UTC)
I see two alternative approaches:
1: downgrade mainspace refusion errors in proto-languages rarely ever attested to editor-directed warnings on editing and on read with {{attn}}, which we now see on every page with a gadget, and categorization of such pages into “mainspace entries of a language almost always only reconstructed”, since we don’t even have categorization of the error-affected section at all. I.e. raise maximum attention but still operate.
2: Or add some exception list for particular pages that only people privileged due to knowing what they do (autopatrollers) can modify, or even know about, which would raise enough awareness. I.e. have a preventive approach and annoy during the editing process already.
Personally I prefer the second as the more aggressive approach.
The appropriate language name though is probably Proto-West Germanic, in view of the dating and ending, having been created before we introduced Proto-West Germanic. Fay Freak (talk) 11:54, 13 April 2024 (UTC)
A list of approved mainspace pages could be a good idea for a case like this where the language will have very few, iff it isn't too memory-"expensive". My reaction would've been to just change PWG to function like Proto-Norse, if some terms in PWG are attested (if we take the attested terms to be PWG), but that has the downside that then people aren't warned if they forget to put the asterisk in front of any of the more numerous unattested words. - -sche(discuss)14:09, 13 April 2024 (UTC)
@-sche Small lists of exceptions like this aren’t the kind of thing that impact memory use in a problematic way, so I wouldn’t worry about that. Theknightwho (talk) 14:19, 13 April 2024 (UTC)
Though a better solution might be some kind of manual override (an “anti-asterisk”), for use with reconstructed languages, which could be placed in headwords/links to suppress the error. It would have to be something relatively obscure, so it couldn’t be entered by mistake, and unlike an asterisk it wouldn’t actually display. Theknightwho (talk) 14:22, 13 April 2024 (UTC)
This has lower maintenance and hence annoyance than an exception list; we would have a tracking category filled by the parametral override, for such rare cases then, in place of a module-data exception list—though categories have memory impact, it would barely ever not be on low-stress page-titles. Fay Freak (talk) 17:52, 13 April 2024 (UTC)
Maybe ‼foobar or ꜝfoobar? Inspired by "sic!", but thinking !foobar itself might not be the best choice because someone might actually type that. Just spitballing. (Obviously, past some threshold, a reconstructed proto-language has enough attested terms to merit just treating it as attested, but I don't know if we want "one attested term" to be the threshold.) - -sche(discuss)20:49, 13 April 2024 (UTC)
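To make the two ideas in this thread easier to compare (an exception list versus a non-displaying "anti-asterisk"), here is a minimal Python sketch. It is not Wiktionary's actual Lua module code: the names `RECONSTRUCTED_ONLY`, `MAINSPACE_EXCEPTIONS`, `OVERRIDE_PREFIX` and `check_term` are hypothetical, and both the language code `gmw-pro` and the choice of ‼ as the marker are assumptions made only for illustration.

```python
from typing import Tuple

# Hypothetical data; the real checks live in Lua modules and may differ.
RECONSTRUCTED_ONLY = {"gmw-pro"}               # languages normally only reconstructed (assumed code)
MAINSPACE_EXCEPTIONS = {("gmw-pro", "ᚲᚨᛒᚨ")}    # idea 1: explicit list of approved mainspace pages
OVERRIDE_PREFIX = "‼"                           # idea 2: "anti-asterisk" marker (assumed character)


def check_term(lang: str, term: str) -> Tuple[str, bool, bool]:
    """Return (display_form, is_reconstruction, needs_tracking).

    Raises ValueError when an unmarked term is used in a language that
    normally has only reconstructed entries.
    """
    if term.startswith("*"):
        # Ordinary reconstruction: the asterisk is kept and displayed.
        return term, True, False
    if term.startswith(OVERRIDE_PREFIX):
        # Manual override: strip the marker so it never displays,
        # and flag the page so a tracking category can be added.
        return term[len(OVERRIDE_PREFIX):], False, True
    if lang in RECONSTRUCTED_ONLY and (lang, term) not in MAINSPACE_EXCEPTIONS:
        raise ValueError(f"{lang} term {term!r} needs an asterisk or an explicit override")
    return term, False, False
```

In this sketch both mechanisms coexist, which shows the trade-off discussed above: the exception list has to be maintained centrally, while the override is typed at the point of use and can feed a tracking category.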
"Forget to put asterisks in reconstructions"? If a word is a reconstruction, you'll never forget to put an asterisk if you know what your doing. Did such mistakes happen before? And who is going to make such a mistake? Is it not a policy we have here on Wiktionary to put asterisks on reconstructed words? Why worry yourself then? Tollef Salemann (talk) 19:48, 13 April 2024 (UTC)
Would that that were so! Sadly, not everyone knows, or at least manages, to do everything right all of the time... even in your own comment you forgot to put the apostrophe that goes in know what you're doing — or perhaps that was your point/joke, heh. The number of unattested or attested terms that have been RFVed, RFDed or RFMed to move them either to or out of the reconstruction namespace is non-trivial, and is just a fraction of the cases where someone incorrectly omitted or used an asterisk. - -sche(discuss)20:49, 13 April 2024 (UTC)
Potentially irrelevant question: how certain are we that the inscription on the Erfurt-Frienstedt comb even means "comb"? It seems odd for a comb to be labelled "comb". Could it just be saying that the comb was owned by someone named "Kaba"? Ioaxxere (talk) 01:23, 14 April 2024 (UTC)
Nothing is certain, but some variation on this is exactly what would be expected as a word for comb in that time and place based on comparing attested terms in related languages. It's not a coincidence that the ancestor of this term is reconstructed as *kambaz. Also, writing in early Germanic cultures had dimensions not found in modern culture- it had some kind of intrinsic magic. Many of the objects from that period had explicit references to the evil things that were supposed to happen to anyone who stole them. The circumstantial evidence points to this being some kind of offering that was intentionally placed where it was, so the writing may well have had some deeper purpose. Chuck Entz (talk) 02:47, 14 April 2024 (UTC)
Actually, in Old Norse runic inscriptions, most short inscriptions on objects ending with -a are in fact names of the owners (where the "a" ending is a shortening of "á mik"). But in PWG it is not going to work the same way with this ending, of course. Anyway, inscriptions on objects that name the object can be made not only with a magical purpose, as Chuck says, but also just for fun (like Mr. Tenevil did with object inscriptions in Chukotka). Tollef Salemann (talk) 07:45, 14 April 2024 (UTC)
Also, I don’t remember exactly how many, but there are some other objects with runic inscriptions on them giving the name of the object and not the owner. On the other hand, if the owner’s name is used, it is often used in the context of an ownership phrase, not the name alone. Not sure how it works for Germany tho. Tollef Salemann (talk) 07:55, 14 April 2024 (UTC)
For now, I have commented out the "reconstructed" field so that, like Proto-Norse, this is allowed to have entries. If an "anti-asterisk" is added, we could undo that (if we consider that unlike Proto-Norse, the number of attested terms here is too small). - -sche(discuss)04:49, 17 May 2024 (UTC)
clarify boilerplate on categories for terms derived from vs topical to fiction
"CAT:English terms derived from Star Trek" is for terms like cloaking device, which was coined by Star Trek. (In this case, cloaking device happens to also not belong in CAT:en:Star Trek, because it's a general-English word and no longer has any closer connection to Star Trek than captain does.)
"CAT:en:Star Wars" is for terms like Glup Shitto, which is directly about Star Wars. (Glup Shitto does not belong in "...derived from Star Wars", because it's not used in Star Wars or derived from anything that is.)
Some terms, e.g. Dalek, both derive from and are directly about (a race from) a fictional franchise, and belong in both a topical category and a "derived from" category.
But sometimes a franchise coins a term X, or the franchise itself is named X, and someone else modifies X to XY: is XY "derived from" the franchise?
Does Vaderesque belong in "terms derived from Star Wars"? (Or Voldemortian in "terms derived from Harry Potter"?) The franchises only coined "Vader" and "Voldemort", someone else added the suffixes.
Does Warsie belong in "terms derived from Star Wars"? (Or Trekkie in "...derived from Star Trek"?)
Does Hand Solo belong in "terms derived from Star Wars"? It derives from modifying the name of a character.
Asking here rather than in the December TR in hopes of reaching more people. How people feel about these will help with rewriting the boilerplate: if the answer to any is "yes", we should change the "derived from" categories' text from saying terms "originated in" the franchise, to saying ~"originated in or are derived from", whereas if the answer to all three is "no" and we exclusively want terms coined by the franchise, not just derived from it, then we should rename the categories from "derived from" to "coined by"... - -sche(discuss)05:15, 13 April 2024 (UTC)
PS how 'expensive' would it be for the modules that generate the categories' boilerplates to check, on either type of category, whether a corresponding category of the other type exists (or failing that, to let users input the other category into e.g. a "crossreference=" parameter), and link each category to the other so that users are aware of both and hopefully glean the distinction from comparing them? - -sche(discuss)05:15, 13 April 2024 (UTC)
The "core difference" mentioned above seems somewhat different to what was mentioned by some editors during the Tea Room discussion, since it was posited there that "Category:English terms derived from Star Wars" includes all terms etymologically derived from terms invented in the franchise, while "Category:en:Star Wars" includes terms semantically related to the franchise.
I don't think there is anything inherent in the way either of the above categories is named which points one way or another. Derived is quite a general word. Thus, I think it is open to us to define the categories in whatever way achieves consensus.
I think terms like Hand Solo ought to be in one of these categories, because—like it or not—I'll bet editors will be adding them to such categories anyway.
Ideally, the categories should be defined in such a way that a term belongs in one or the other category, not both. (I assume "Category:English terms derived from Star Wars" will remain a subcategory of "Category:en:Star Wars". Thus, editors should put a term in the subcategory if it applies, and only put it in the parent category if the subcategory is inapplicable.)
I'm minded to suggest the following definitions for the categories:
"Category:English terms derived from Star Wars" is for terms which were coined in the franchise as well as terms etymologically derived from such terms, but excluding the name of the franchise itself unless it is used in-universe within the franchise.
Comments: I wholeheartedly agree that the category should be limited to coinages to exclude general terms like cloaking device and moon which may happen to be used in the franchise. Extending the category to terms etymologically derived from franchise terms means that terms like Hand Solo and Vaderesque will be included.
"Category:en:Star Wars" is for terms which are neither coined in the franchise nor etymologically derived from such terms, but relate in some way to the franchise. Terms etymologically derived from the name of the franchise should be placed here, if the name is not actually used in-universe within the franchise.
Comments: Thus, a term like Glup Shitto would be placed in this category. This category would also include terms like Star Wars Day, Trekkie, and Warsie since (I assume) the terms Star Wars and Star Trek aren't actually used in-universe within the franchise.
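The two proposed definitions above can be read as a small decision procedure. The following Python sketch is one possible reading, not an agreed policy: the `Term` fields, the function name `proposed_category` and the `name_used_in_universe` flag are all hypothetical labels introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    coined_in_franchise: bool = False          # term was coined in the franchise itself
    derived_from_coined_term: bool = False     # e.g. Vaderesque, Hand Solo
    derived_from_franchise_name: bool = False  # e.g. Warsie, Star Wars Day
    topically_related: bool = False            # e.g. Glup Shitto

DERIVED = "Category:English terms derived from Star Wars"
TOPICAL = "Category:en:Star Wars"


def proposed_category(term: Term, name_used_in_universe: bool = False) -> Optional[str]:
    """Pick exactly one category, preferring the subcategory when it applies."""
    if term.coined_in_franchise or term.derived_from_coined_term:
        return DERIVED
    if term.derived_from_franchise_name:
        # The franchise name only counts as a franchise coinage if it is used in-universe.
        return DERIVED if name_used_in_universe else TOPICAL
    if term.topically_related:
        return TOPICAL
    return None  # unrelated to the franchise: neither category applies
```

Under this reading a term never lands in both categories, which matches the stated goal that the subcategory is used when applicable and the parent category only otherwise.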
For want of a better idea to draw in more input besides us three, here's a poll... may we get an actionable result re: at least the unclarity of "derived from"... (modified boilerplate to clarify etymological derivation/coinage vs semantic/topical relation seems like it should be easier to agree on). - -sche(discuss)22:34, 28 April 2024 (UTC)
poll: rename "derived from" fiction cats to "coined in", or allow non-coinage derivations
Option A: rename Category:English terms derived from Star Wars and other "...terms derived from ..." categories to "...terms coined in..." (or "...terms coined by...") to clarify that it should only include terms coined in Star Wars.
I.e., only terms like Wookiee, but not other terms derived from Star Wars like Vaderesque (the Star Wars films only coined Vader; the suffix -esque was added by someone else) or Hand Solo (Star Wars only coined Han Solo, it was modified to Hand by someone else).
Option B: leave Category:English terms derived from Star Wars et al. named as-is, and include all terms etymologically derived from the fiction franchise (including e.g. Vaderesque or Hand Solo) not merely terms coined in it.
Option C: (1) Create a new category "Category:English terms coined in Star Wars" strictly for terms that are coined for and used in the franchise and excluding any etymological derivations therefrom (e.g., Wookiee, but not Star Wars or Vaderesque). (2) Reuse "Category:English terms derived from Star Wars" for terms that are etymologically derived from terms in the first category (e.g., Vaderesque) or semantically related to the franchise (e.g., Star Wars, Warsie). (3) Make both of these subcategories of "Category:Star Wars", which is only a parent category and should not have any entries in it.
Option D: move terms coined in the franchise to "...terms coined by..." categories, leaving the "...terms derived from..." categories for terms derived in other ways like Vaderesque (which derives from the franchise-coined name Vader, but does not itself appear in Star Wars). ("CAT:en:Star Wars" will continue to house semantically related terms like Glup Shitto, including, as in that case, when they're not derived from Star Wars and so don't belong in any "derived..." or "coined..." categories.)
As you've currently worded it, Option C would include any terms "semantically related to the franchise", even if the terms are not derived from the franchise, in the "derived from" category (so e.g. it would categorize Glup Shitto, which is not derived from Star Wars, as "derived from Star Wars"). Is this your intention? If so, I must strongly oppose C, as it would introduce incorrect information. I could support an "Option D" of moving terms coined in the franchise to "...terms coined by..." categories, leaving the "...terms derived from..." categories for terms derived in other ways like Vaderesque, and letting "CAT:en:Star Wars" continue to house semantically related terms like Glup Shitto. (Indeed, I didn't even mention "CAT:en:Star Wars" in my proposed options A and B because I assume it will continue to house semantically related terms like Glup Shitto; if you're envisioning Option C as meaning there's no category specifically for semantically related terms, I must oppose C on that ground, too.) - -sche(discuss)22:00, 29 April 2024 (UTC)
@-sche: yes, that was how I intended to redefine “Category:English terms derived from Star Wars” (that is, to include both terms etymologically and semantically related) because, as I expressed earlier in this discussion, the word derived doesn’t have a strict meaning (terms semantically linked can also be said to be derived), and it seems better to put such terms in a subcategory than just leaving them in the parent category. But I’m OK with Option A as I’d rather there be some clarification on how “Category:en:Star Wars” and “Category:Terms coined in Star Wars” should be used rather than no resolution of the issue. Option B (which I think I previously suggested too) and Option D (“Category:Terms coined in Star Wars” strictly for terms used in the franchise, “Category:Terms derived from Star Wars” for terms etymologically derived from terms used in the franchise, and everything else in the parent category “Category:en:Star Wars”) are also fine with me, though I wonder if Option D is overly complicated for editors. — Sgconlaw (talk) 23:20, 29 April 2024 (UTC)
@Sgconlaw: We could keep the derivation categories for terms that are etymologically related to a work in any way, as opposed to only terms that appear in a work. J3133 (talk) 17:40, 7 May 2024 (UTC)
I don't have much preference between A or B or D (but strongly oppose C). Preference for B, because it's simpler than my second choice of D; third choice is A but it leaves many derived terms categoryless. (Oppose C, which is the worst of all worlds as discussed above, leaving some terms without a sensible category and making other categories into mishmashes of unrelated things.) - -sche(discuss)22:00, 29 April 2024 (UTC)
Option B, as a solution that straightforwardly divides terms into etymological and semantical categories, not leaving any terms relating to a work without a category. J3133 (talk) 03:00, 8 May 2024 (UTC)
French Translingual
There used to be five entries in the nonexistent category Category:French Translingual: ĵ, ŝ, ꞓ, ꞓ̂ and ꭒ. These are due to User:Kwamikagami; they used to e.g. define ĵ as "j with a circumflex" but Kwami changed them to try and capture some of their common uses, in this case by changing the definition to
# {{lb|mul|phonetics|French}} {{ng|The affricate sound of English '']''.}}
This was leading to the weird categorization. I made a change a couple of weeks ago to add language restrictions to labels like French so they only are supposed to categorize for a subset of known good languages. This was broken but I just fixed it, so now these entries no longer categorize. This leaves the question, though, of what the definitions *SHOULD* be. I think Kwami's point was that the definition "j with a circumflex" doesn't carry much information, but OTOH I'm sure ĵ is used for purposes other than the now-stated one. Thoughts as to how to make the definitions better? Benwing2 (talk) 21:47, 13 April 2024 (UTC)
Does "French" mean this is only used (in this way) in the French language? Then the definition should be in a ==French== section, no? Or is there a "Frenchist" school of phonological notation the way there is an "Americanist" one (that uses "y" for /j/, etc.), and people writing in multiple languages (but using this school's notation) would use this letter this way? Then I would think that either needs a different label, or needs to be in the definition, a la # {{ng|The affricate sound of English '']'', in Frenchist phonological notation.}} replacing "Frenchist" with whatever the appropriate term is. (Even if using labels in this way no longer categorizes, we probably still want to track it, because it seems like ... well, a sign of an entry which should be cleaned up ...) - -sche(discuss)23:09, 13 April 2024 (UTC)
@-sche That's a good idea, I will create some cleanup categories or tracking pages. I think the intent was to express the idea of a "Frenchist" school of notation but I don't know if any such thing actually exists. Benwing2 (talk) 23:31, 13 April 2024 (UTC)
Maybe any label which—with the language being mul—would normally generate "Category: Translingual", could generate a cleanup/attention category instead? (Maybe any regional label + Translingual should also be monitored, e.g. {{lb|mul|Appalachia}} / "Category:Appalachian Translingual"?) If it would not be too difficult or expensive, it would also be useful to track/categorize cases of "Category: ", e.g. "Category:German Irish", another persistent type of erroneous use of labels (when people use "Germany" etc as a topic label); the few categories where that is actually correct, e.g. "Category:Vietnamese Chinese", would need to be exempted. - -sche(discuss)23:52, 13 April 2024 (UTC)
@-sche BTW I am skeptical there is any such thing as a Frenchist phonetic school; the IPA was created in fact by French linguists and all French dictionaries I've ever seen use the IPA. Benwing2 (talk) 00:12, 14 April 2024 (UTC)
I suspect that Initiation à la phonétique is idiosyncratic to that author.
The ĵ, ŝ, ꞓ, ꞓ̂, ꭒ convention is used for transcriptions of other languages. It is a "Frenchist" system, in that AFAICT it's only ever found in French-language texts. But even so, it differs from e.g. <ă> in the Merriam-Webster and Random House transcriptions found in American dictionaries in the sense that the MW characters are used in English for English. For ĵ, ŝ, ꞓ, ꞓ̂, ꭒ, they're used in French-language texts to spell out other languages phonetically. I'd naively expect that to be covered by the word "translingual": it's rendering one language in a script legible in another.
A similar situation might be ʹ. Let's suppose for the sake of argument that, in its entry for being a transliteration of the Cyrillic soft sign, it only occurs in English-language texts. (ʹ of course is not so restricted, but we may be able to find Library of Congress transliterations that are.) Should that sense then be listed under 'English'? The words it appears in are never English, only Russian etc. If we placed it under 'English', I suspect our readers would understand it to mean that it's English. And if they were searching for it to understand Russian transliteration, would they know to look for it under 'English'?
If the Americanist phonetic notation were only attested from use in English-language sources, would we reclassify all the symbols as 'English'?
If we place the 'Frenchist' letters under 'French', and discover that people writing in other languages of Francophone countries, such as Wolof, used them in books written in their languages, would we then need to move them back to 'Translingual'?
I'd think that a system designed to represent one language in the context of another shouldn't be identified as being either, but as a cross-linguistic transcription. Anyway, the reason I'd tagged it as both 'phonetic' and 'French' is that it appears to be specifically designed to be understood by readers who assign French values to the letters of the Latin alphabet. kwami (talk) 19:23, 21 April 2024 (UTC)
Here (and on the preceding page 13) it's used to transcribe 'patois' pronunciations; I don't see which "Vincelles" is being discussed, but the 'dialect' appears to be Franco-Provençal.
There's a clearer description (though the copy is too light to be easily legible) here, in a French-language description of a Greek dialect. kwami (talk) 20:09, 21 April 2024 (UTC)
Hmm, I think I see your point. But it seems very debatable to me. I do think it would be clearer to explain the situation within the definition, rather than to {{label}} it as French Translingual; something like "A symbol used to in French." And while I can see your argument for having it as Translingual, if it's only used in French, I still see the argument that it's just French ... and I can see the argument that it may not be the kind of thing to include at all. I mean, we don't say ш is used to transliterate/respell English sh and German sch, and we don't say ja is used to respell я, it seems like some things are considered to be nonlexical. - -sche(discuss)21:09, 21 April 2024 (UTC)
I don't see any qualitative difference between this and IPA, NAPA or Merriam-Webster transcriptions. True, we don't give ш as the Cyrillic transliteration of German sch, and doing so would mean potentially adding a huge number of additional definitions. Since we're English WK, should we restrict transliterations and phonetic transcriptions to those used in English-language sources?
The reason I added these in particular was that we had translingual sections in these articles that didn't have any actual content, and no evidence of translingual use. I've gotten in trouble for deleting sections with no content, but didn't want to leave them in such a bad state, so I did a search for translingual use and this is what I was able to find. kwami (talk) 21:33, 21 April 2024 (UTC)
@Kwamikagami, -sche, Benwing2: This system of phonetic representation appears to be a creation of Jean-Pierre Rousselot. AFAIK, it originates in:
Can you see what the difference between 'nasale' and 'demi-nasal' is? Can that be rendered in Unicode?
Benwing, I thought my original sources were a bit later, but still early 1900s. Around the era that Americanist notation was being developed. kwami (talk) 23:09, 21 April 2024 (UTC)
@Kwamikagami But AFAIK the Americanist notation is still in use, and in any case has seen quite widespread adoption; my concern here is that these symbols might be idiosyncratic to one or a few authors from a particular time period. Benwing2 (talk) 23:11, 21 April 2024 (UTC)
I assume that they're defunct, just as many Americanist symbols are. (The system remains in use, but with a reduced inventory that gets closer to IPA over time. Some Americanist symbols aren't even supported by Unicode.) But this system does seem to have been used by a number of authors.
Anyway, the reason I added these was to have some content in the translingual sections of those articles, not because I thought they were particularly notable. My first impulse would be to delete those sections, but I've gotten burned doing that before. I wouldn't object to them being deleted though. kwami (talk) 23:21, 21 April 2024 (UTC)
@Kwamikagami: “Nasalité” is discussed in Rousselot 1887 (15–16), but I didn't glean the difference thence (but then, my French is poor). The graphic difference between “nasale” and “demi-nasale” is extremely slight (almost nonexistent) in both Rousselot 1887 and his 1891 recapitulation. I'll have a look at Unicode tildes to see whether they encode it.
@Benwing2: I don't know whether this system is still in use, but the texts that make use of it still exist, so the information is still valuable, IMO.
He doesn't mention 'demi-nasal' there, though he does speak of weak nasalization, where you need to place a mirror under the nose to even tell that it's there. I don't know if that's what he meant or not, but I was more concerned about whether the text could be digitized. kwami (talk) 00:14, 22 April 2024 (UTC)
There's an encoding problem with the Vietnamese "apex", which really is just a tilde; the problem is that the Unicode tilde is used as a tone mark, so something else needs to be used for the true tilde (apex). So if there were a solution to Rousselot notation, that might could be used for Vietnamese as well. An IPA diacritic is used on Wiktionary and Wikisource, but that's not ideal, and the medievalist -ur won't work for various reasons I don't fully understand. kwami (talk) 00:18, 22 April 2024 (UTC)
@Kwamikagami: Judging by this image, that apex doesn't look like a tilde. w:Vietnamese apex uses ◌᷄ for that diacritic, which is imperfect, but probably as close as can be got using Unicode. Rousselot probably just used a different typeface's tilde for the demi-nasale. Or maybe one of the tildes is actually a perispomene (◌͂). Alternatively, but rather less probably, France had already had Vietnamese territories for fifteen years at the time Rousselot devised his système graphique, so one of those tildes might even be that Vietnamese tone mark (probably not the apex though, since that had been superseded by -ng long before then). 0DF (talk) 01:09, 22 April 2024 (UTC)
The modern, wavy form of the tilde is rather late. In the era the apex was used in Vietnamese, it looked just like the tilde in Portuguese, which was called the 'apex' at the time. The Vietnamese tone mark was evidently the perispomene, and got miscoded in Unicode. kwami (talk) 01:12, 22 April 2024 (UTC)
@Kwamikagami: Yes, the serpentine perispomene is, strictly speaking, incorrect, but I've seen it many times in nineteenth-century Greek texts. 0DF (talk) 01:37, 22 April 2024 (UTC)
bullet points, usage notes and etymologies
Diff was reverted, but wasn't it correct to add the bullet point? Don't we normally bullet usage notes? (And on the topic of bullet points, don't we normally not bullet etymologies? Because I sometimes see people bullet them.) Do we have enough consensus about whether these sections should vs shouldn't be bullet-pointed to add something to WT:ELE about it? - -sche(discuss)00:34, 14 April 2024 (UTC)
@-sche Yes, I have gone through several times and added missing bullet points to Usage notes. I thought this was the standard, and also I agree there shouldn't be bullet points in etymologies. Benwing2 (talk) 00:39, 14 April 2024 (UTC)
I’m not sure a bullet point is needed if there is only one usage note, but don’t mind if one is added. On the other hand, I do think that it is occasionally desirable to use bullet points in etymologies for readability, for example, if a term is partly derived from two languages. — Sgconlaw (talk) 03:23, 14 April 2024 (UTC)
@Sgconlaw: That's true, bullet points are useful if an etymology often takes the form of a list, such as at kibosh. But I think we all agree that it shouldn't be the default, so accomplished for example shouldn't have a bullet. Ioaxxere (talk) 04:54, 14 April 2024 (UTC)
I agree with this, kibosh is fine. If we add something to WT:ELE my suggestion, subject to wordsmithing/improvement please, would be along the lines of: either "The first sentence of an etymology should not be bulleted. Other parts of the etymology may be bulleted if they are a list." if we think lists should always be introduced by something unbulleted ("Uncertain. Theories include:", etc), or at least "The paragraphs/sentences of the etymology section should not be bulleted unless they are a list." IMO a list must also contain more than 1 item, a "1-item list" is not a list and should just be unbulleted; for bullet points to be used, there should be multiple items IMO. - -sche(discuss)15:20, 14 April 2024 (UTC)
FWIW, as of a database dump from last August (what I had handy), 3,834 pages contained Etymology===\n\* like abdominothoracic (very many of them added by just one user), and 24,770 pages contained Usage notes====\n like Abrahamic. I did this just to quickly get a rough figure, it does not account for numbered etymology sections or people using Usage notes at L3 or L5, and some pages have probably changed since August, to stop having the problem or to newly have the problem. Iff we agree on fixing these, maybe a bot (operating from a more up-to-date list) could remove the bullet from any etymologies where there is only one "item"/sentence/paragraph (and any etymologies like this, where the first item is bulleted and the second item is a bulleted affix template?), and we could see whether what's left is a small enough number to go through by hand? (In case some instances of multiple bullet points are OK.) - -sche(discuss)15:21, 14 April 2024 (UTC)
Noting the vast number of unbulleted usage notes, I do not support bulleting them universally. It would be interesting to compare the number of occurrences of Usage notes==+\n with Usage notes==+\n\*. I suspect the former will strongly prevail but I'm open to being proven wrong. This, that and the other (talk) 22:53, 14 April 2024 (UTC)
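For anyone wanting to rerun the comparison suggested above on a newer dump, here is a rough Python sketch. It assumes the page wikitexts have already been extracted into an iterable of strings (dump parsing is not shown), and the function name `count_bulleted` is hypothetical. Unlike the plain `Usage notes==+\n` pattern, the second pattern here uses a negative lookahead so that it counts only sections whose first line is not bulleted, which makes the two figures directly comparable.

```python
import re
from typing import Iterable, Tuple


def count_bulleted(pages: Iterable[str], section: str) -> Tuple[int, int]:
    """Return (pages with a bulleted section, pages with an unbulleted section).

    `section` is the heading text, e.g. "Usage notes" or "Etymology".
    Like the original quick count, this ignores numbered headings
    such as "Etymology 1".
    """
    heading = re.escape(section) + r"==+\n"
    bulleted = re.compile(heading + r"\*")        # first line of the section starts with *
    unbulleted = re.compile(heading + r"(?!\*)")  # first line does not start with *
    n_bulleted = n_unbulleted = 0
    for text in pages:
        if bulleted.search(text):
            n_bulleted += 1
        if unbulleted.search(text):
            n_unbulleted += 1
    return n_bulleted, n_unbulleted


sample = [
    "===Usage notes===\n* Note one.\n",
    "====Usage notes====\nPlain note.\n",
    "===Etymology===\n* From X.\n",
]
print(count_bulleted(sample, "Usage notes"))
```

A page containing both a bulleted and an unbulleted section of the same name is counted once in each total, so the two numbers are per-page counts rather than per-section counts.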
Bullets should be used for lists of similar items, such as:
derived and related terms lists
your shopping list
lists exactly like this one.
Usage notes are not inherently listlike. They do not consist of "items". Rather, they are written using one or more full sentences, so it feels more natural to present them as distinct paragraphs. This also helps to visually distinguish the usage notes from the numbered lists of senses and bulleted lists of terms that generally surround them. This, that and the other (talk) 23:41, 21 April 2024 (UTC)
@This, that and the other: That makes sense. Would you agree with usage notes being bulleted when there are multiple topics in the section (i.e. when they're actually usage notes, rather than a single usage note)? 0DF (talk) 01:14, 22 April 2024 (UTC)
I think TTO's comments make sense except that I think multiple usage notes are a lot more list-like than paragraph-like, since they're generally unrelated to each other. Benwing2 (talk) 01:16, 22 April 2024 (UTC)
As far as I can tell, all of the people who've commented here on the question of bulleting etymologies have said bulleted lists are fine but non-lists like accomplished are not. Is this being discussed somewhere else where someone has proposed a blanket "ban"? - -sche(discuss)17:18, 21 April 2024 (UTC)
...is clearly (and openly) a bot operated by User:This, that and the other. The bot seems to be running some maintenance tasks of a few Wiktionary: namespace pages that contain lists for various tasks. While the user is trusted and the bot doesn't seem to be doing anything that dangerous, it should still in principle have the bot flag, but it does not, nor does it seem a vote was ever started to even obtain one. — SURJECTION/ T / C / L /18:00, 14 April 2024 (UTC)
Well, by the reasoning in WT:BOT, bots need vote approval and control to avert that they run amok and leave messes not cleaned up after. Not the issue for the list-type pages the bot has only edited that are not expected to be created manually in the first place. Fay Freak (talk) 19:54, 14 April 2024 (UTC)
The etymology tree testing thread, which has run for two weeks now, has achieved good results with several minor bugs or problems being fixed and every commenter being happy with the output. Therefore I feel that I'm ready to put the template up for a vote to establish consensus on whether it can be used on mainspace. I think that each language community should establish its own policy on where etymology trees may be used, which may well be "nowhere". However, before I start any vote I would like to address some objections that were made in last month's thread.
@Benwing2 said that we should avoid duplicating information between the template and the etymology sections, and this is something I agree with. It's just that our current etymology sections are fairly inefficient at representing information and can be automated to a significant extent. One of the things I've been working on is automatic text generation, which you can see here.
@Victar commented that we would need "a very complex module, with ways to mark derivational types, certainty, alternatives, mergers, etc." All of these have been integrated into Module:etymon.
So I invite the community to discuss whether the module is ready for a vote, and if so, how it should be worded.
^ Specifically: a) unbolded transliterations, b) fixed overflowing text, c) decreased font size, d) fixed visual bug on Firefox
^ Note that {{etymon}} can run in a "silent mode" which produces no visible output, but passes along information to other entries. This could be added anywhere.
Edge cases may be difficult; some concepts in etymology, like Wanderwörter, are not as easy to convey, reference @benwing2. But it seems you should vote on it. It will also be difficult to display doublets or cognates. ADDSamuels (talk) 01:49, 15 April 2024 (UTC)
Will we have a policy of eliminating lies, such as the claim that most Indic lects are descended from Sanskrit as normally understood and as usually described by Wikipedia? (A solution is to direct the user to WT:About Sanskrit instead of to w:Sanskrit. Dubious edits on Wikipedia to promote the Wiktionary definition of 'Sanskrit' seem to have disappeared.) --RichardW57m (talk) 10:14, 15 April 2024 (UTC)
"our current etymology sections are fairly inefficient at representing information and can be automated to a significant extent."
I think that's a problem that needs to be solved before deploying this. For this to be useful, it needs to be widespread, and for it to be widespread it needs an automated way to be integrated into most of the existing etymologies plus manual effort to integrate it where it can't be added automatically. How can we replace the existing information in the etymology sections with this template? Do you have a parser to convert existing etymology data to this template? How many etymologies can be automatically replaced and how many will need manual work? JeffDoozan (talk) 13:49, 15 April 2024 (UTC)
I think the biggest hurdle would be the fact that each etymology would need an ID. Those with {{etymid}} already present could just give that to the template, but those without could not. You could even probably generate IDs for most entries that only have 1 etymology (i.e. affixed adverbs and the like) and even only a few definitions, but you'd need a way to generate the actual ID name. I foresee a major hurdle with entries with many etymologies/definitions. Vininn126 (talk) 14:09, 15 April 2024 (UTC)
@JeffDoozan: User:Vininn126 is right in that assigning IDs is a significant challenge. Basically, if an entry says "borrowed from X", does it mean X (etymology 1) or X (etymology 2)? Sometimes the ID is specified, but not often enough to be very useful. However, I have been working on getting structured etymological data from Wiktionary. Here's a sample:
For what it's worth, I Support the idea of this template. I think it will attract a lot of readers. Having asked various people online (hearsay evidence, I know, it would be nice to have an official account where we can poll people on other platforms...), they seem to generally love this. It is probably not possible at the moment to deploy it large-scale, but is there anything stopping one from adding it to pages manually? Vininn126 (talk) 08:53, 16 April 2024 (UTC)
@Ioaxxere: The family-tree–style presentation seems wasted on linear descent, as in the case of father. How does this handle branching derivation and cognates or other relations? I see real promise there. 0DF (talk) 02:10, 22 April 2024 (UTC)
@Ioaxxere: Oh yeah, that looks great! Re cognates, I believe your respondents in that section were all under the impression (as I was) that you were proposing a template that would generate mere lists of cognates; IMO, presenting with a table like the one you present in this section would be an entirely different, and considerably more attractive, one. 0DF (talk) 03:57, 22 April 2024 (UTC)
@Vininn126 I think it needs a plan for bot-converting existing entries to the extent possible, possibly with some manual help. Otherwise it will end up underused and a pain to maintain. Benwing2 (talk) 18:05, 27 April 2024 (UTC)
@Benwing2 I think I'd agree - but does the current vote impede that somehow? Or would you rather have a vote to bot-implement it site-wide instead of what is here now? Vininn126 (talk) 18:08, 27 April 2024 (UTC)
I just think we need that plan before we vote on whether to allow it. I don't want us to end up with another {{etymtree}} (for reference, that was what came before {{desctree}} and it used a separate module, or something of that sort, to store descendants rather than scraping them from a given mainspace page; it was too hard to use and ended up poorly deployed, but hung around for years and was a pain to maintain and another pain to finally get rid of). Yes, the trees look nice but unless they're widely deployed, the whole exercise will be futile and will end up a maintenance headache. Benwing2 (talk) 18:42, 27 April 2024 (UTC)
Ordering entries differing in lexicographic spelling
Do we need a policy on ordering the entries of the forms of the same lemma when they appear on the same page? For example, Latin nigra and nigrā are two different entries on the same page, and are different case and gender forms of the same lemma, Latin niger, which is on a different page. Indeed, should such forms have different entries? I see that Latin mala(“bad”) and malā(“bad”) share the same entry, the writing with the macron only being distinguished as a label for the pronunciations! I may not even be consistent myself - I may have done the entry for Lithuanian Alfrede wrong, as the two locative singular forms correspond to different citation (nominative singular) forms. --RichardW57m (talk) 14:53, 15 April 2024 (UTC)
Recent changes to the citation templates
User:JeffDoozan has been making a lot of changes to the citation templates as of late with their bot, AutoDooz. Some have introduced errors, but the change that I have to object to above all is changing parameter |1= to |lang=, and despite me bringing this up on their talk page, they went ahead with their script today, regardless. The rationale given for this change is that it unifies the citation templates with the quotation templates, but this argument is fallacious because these templates have very different purposes. For one, we only specify the language of a work if it isn't English, which, mind you, isn't even done at all in other citation formats, e.g. Harvard, Oxford, etc. Secondly, many journals are in multiple languages, and other works, like lists, are in no language at all. Making language a mandatory field for citation templates is all around a bad idea and was done with no community consensus. @Benwing2 -- Sokkjō18:04, 15 April 2024 (UTC)
Including a language is not mandatory, as discussed here. Works in multiple languages can use a comma-separated list of language ids in either |1= or |worklang=. JeffDoozan (talk) 18:12, 15 April 2024 (UTC)
But it is mandatory, because if you use the template with just the numbered fields, you have to at the very least leave the field blank. -- Sokkjō18:18, 15 April 2024 (UTC)
Definitely this should have been discussed more before implementation. However, it's not clear to me it's wrong. It is a bit strange to have an optional langcode parameter in |1= but I do understand on the one hand the desire to harmonize the interfaces of quote-* and cite-* and on the other hand the fact that we don't categorize citations, so it's not strictly necessary to have the language specified for all citations. I actually think having numbered params (other than something like |1= for a language code) for these templates, whether quote-* or cite-*, is a bad idea, because they're hard to make sense of when reading the wikitext and highly error-prone when creating the wikitext (esp. since there are a lot of them and every citation template is different from every other one). I proposed eliminating numbered params for the quote templates but some people objected, so I just eliminated them on the less-used ones and kept them e.g. for {{quote-book}} and {{quote-journal}}. Yes, it requires a bit more typing to use named params, but inserting a quotation or citation takes a lot of typing anyway, so the net effect isn't that much. Benwing2 (talk) 00:42, 16 April 2024 (UTC)
@Benwing2: What I and other users do is use the numbered parameters for quick inline citations, i.e. {{cite-book|year|author|title}}. Having language as a first parameter would mean we would have to do {{cite-book||year|author|title}}, leaving a blank field, which is very prone to error, either from people forgetting to do so, or from people misinterpreting it as a typo. Certainly when creating a reference template, I use full parameter names. -- Sokkjō01:48, 16 April 2024 (UTC)
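To make the blank-field concern concrete, here is a minimal Python sketch of how unnamed arguments line up with numbered parameters. It is not how MediaWiki itself parses templates; `parse_positional` is a hypothetical helper that handles only flat calls, and the layout in which |1= holds the language code is the one under discussion, not a confirmed specification.

```python
def parse_positional(call: str) -> dict[str, str]:
    """Map the unnamed arguments of a flat template call to "1", "2", ...

    Simplification: no nested templates, links, or escaped pipes.
    """
    inner = call.strip()[2:-2]            # drop the surrounding {{ and }}
    _name, *args = inner.split("|")
    positional = [a for a in args if "=" not in a]
    return {str(i): value for i, value in enumerate(positional, start=1)}


# Assumed layout: |1= language, |2= year, |3= author, |4= title.
with_blank = parse_positional("{{cite-book||year|author|title}}")
forgot_blank = parse_positional("{{cite-book|year|author|title}}")

print(with_blank)    # |1= is empty, the year sits in |2=
print(forgot_blank)  # everything shifts left: the year lands in |1=
```

If the blank field is forgotten, every value moves one slot to the left, so the year would be read as the language code.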
What is true about it is that the text in langcode is in less than optimal places, being placed after journal names, while we always encode the language of the journal piece. Fay Freak (talk) 18:24, 15 April 2024 (UTC)
Multiple Quotations
Reading WT:ELE in response to the call to bullet usage notes, I noticed the following text:
Quotations are generally placed under the definition which they illustrate. If there is more than one being provided, or where this is not possible (e.g., a very early usage that does not clearly relate to a specific sense of the word), a separate section should be used.
Are we meant to take this seriously? If so, is there any guidance on how (or whether) to relate quotations to senses? Or any good examples of so doing? --RichardW57m (talk) 14:59, 16 April 2024 (UTC)
My preference (and I don't think I'm the only one) is to eradicate ====Quotations==== sections by assigning the quotations to senses or, if it's truly unclear whether they use any of the definitions we're providing, then moving them to the cites page — how weird it would be if quotations of senses that don't meet CFI were given extra prominence with their own section in the main entry! If ELE says that merely having more than one quotation means they should not go under the definition anymore, that definitely needs to be changed, yikes! We have so many entries where a definition is supported by more than one quotation under it... :o (prior vote btw) - -sche(discuss)16:26, 16 April 2024 (UTC)
100% agreed. ===Quotations=== sections are unhelpful, even more so than ===Synonyms=== and the like because at least the synonyms sections (sometimes) identify which sense the given terms are synonyms of. Benwing2 (talk) 18:47, 16 April 2024 (UTC)
@-sche: Most lemmas don't have any quotations, though a high proportion (very roughly half) are backed up by dictionaries. (And most entries are non-lemmas without quotations.) I actually found it quite hard to find senses with two or more quotations. I suspect it's mostly those senses that have been challenged that have multiple quotations. For Pali I'd been working on the principle of keeping the best set of three with the sense, but I may not have had any extras that seemed worth putting on a citations page. I think the number three may simply have been the WDL requirement. --RichardW57m (talk) 16:56, 17 April 2024 (UTC)
@RichardW57m I think you're right about this; senses that are obviously attestable often get only one quotation to illustrate them, but those that may be challenged are more likely to get three. Benwing2 (talk) 21:36, 17 April 2024 (UTC)
Next steps
Alright, to move forward, I propose to do the following:
Quotations are generally placed under the definition which they illustrate. Where this is not possible (e.g. if a usage does not clearly relate to a specific sense), they should be placed on the Citations page. Less illustrative quotations, especially of senses that already have very many quotations under them, may also be put on the Citations page.
move that updated verbiage to a better area of the page (probably Wiktionary:Entry_layout#Definitions, introducing it like "Definitions may be illustrated by quotations."), and
remove the authorization of ====Quotations==== headers/sections. Phase such sections out, moving the contents of such sections to be either under relevant definitions or on Citations pages.
I note that the problematic text about not putting multiple quotes under the relevant sense has been unchanged since 2006, before we collapsed quotes and before the Citations: namespace existed!
WT:EL notes that it "should not be modified without discussion and consensus", so I want to make sure there's consensus. - -sche(discuss)00:10, 9 July 2024 (UTC)
Getting rid of wording that we haven't been paying any attention to seems essential. The Quotations header seems to sometimes house famous or interesting citations that are ambiguous with regard to definition. That kind of citation is a good thing to acknowledge, but doesn't help with the basic need to have our users understand our definitions. Citations space is the obvious place to put such a citation and it might be a worthy task to point out the ambiguity on the Citations page. DCDuring (talk) 01:45, 9 July 2024 (UTC)
Support wholeheartedly. I remove Quotations sections on sight if at all possible and I think our goal should be to have at least three cites under each sense. Andrew Sheedy (talk) 02:01, 9 July 2024 (UTC)
If the cites on the citations page are of one or more of the definitions present on the page, then the {{seeCites}} link would be moved under the definition(s) (where it belongs). If the cites are of sense(s) that don't meet CFI, then the template can just be removed; there's already a link to the Citations: page atop every page. - -sche(discuss)07:18, 10 July 2024 (UTC)
Currently, our Japhug entries are using IPA transcription as the main script. This follows almost all publications on (Kamnyu) Japhug, all of which are by Guillaume Jacques (and his co-authors). Guillaume Jacques' grammar on the language does mention: "The IPA orthography chosen to write Japhug in this grammar is probably not viable for use by native speakers, and an alternative writing system based on Tibetan script is preferable." The grammar does have a page and a little on how Tibetan script can be used for writing the language, though there aren't really many details given. After some discussion on Discord with @Thadh, we thought it best to bring it here to see if people have thoughts on how to proceed. To me, there are two possibilities. First is to keep the status quo and continue using IPA for lemmatization; Tibetan script could be added perhaps automatically to entries beside the headword. The second option would be to use Tibetan script for lemmatization, perhaps with IPA transcription as the romanization that appears beside headwords. — justin(r)leung{ (t...) | c=› }15:34, 16 April 2024 (UTC)
@RichardW57m: It's not us who developed the script, it's described in the grammar. And the language itself is not written at all, it's only attested in grammars and scientific papers. If anything, people are going to thank us for using something that even resembles human language rather than a scientific transcription. Thadh (talk) 15:59, 16 April 2024 (UTC)
People could also say that we're making things up with no proof, and we're chauvinists assuming that a language uses this script. We need to find people writing the language down, and see what script they use. CitationsFreak (talk) 16:26, 16 April 2024 (UTC)
The language uses no script. In my opinion it's better to use some script (especially when it is proposed already!) than use no script. If at some point we hear from the speakers that they developed an alternative orthography, we can always switch to that. Thadh (talk) 16:31, 16 April 2024 (UTC)
@Thadh: Is Japhug in the Tibetan script used to convey meaning? If not, the words in the Tibetan script don't meet CFI. And it doesn't seem to meet the principle of independence, so being an LDL should be irrelevant. --RichardW57m (talk) 16:27, 16 April 2024 (UTC)
If the language is not normally written, and the only orthography used when it is written (or written about, by scholars) is the IPA-ish transcription, then it seems like creating unnecessary hurdles to require people to figure out how to translate the words as they are actually written into someone's speculative "maybe if people wrote this in Tibetan script, they'd write it like this" orthography. I say we just continue using the attested (IPA-ish) orthography, until such time as a Tibetan orthography actually becomes used. - -sche(discuss)16:31, 16 April 2024 (UTC)
Lemmatising at a practical orthography is standard practice for dictionaries. I think using IPA would be a grave mistake, and would also give out a "we don't really care" attitude. Thadh (talk) 16:35, 16 April 2024 (UTC)
I agree with User:-sche here. Lemmatizing at what is at this point essentially a con-script seems worse in many ways than using the IPA that publications actually use. Benwing2 (talk) 18:44, 16 April 2024 (UTC)
Thanks for all your input. I sense that the general sentiment is to stick with the status quo and use IPA transcription for lemmatization purposes. I am sympathetic towards this option as well. (Sorry, Thadh!) The question now is whether to show a possible Tibetan orthography in entries at all, based on the schema described in Guillaume Jacques' grammar. — justin(r)leung{ (t...) | c=› }00:28, 18 April 2024 (UTC)
Because we don't want to make it easy for new normal users? I don't think synchronic or its morphs are part of the idiolect of more than 5% (1%?) of normal users. What we mean by surface analysis is not much better, but at least the words are understandable. Superficially would be much clearer, though it is too (negatively) evaluative. DCDuring (talk) 12:44, 17 April 2024 (UTC)
Solvable by simply linking to a glossary entry that explains the meaning. I doubt most visitors would know what a determiner is either, yet it would hardly be a reason to remove the correct linguistic term in favour of a fictitious alternative. Moreover our using a fictitious term gives readers the misleading impression that it actually exists. Nicodene (talk) 18:57, 17 April 2024 (UTC)
I generally support this but I'd like to hear how you propose to convert things over. Can we just replace all uses of {{surf}}? If not we will potentially get a messy situation with both {{surf}} and {{sync}} coexisting indefinitely. Benwing2 (talk) 23:34, 17 April 2024 (UTC)
Step one: {{surf}} is deprecated, step two: a bot converts {{surf|X|Y}} to {{sync|X|Y}} in cases where X and Y combined have the same spelling as the entry? (Allowing for the loss of a vowel at the end of X.) That should avoid most of the synchronically invalid combinations. Whatever is left will require human attention, unfortunately. I can take care of various European languages at least. Nicodene (talk) 05:24, 18 April 2024 (UTC)
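The orthographic test in step two can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the simplified two-argument form {{surf|X|Y}} as written in this thread (real calls may carry a language code and named parameters), {{sync}} is merely the target name proposed here, the vowel set is limited to unaccented Latin letters, and `spells_entry` and `convert` are hypothetical names rather than existing bot code.

```python
import re

VOWELS = "aeiou"  # assumption: plain Latin vowels only

def spells_entry(entry: str, x: str, y: str) -> bool:
    """True if X + Y spell the entry, allowing loss of a final vowel of X."""
    x, y = x.strip("-"), y.strip("-")
    if x + y == entry:
        return True
    return len(x) > 1 and x[-1] in VOWELS and x[:-1] + y == entry

# Matches only the flat two-argument form used in this discussion.
SURF = re.compile(r"\{\{surf\|([^|{}]+)\|([^|{}]+)\}\}")

def convert(entry: str, wikitext: str) -> tuple[str, list[str]]:
    """Rename matching calls; return the new text and the calls left for humans."""
    flagged: list[str] = []

    def repl(m: re.Match) -> str:
        if spells_entry(entry, m.group(1), m.group(2)):
            return "{{sync|%s|%s}}" % (m.group(1), m.group(2))
        flagged.append(m.group(0))
        return m.group(0)

    return SURF.sub(repl, wikitext), flagged


print(convert("boldly", "{{surf|bold|-ly}}"))
print(convert("surprising", "{{surf|surprise|-ing}}"))
print(convert("again", "{{surf|on-|gain}}"))
```

Under these assumptions the first two calls are renamed, while the again case is left untouched and returned in the flagged list for human attention.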
Note that {{sync}} already exists as a redirect to Template:syncopic form. (Personally, I think "sync" seems too much like the name for a template for syncing entries, and I'd rather find another short name for both "syncopic" and "synchronic" if possible.) IMO, if we want to change the name or wording of {{surf}}, I don't know if it makes a difference (in the long run) whether we slowly change uses of one over to the other, or just move (redirect) the existing template to the new name and update the wording (and potentially switch all uses over to the new main name all at once). As far as I have been able to tell, even going by our glossary definitions of these things (which say a surface analysis is a synchronic one), the places {{surf}} should be used and the places "{{sync}}" should be used seem to be identical. It's true that {{surf}} is currently used in some places it should not be, e.g. again says "By surface analysis, on- + gain (“against”)" which is wrong (no speaker who is unaware that again comes from Middle English thinks that it looks on the surface like it was formed in modern English by combining "on-" and "gain"! this should be a "For more, see..." set of links or something)...
but A) such improper uses need to be cleaned up whether we move/rename/switch the template or not (and basically the same process that Nicodene outlines for switching between templates would work even if just making a list of all uses of {{surf}} and progressively removing valid ones, or invalid ones after correcting them), and B) like people use {{surf}} in wrong places, people would undoubtedly also use {{sync}} in wrong places, particularly because DCDuring is right that "synchronic" would be an even more opaque name to many people (and the short name "sync" just makes it seem like a template to sync entries or sync links to entries or something), so if they see {{sync}} in entries, I expect the main takeaway for at least some people will just be "this is the way to link whatever words are closest to being the parts that make up the word in question", and that's how they'll use it, even in places we don't want it used, like again. That itself doesn't necessarily mean we shouldn't use "synchronic"; lots of other words we use are unfamiliar to laypeople, as Nicodene says; but it means I wouldn't expect renaming the template to be any help as far as making people only use it in the right places. I am actually sympathetic to DCDuring's point that the word clearest to laypeople would probably be "superficial", and indeed it's not hard to find linguistics literature discussing "superficial analysis" of how words are formed vs their actual etymologies, so maybe we should consider that. Heck, what if we made the wording something like "Superficially or synchronically derivable from x + y"? (We could probably even make it so that users could optionally suppress seeing "superficially or", in the same way people can opt out of or opt in to seeing {{,}}, so the 'clearer' word would be shown to laypeople, but logged-in linguists could choose to only see "synchronically".) - -sche(discuss)16:41, 18 April 2024 (UTC)
@Nicodene: I agree that we should try to reduce ambiguity but as others have said we should be avoiding jargon. How about this wording?
boldly -> "Analyzable as bold + -ly." (synchronic)
outrage -> "Etymologically equivalent to ultra + -age." (diachronic)
The template names could be {{etyeq}} and {{anz}}.
It is worth noting that the ‘correct’ (synchronically valid) context for using {{surf}} is effectively identical to the correct context for using {{affix}} and {{compound}}. (The only difference seems to be that {{surf}} has been used alongside derivations from other languages, as in ‘from French kilomètre, by surface analysis kilo- + metre’. Changing the wording of {{affix}} and {{compound}} to ‘composed of X + Y’ would cover these and all other use-cases, I think.) That means we could clean up the whole mess like this:
2) Have a bot replace {{surf}} with {{affix}} or {{compound}} using orthography as a guide. (If no affixes are involved, the bot assumes it's dealing with a compound.) The remaining transclusions are dumped into a list.
3) (Optional:) if said list is quite long, identify repeating patterns and have a bot clean those up. For instance if a Latin noun lemma ends in x, derivatives will often have ci or gi instead (pacificus < pax + -ficus).
4) Manually review whatever is left, assigning {{affix}} or {{compound}} where appropriate.
That leaves us with etymological relations like husband < house + bond. These can either be deleted outright (they are a bit silly, really) or else assigned to a new template, as @Ioaxxere describes. In the latter case I would favour a wording like ‘etymologically corresponds to X and Y’ to avoid implying any kind of synchronic validity. Nicodene (talk) 23:01, 18 April 2024 (UTC)
@Nicodene There's actually no need to use {{compound}} as {{affix}} handles compounds already. However I don't think it will work to just replace {{surf}} with {{affix}}; {{surf}} includes additional text to note that it's "surface etymology" aka synchronic. Replacing it as proposed will remove that information and lead to etym sections that don't read properly. However, I agree that cases like "husband < house + bond" should just be deleted; I don't really see the point. Benwing2 (talk) 23:05, 18 April 2024 (UTC)
Yes the conversion is a bit tricky because of the difference in wording.
From a look through the transclusions of {{affix}} I see we have the same diachronic issues there as we do with {{surf}}. One of the first transclusions of {{affix}} that comes up is the (rather brave) month < moon + -th.
Perhaps we can begin our cleanup with {{affix}}? The procedure would be something like this:
1) Change the wording to ‘composed of X + Y’. (Compatible with the use-cases of {{surf}}.)
2) Have a bot run through the transclusions of {{affix}}, removing preceding text like ‘from’/‘equivalent to’ and dumping entries where X + Y ≠ ⟨spelling of lemma⟩ into a list for further review.
@Nicodene I think {{af}} should not be preceded by any text. If we want a version of {{af}} preceded by text, it should be a separate template. We already ran down this road with {{bor}} and {{inh}}, which at one point had preceding text "Inherited from" and "Borrowing from" (note, not "Borrowed from" as you'd expect) and was later switched to not have that text. So I would advocate a separate template to express surface/synchronic derivations — just like we already have, except maybe it should be renamed and the wording corrected. Benwing2 (talk) 01:11, 19 April 2024 (UTC)
Great. Since a cleanup of {{affix}} transclusions doesn't require any template change, perhaps that can be the testing ground? Not sure how effective the aforementioned ‘orthographic method’ will be, even if the bot allows for the loss of a vowel at the end or beginning of a morpheme and accounts for alternations like x~ci/gi. If I had a list of flagged words I could comb through it, looking for additional patterns to teach to the bot until it can cut the list down to a size that humans could deal with manually. Nicodene (talk) 01:43, 19 April 2024 (UTC)
A list of {{affix}} transclusions where the combined components do not orthographically make the lemma (moon + -th = *moonth ≠ month). Excluding cases where there is a discrepancy because a written vowel is lost (surprising < surprise + -ing) or x alternates with c(i)/g(i) (vocifer < vox + -fer).
Then I look through the list and find other rules for things to exclude (e.g. ad- + lumino = allumino, because /d/ in that prefix tends to assimilate). The goal is, eventually, to cut the list down to only invalid cases like ‘month = moon + -th’. Nicodene (talk) 05:56, 19 April 2024 (UTC)
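The iterative workflow described here (flag mismatches, spot a pattern, teach it to the bot, repeat) could look roughly like the Python sketch below. All function names are hypothetical, the rules are deliberately crude (the ad- rule simply doubles the following consonant, and vowels are limited to unaccented Latin letters), and the sample entries are the ones mentioned in this thread.

```python
from typing import Callable, Iterable, Iterator

VOWELS = "aeiou"  # assumption: plain Latin vowels only

def plain(x: str, y: str) -> Iterator[str]:
    yield x + y

def vowel_loss(x: str, y: str) -> Iterator[str]:
    # surprise + -ing -> surprising
    if len(x) > 1 and x[-1] in VOWELS:
        yield x[:-1] + y

def x_alternation(x: str, y: str) -> Iterator[str]:
    # vox + -fer -> vocifer: stem-final x alternates with c(i)/g(i)
    if x.endswith("x"):
        for stem in ("c", "ci", "g", "gi"):
            yield x[:-1] + stem + y

def ad_assimilation(x: str, y: str) -> Iterator[str]:
    # ad- + lumino -> allumino: /d/ assimilates to the next consonant
    if x == "ad" and y and y[0] not in VOWELS:
        yield "a" + y[0] + y

# New patterns found while combing the list are appended here.
RULES: list[Callable[[str, str], Iterator[str]]] = [
    plain, vowel_loss, x_alternation, ad_assimilation,
]

def is_explained(lemma: str, parts: tuple[str, str]) -> bool:
    x, y = (p.strip("-") for p in parts)
    return any(lemma == cand for rule in RULES for cand in rule(x, y))

def flag(entries: Iterable[tuple[str, tuple[str, str]]]) -> list[str]:
    """Return the lemmas that no rule can spell from their components."""
    return [lemma for lemma, parts in entries if not is_explained(lemma, parts)]


sample = [
    ("month", ("moon", "-th")),
    ("surprising", ("surprise", "-ing")),
    ("vocifer", ("vox", "-fer")),
    ("allumino", ("ad-", "lumino")),
]
print(flag(sample))
```

Keeping each exception as its own small rule means the list of flagged lemmas shrinks as rules are added, without touching the comparison logic.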
@Nicodene: {{surf}} is used when the term wasn't actually formed within the language, but can still be analyzed as though it were. Saying that English binary is actually "from" bin- + -ary, for example, would be completely ridiculous. Merging the templates would lead to important information being lost. Ioaxxere (talk) 17:53, 19 April 2024 (UTC)
@Ioaxxere No, it wouldn't. The proposed change is {{af+|bin-|-ary}} ‘composed of bin- + -ary’, which is correct, and the preceding ‘from Late Latin binarius’ will still be there. Also {{affix}} is used in exactly the same way (cf. the entry anti-Semitism) so no, this isn't showing any difference between the templates.
Also the conceptual dividing line isn't actually clear in most cases, as discussed previously regarding boldly, where (I'd argue) the assumption that the word was passed down in an unbroken chain across a thousand years, and never reformed from its components, is completely ridiculous. Even so, nothing about the proposed wording for {{af+}} affects this one way or another. Nicodene (talk) 18:53, 19 April 2024 (UTC)
You're right that {{af}} is used ambiguously, but that doesn't mean we should be converting a precise template ({{surf}}) to a vague one ({{af}} / {{af+}}). It's not about the wording but about the wikitext itself losing information. Ioaxxere (talk) 20:42, 19 April 2024 (UTC)
There is nothing precise about the Wiktionary-ism ‘surface analysis’, as all the preceding discussions about what it should mean have shown.
Zero information is being lost, because the preceding text (‘from Late Latin binarius’ and such) is not going to be deleted. Nicodene (talk) 21:15, 19 April 2024 (UTC)
Okay let’s simplify things. Cleaning up the entire website’s (mis)uses of affix or surf all at once would be a gargantuan task.
Far simpler proposal: have a bot replace {{surf}} with {{af+}} which generates a preceding ‘derivable from’. This phrasing strikes me as jargon-free but still precise. Thoughts? Nicodene (talk) 03:22, 20 April 2024 (UTC)
@Atitarev @Benwing2: Currently the Russian clitic entries barely have any presence in English Wiktionary. But it's probably possible to borrow many of them from Russian Wiktionary.
I have also corrected one inconsistently formatted entry there. Do the missing headword entries just need to be created in English Wiktionary? Or, similar to how it's done in Russian Wiktionary, could some kind of special template be introduced for listing clitics in the parent word entries?
BTW, as a person from Belarus, I personally find a lot of these clitics weird and unnatural to varying extents. I understand the reason and necessity of the accent pattern in до́ смерти(dó smerti) or на́ хуй(ná xuj), because these word pairs are not to be interpreted literally and have a different sense of their own. But many others just feel to me like somebody is trying to sound deliberately poetic or archaic when reciting a fairy tale or something. And, for example, I doubt that anyone from Belarus would ever say "до́ дому" in their Russian speech, maybe because of the influence from the Belarusian дадо́му(dadómu)? The whole concept of the prepositions stealing stress from the next word doesn't exist in the Belarusian language. I understand that it's the standard Russian pronunciation norm and has to be acknowledged as such; I'm just trying to say that my Russian language competence is definitely lacking in this area. I'm primarily interested in the Russian clitics for the purpose of correctly handling them in the auto-accenting Lua module. --Ssvb (talk) 04:18, 17 April 2024 (UTC)
@Ssvb I am not a native Russian speaker; I'm sure Anatoli can answer better. I'll just note that Zaliznyak's grammatical dictionary notes the occurrences of such stress-stealing in the headwords of each word where it occurs. I would not necessarily recommend creating entries for all such combinations unless they are idiomatic. I think it's enough to note them in a usage note in the noun. I'm not sure if we need a special template for this, esp. since I imagine the conditions under which this stress-stealing occurs are rather varied and some of the expressions may be archaic, poetic, etc. as you note. Benwing2 (talk) 04:31, 17 April 2024 (UTC)
As an example, under нога it has a diamond symbol (indicating special usages) followed by this:
Not really. Maybe you can replace ходить with идти, but it has the same conjugation forms other than the infinitive. Ехать по воду (to drive for getting water) is also possible, but rare. Tollef Salemann (talk) 10:11, 18 April 2024 (UTC)
As someone who created a Russian clitic entry, I wonder if it is really worth doing, because it is not really a thing and it depends heavily on dialect. We can of course take the clitic material from a dictionary, but it is going to be useless in about 20-40 years (as for modern Russian dialects, it is useless anyway). I mean, it is important to record clitics, but it does not seem as easy as the dictionaries say. Tollef Salemann (talk) 06:49, 17 April 2024 (UTC)
@Tollef Salemann: These things are not reflected in spelling, so pronunciation may have diverged in different regions. Still only Moscow determines what is considered to be the official standard pronunciation. And "за́ руку" seems to be legit. But "на́ берег" - not so much and seems to primarily exist because of "Выходила на́ берег Катюша". --Ssvb (talk) 17:57, 17 April 2024 (UTC)
Moscow pronunciation is not the same as "standard" Russian pronunciation. The dictionaries often differ, and they are being updated. Different primary schools may use different dictionaries as well. So the clitics seem like a mess to me. Tollef Salemann (talk) 10:16, 18 April 2024 (UTC)
@Ssvb Under берег, Zaliznyak says "на бе́рег // на́ берег" which seems to mean both are possible but the first is more common. It is similar to the entry directly above for снег (the entries are sorted alphabetically from the end), which reads "по сне́гу // по́ снегу". Note also that sometimes the part after the // is enclosed in brackets, presumably meaning that variant is dated or dialectal or something. Benwing2 (talk) 04:26, 19 April 2024 (UTC)
@Ssvb: How would these be clitic entries? They look like clitic plus noun phrases, presumably suggested for inclusion because of idiomatic meanings or possibly (though I'm not sure of the validity of so doing) because they are not readily recognised as such. --RichardW57m (talk) 15:36, 17 April 2024 (UTC)
@RichardW57m: They are relevant and deserve to be documented because they have different pronunciation at least by some speakers (the true authentic Russians). But unless they are really idiomatic, they don't need their own headword entries each (I agree with User:Benwing2). Please disregard the red links in my starter comment, they are a red herring. --Ssvb (talk) 18:10, 17 April 2024 (UTC)
Having heard my Russian from Sovietized Qazaqstani Germans and Tatars, most, save distinct idiomatic ones, which are figurative but not literal uses of the vulgarities and apparently за́ бок (zá bok) which I only now hear, seem optional to me, the oftener I try to recall them, up to individual preference or even mood, some stilted, similar to Ssvb.
и́зо дня (ízo dnja) seems like an archaism and бронь (bronʹ) I have never heard, and also not a coincidence that I never heard the set phrase за версту (za verstu) in either stress. Interestingly it shows that some such phrases, including при́ смерти (prí smerti), are idiomatic only in some regions and registers of the Russian language area (but до смерти (do smerti) has optionally either stress for me and is idiomatic anyhow).
I also think that some of the phrases only have peculiar meaning and stress due to using a particular, stressed, sense of the preposition, namely за́ зиму (zá zimu) and за́ ночь (zá nočʹ) and на́ ночь (ná nočʹ) (all quite obligatory), which also constitutes the reason of the said figurative senses stress-stealing.
There seems to be an overlooked, hardly satisfactorily described, part of Great Russian grammar that figurative prepositional phrases switch stress to attain emphasis for the figure. Fay Freak (talk) 22:21, 17 April 2024 (UTC)
I think it's important to include {{&lit}} in all these collocations, so that people are aware that non-idiomatic meanings are also possible.
Let's look at на́ ноги (ná nogi) vs на но́ги (na nógi) - "on(to) legs/feet"
I would use the latter when talking about putting on (shoes, pants) or if something is placed on legs/feet (the former is also OK in this case) but in the expression встава́ть на́ ноги (vstavátʹ ná nogi) "to get to one's feet" (both literally and metaphorically), stressing the preposition (the former) would sound more natural. Anatoli T. (обсудить/вклад) 00:55, 19 April 2024 (UTC)
@Ssvb: Thanks for the ping. I think many of the clitics listed can be created but they have to be filtered case by case. Just a few things to consider:
бе́з соли (béz soli) sounds weird. I never heard it. без со́ли (bez sóli) is not an expression, IMO.
Both до до́му (do dómu) and до́ дому (dó domu) are valid. The latter sounds a bit rustic or folkloric.
Both за́ голову (zá golovu) and за го́лову (za gólovu) are valid. The same is true for many cases.
{{&lit}} can be used to clarify that a term can be both idiomatic and unidiomatic. Many clitics will fall into that.
Off-topic, your comment "BTW, as a person from Belarus, I personally find a lot of these clitics weird and unnatural to various extent." surprises me. It's great if your Belarusian is better than Russian but unfortunately, it seems not many Belarusians mastered their own language. I heard 6% of Minsk citizens are fluent in Belarusian. I follow news from Belarus and many Belarusians gave interviews or answered reporters' questions in perfect Russian. Also, you can be even arrested for showing your preference to speak Belarusian over Russian. Anatoli T. (обсудить/вклад) 01:14, 18 April 2024 (UTC)
Hey, I'm saying "béz soli", but "za nógi". They both ain't no really expressions, why to include such stuff? Only because the clitics? But there are thousands of them. Tollef Salemann (talk) 10:05, 18 April 2024 (UTC)
@Tollef Salemann: I have never heard "бе́з соли" myself, but it is mentioned in Ushakov Dictionary. Why to include such stuff? It's useful to inform the users that the stress pattern may be unusual in some cases. And also stress can be marked automatically in quotations, the Lua module just needs to identify tricky cases and avoid marking stress in them. For now I can take the list from Russian Wiktionary, but it would be great to be able to rely only on the information from English Wiktionary alone. Russian Wiktionary lists less than 200 of them. English Wiktionary currently mentions them in notes of the declension tables, e.g. the "нога́" entry. --Ssvb (talk) 13:22, 18 April 2024 (UTC)
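The idea of letting an auto-accenting routine skip the tricky cases can be illustrated with a minimal Python sketch. It is not the actual Lua module: `STRESS_SHIFT_PAIRS`, `DEFAULT_STRESS` and `auto_accent` are hypothetical names, and the tiny word lists are placeholders built from examples in this thread.

```python
# Hypothetical exception list: (preposition, noun form) pairs where the stress
# may move onto the preposition. The pairs are examples from this thread.
STRESS_SHIFT_PAIRS = {("за", "руку"), ("на", "ночь"), ("до", "смерти")}

# Hypothetical stand-in for a real stress dictionary.
# "\u0301" is the combining acute accent used to mark stress.
DEFAULT_STRESS = {"руку": "ру\u0301ку", "книгу": "кни\u0301гу"}


def auto_accent(text):
    """Add default stress marks, leaving listed preposition + noun pairs unmarked.

    Simplification: input is assumed to be lowercase, space-separated words.
    """
    words = text.split()
    result = []
    for i, word in enumerate(words):
        previous = words[i - 1].lower() if i > 0 else ""
        if (previous, word.lower()) in STRESS_SHIFT_PAIRS:
            # Tricky case: do not guess, leave the word without a stress mark.
            result.append(word)
        else:
            result.append(DEFAULT_STRESS.get(word.lower(), word))
    return " ".join(result)
```

With this sketch, "вижу руку" gets the default stress mark on "руку", while "взял за руку" is returned unchanged so that an editor can decide between the two stress patterns by hand.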
Under год, it says "нá год; зá год; с го́ду нá год; го́д о́т году ; из го́да в го́д; бе́з году неде́ля". I don't know what без году неделя ("a week without a year"?) means. Benwing2 (talk) 01:26, 19 April 2024 (UTC)
@Benwing2: Thanks. бе́з году неде́ля (béz godu nedélja) is a jocular, often derogatory expression meaning a short time for something requiring longer time. For example when someone claims to have a lot of experience, even if they have worked in that area "a week without a year" (which makes it a negative duration). Pls check some quotes at Russian Wiktionary. Anatoli T. (обсудить/вклад) 02:55, 19 April 2024 (UTC)
@Atitarev: If most of Wiktionary quotations and usage examples are consistently formatted (making them machine readable), then this information can be potentially used as a part of the training data for improving various AI models. Including, but not limited to, Google Translate. In other words, Google Translate can learn from Wiktionary, but not the other way around. --Ssvb (talk) 07:19, 19 April 2024 (UTC)
@Ssvb: Interesting perspective but I didn't mean to take the translation from Google Translate but shared my observation for people who don't know Russian and are not familiar with this expression. Google Translate did a poor job in this case and people shouldn't try to use it in this particular case (to understand its meaning). Even the literal translation doesn't quite clarify what it means, IMO. Anatoli T.(обсудить/вклад)07:00, 20 April 2024 (UTC)
@Atitarev: I apologize for the misunderstanding and I didn't imply that you suggested that. My point is that Google Translate (or its alternatives/replacements) will definitely improve in the future. And Wiktionary, among other things, may be instrumental in making this happen. Which brings another aspect: AI art is currently resented by the artists, who believe that the AI is plagiarizing their work and destroying their jobs. And in a similar fashion, Google Translate may take advantage of the work of the Wiktionary editors without giving them any credit. Google Translate may eventually learn from your без году неделя entry and start translating it correctly, which is a good thing for the humankind in general. But the question is: how would you feel about this? Some people may prefer to make entries machine readable, the others may prefer to make them deliberately obfuscated as a way to combat the AI. My personal opinion is that the latter would be futile and counterproductive. And by the way, I have no horse in this race, as I'm not employed by Google Translate or by any similar services. --Ssvb (talk) 07:47, 20 April 2024 (UTC)
@Ssvb: No worries at all. I was just sort of amused as if we need to help Google to improve their algorithms. I am easy about it, though. If our entries help everyone, it's only better. I took them from Reverso, anyway. Not sure if I need to quote. Anatoli T.(обсудить/вклад)07:53, 20 April 2024 (UTC)
I made these three edits this morning, each indicating in a slightly different way that a given entry is the INN spelling of the name of a drug. I wonder if we should have a standalone template for this. If not, could the {{altsp}} template be made such that we could put "INN" in the from field and have it link to Wikipedia or to the glossary? Best regards, —Soap— 08:58, 17 April 2024 (UTC)
And we could also use {{altform}}, which is perhaps more accurate. At first I avoided altform because i thought its labels could only be parenthemes, but it seems that I can type
So if we could only have INN link to Wikipedia (or to a glossary entry, if we prefer), I think this would be the best solution of all. Ideally, it would also categorize just as labels like US do. Then anyone could see all the INN spellings (at least the ones we get to) all laid out in a list. —Soap—09:15, 17 April 2024 (UTC)
{{alt spell}} is intended for cases that are really just spelling variants with no pronunciation differences. For these purposes I don't think it matters if some dialects merge /t/ and /θ/, I would still use {{alt form}}. Benwing2 (talk) 22:26, 18 April 2024 (UTC)
Thank you for doing that, as it provides a lot of information for the reader, and the etymology section is a lot more expansive than the headword line. But honestly what I wanted most was to get these terms into a category, such that a curious reader could browse through all of them at once, and using the etymology section won't do that .... we could put a template within the etymology section that would, but then that template could just as well go into the definition line. I dont want to say "no thanks" because it's up to the community, not to me ... but i think we should do something, and ideally I'd like to have the INN terms listed in a category. Best regards, —Soap—17:20, 21 April 2024 (UTC)
@Soap: Yes, I agree that it would be good to have a category for these spellings. How feasible would it be to have a specific etymology template do this? I envisage something like {{INN respelling|language code|original spelling}}. {{INN respelling|en|besylate}} and {{INN respelling|en|cylexethyl}} would generate the etymologies as I wrote them whilst also adding the pages to Category:English INN respellings; I would imagine that automatic string analysis could be used to specify which substitutions had taken place (y → i for besylate → besilate; y → i and th → t for cylexethyl → cilexetil). 0DF (talk) 17:35, 21 April 2024 (UTC)
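As a rough illustration of the "automatic string analysis" idea, here is a minimal Python sketch using the standard library's `difflib`. The function name `respelling_substitutions` is hypothetical, and nothing here reflects an existing Wiktionary module; it is only a character-level diff.

```python
from difflib import SequenceMatcher


def respelling_substitutions(original, respelled):
    """Return (old, new) pairs for every stretch where the two spellings differ."""
    matcher = SequenceMatcher(None, original, respelled, autojunk=False)
    substitutions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For insertions or deletions one side of the pair is an empty string.
            substitutions.append((original[i1:i2], respelled[j1:j2]))
    return substitutions


print(respelling_substitutions("besylate", "besilate"))
print(respelling_substitutions("cylexethyl", "cilexetil"))
```

For besylate → besilate the diff reports the single pair ("y", "i"). For cylexethyl → cilexetil it reports ("y", "i") and ("hy", "i"): a plain character diff merges the adjacent changes and does not know that "th" is a unit, so recovering "th → t" and "y → i" would need extra rules, for example a list of known INN substitutions.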
well, i liked my idea better ... the idea of putting the information on the definition line like we do with other alt forms. I dont know if it really should be considered a separate etymology. but i dont want to put up a fuss either. basically i was hoping other people here would have something to say. one thing i can add, as an unrelated point, is that not all of this is just about spelling .... for example, the INN uses levmetamfetamine whereas the common English name of the drug is levomethamphetamine, with an extra -o-. Indeed the lev- prefix seems to substitute for levo- everywhere, which i find odd but it is what it is. —Soap—22:04, 27 April 2024 (UTC)
@0DF Yeah I suspected what User:Soap says, that this isn't necessarily something auto-computable; and I don't think we need a separate template, I think a label would be enough. Benwing2 (talk) 23:19, 27 April 2024 (UTC)
I have some concerns about the current classification system for Regional Hokkien. The system appears to be based more on administrative divisions than on linguistic, particularly phonological, relationships. For example, the current classification treats Tong'an dialect as a branch of Xiamen dialect. While administratively Tong'an District is part of Xiamen City, the narrowly-defined Xiamen dialect (specifically the dialect of Xiamen City center, which is spoken in the southwest of Xiamen Island) actually belongs to the Zhangdong branch (漳東腔), differing from the true Tong'an dialect, which is used in Tong'an District, Xiang'an District, and Kinmen County.
If we were to simplify and categorize Hokkien into just three dialects—Quanzhou, Zhangzhou, and Xiamen—based solely on major geographical areas, it might be feasible. However, treating these as three distinct major divisions fails to accurately reflect the relationships among various other dialect points.
To address these issues, I propose adopting a new system based on the Dialectal Atlas of Southern Min (閩南地區方言地圖集) by Professor Ang Uijin. This system classifies the core Southern Min dialects into eight branches: Tong'an, Quanhai, Quanshan, Quanzhong, Zhangdong, Zhanghai, Zhangnan, and Zhangshan.
The first four can be collectively referred to as "Quan dialects" (泉系方言), and the latter four as "Zhang dialects" (漳系方言).
Additionally, considering geographical, historical, and political factors, Taiwanese Hokkien could be established as a separate branch, with its internal dialects also categorized under these eight branches or their sub-branches. For example, Lukang Hokkien could be categorized under the Quanzhong branch.
This new classification system would not only allow for a more precise relationship between the dialects but also maintain expandability for future adjustments. We do not need to immediately add all sub-branches, but I believe this approach could significantly improve the way Hokkien dialects are related and classified in Wiktionary.
The proposed classification system (excluding Taiwanese)
I would appreciate feedback on the proposed classification system from other editors. If you have a moment, please take a look and share your thoughts.
@TongcyDai I am probably at least partly responsible for the current system; I've been trying to clean up the various Chinese lects and what we have results from what was there before (even messier) along with some changes I've attempted to make based on Wikipedia. I have no attachment whatsoever to the current system and would welcome some cleanup from someone who better understands the dialect situation. Also pinging @Wpi, ND381 who might have thoughts. Benwing2 (talk) 21:02, 17 April 2024 (UTC)
@Benwing2 I sincerely appreciate the substantial efforts you have dedicated to organizing and refining the classifications of Chinese lects on Wiktionary. It's clear that such a task is complex and challenging, and the progress achieved, even though not yet perfect, has greatly improved upon what was previously in place. Your dedication to enhancing these entries is invaluable to the community.
Thank you for your openness to new approaches and improvements. While referencing Wikipedia has been a good starting point for improving the Hokkien classifications, I have noticed some discrepancies and occasional inaccuracies across various aspects. This observation has inspired me to propose some adjustments based on specialized academic works, aiming to further refine our classification system on Wiktionary. --TongcyDai (talk) 06:44, 18 April 2024 (UTC)
@TongcyDai Of course. My only comment would be that I am generally in agreement with Justin that some of the intermediate nodes might not be needed, as they are generally the most controversial. Benwing2 (talk) 07:24, 18 April 2024 (UTC)
@TongcyDai Thanks for pointing this out. I am generally in favour of what you propose, although we might not need to go as detailed as the 片 level from Ang Uijin. I am also curious to know if there are better English translations of these names; are they taken from Ang or translated by yourself? — justin(r)leung{ (t...) | c=› }02:25, 18 April 2024 (UTC)
Thank you for your supportive feedback. I agree that while we may not need to delve into the level of detail as outlined by Ang Uijin, maintaining a broad framework would indeed be beneficial.
Regarding the translation of the dialect names, I translated them myself with Hanyu Pinyin, following Wiktionary's conventions, as no English equivalents were provided in the original Dialectal Atlas of Southern Min.
The atlas, including its maps and appendices, does not provide translations but does assign codes to different branches and dialect points, such as the Zhanghai dialect ("Jc"), Zhangpu subdialect ("Jc1"), and Qianting sub-subdialect ("Jc1.3"), where "J" presumably stands for Zhangzhou and "c" might denote coastal, although the specific romanization scheme used is unclear.
Additionally, the use of identical names at different hierarchical levels, such as Tong'an dialect (同安腔方言), Tong'an subdialect (同安話), and Tong'an sub-subdialect (同安腔)—with only their Chinese suffixes varying—presents a challenge for listing in Wiktionary. This overlap necessitates careful consideration of how to name these similarly titled branches to ensure clarity and adherence to Wiktionary's standards. If we need to classify these categories in such detail, it is crucial that we develop a systematic approach to differentiate and label them appropriately. --TongcyDai (talk) 07:28, 18 April 2024 (UTC)
@TongcyDai What is the difference between the three levels of Tong'an? If this is related to the dialect of an urban core vs. a larger grouping, one possibility is to use qualifiers like "Urban ...". This is what we've done with "Urban Shanghainese Wu" vs. just "Shanghainese Wu". Contrarily we also have both "Beijing Mandarin" to refer to the dialect of the city of Beijing and "Beijingic Mandarin" to refer to a larger grouping that includes Beijing and several other dialects, although I'm not entirely happy with this naming. Benwing2 (talk) 22:23, 18 April 2024 (UTC)
If Ang Uijin is a peer-reviewed authoritative source, then adopting his classification is a reliable decision. I cannot provide any feedback regarding the internal subclassifications. Since many Taiwanese dialects tend to lean towards Zhang or Quan, or a mix of the two in terms of pronunciation, would it be possible to find a way to integrate them among the current classification or does the unique vocabulary of Taiwanese constitute keeping all of them in a different branch? Kangtw (talk) 08:33, 19 April 2024 (UTC)
@TongcyDai The well-known three-way split is often unhelpful, even if applied sensibly. But does an Ang-based eight-way split add clarity?
First off, a question that should not go unanswered is why dialect classification should be based on phonology alone.
Second, what is the benefit of labelling (say) an Amoy form “Zhangdong” instead of simply “Amoy” (or “central Amoy”, etc.)? Cities, towns & villages are much more objective & unbiased points of reference; “Zhangnan” or “Quanhai” are processed. Ang’s scholarship is (very) insightful, but the reality of dialect variation is messy. A fancy categorization scheme would seem to add to the mess.
Third, “overseas” (incl. 浙江) dialects of Hokkien are not being addressed, incl. the Penang-Medan dialect, arguably the only dialect of Hokkien that’s not dying. Ironically, the reason why ASEAN dialects of Hokkien are typically excluded from such discussions (while Taiwanese is often included) seems to be that they’re not spoken under Chinese (Tionghoa) administration.
All this said, it definitely makes no sense for 同安 Hokkien to be classified under “Xiamen”. Maybe we should first try to understand how that misclassification even came about, and when.
A very real problem is that our editors (some of whom don’t even speak the language) in aggregate seem to have scant access to all but a few Hokkien-speaking community-locales anyway. And of course that’s okay. But there’s no need for us to pretend to have the entire Hokkien-speaking world covered.
Here is an article Prof. Âng Ûi Jîn himself posted on Facebook in 2021 that suggests how futile it is to box Hokkien into, say, eight phono-dialects. Pay special attention to discussion of the so-called 漳山 dialect or accent and the form given by Lîm Kiàn Hui.
And also a quote from another thread in this month's Beer Parlour that gets at another facet of the problem:
"We don't have a good factual basis for maintaining relatively fine distinctions, so a broad label that is subject to criticism, but defensible, is probably better than narrow ones, which are also subject to criticism." 釆 (talk) 11:41, 24 April 2024 (UTC)
@釆 Just curious, do you have an alternative suggestion? We have to do *something* (although "maintain the status quo" counts as "something" :) ...). Note that it is possible to give a specific dialect more than one parent. This is done in the current scheme with Lukang, which is both a Taiwanese and a Quanzhou dialect. The first parent determines the breadcrumb trail displayed at the top of the page. Benwing2 (talk) 21:15, 24 April 2024 (UTC)
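The point about multiple parents can be sketched in a few lines of Python. The `PARENTS` table and both function names are hypothetical, and the lect names are simplified from this discussion rather than taken from Wiktionary's actual module data.

```python
# Hypothetical parent table; the first listed parent is the primary one.
PARENTS = {
    "Lukang Hokkien": ["Taiwanese Hokkien", "Quanzhou Hokkien"],
    "Taiwanese Hokkien": ["Hokkien"],
    "Quanzhou Hokkien": ["Hokkien"],
    "Hokkien": [],
}


def breadcrumb(lect):
    """Follow only the first parent at each level, from the root down to the lect."""
    trail = [lect]
    while PARENTS.get(trail[-1]):
        trail.append(PARENTS[trail[-1]][0])
    return list(reversed(trail))


def all_ancestors(lect):
    """Collect every ancestor reachable through any parent."""
    found = set()
    for parent in PARENTS.get(lect, []):
        found.add(parent)
        found |= all_ancestors(parent)
    return found
```

Here `breadcrumb("Lukang Hokkien")` gives the trail Hokkien, Taiwanese Hokkien, Lukang Hokkien, while `all_ancestors("Lukang Hokkien")` still includes Quanzhou Hokkien, so the second parent is kept for classification even though it does not appear in the trail.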
disambiguation of links to Wikipedia
If a Wiktionary page has multiple definitions where further reading on Wikipedia would be useful, maybe each Wikipedia link should be adjacent to that Wiktionary definition, instead of all the Wikipedia links getting shuffled into a separate section? For example (copy-paste from Wiktionary:Tea_room/2024/April#API):
I like this idea, because to my mind it is more useful, direct, and intuitive to users. My gut will not be surprised if other Wiktionarians dislike it. The Beer parlour ("for policy discussion and cross-entry discussion") is the correct place to propose it, rather than the Tea room ("for questions concerning particular words"). Quercus solaris (talk) 16:35, 10 April 2024 (UTC)
A follow-up thought: In my opinion, if Wiktionary adopts this idea, then it would be best to use significant case for the WP link, instead of sentence case, irrespective of WP's page titles being sentence case. I have a solid reason for favoring this approach, which I can share if anyone requests it. TLDR: it's better for the Wiktionary environment. Quercus solaris (talk) 01:10, 18 April 2024 (UTC)
Yeah, you’re right. I’m not in favour of this because Wikipedia links are not that important, and splitting them up means possibly putting too much information adjacent to definitions which some readers may not like. — Sgconlaw (talk) 04:45, 18 April 2024 (UTC)
Yes, too much clutter and noise, distracting from the definitions. How about a small ref-style Wikipedia icon which jumps to the wp link in the "Further reading" section? Jberkel 07:46, 18 April 2024 (UTC)
I only voiced my agreement with the concept, not the design, has not been made for this place in the first place, and the IP has not implied to be final for under the definition lines. Fay Freak (talk) 22:12, 18 April 2024 (UTC)
Kind of, but it's closer to how internal reference links/footnotes are handled. Maybe MediaWiki references could actually be used, if their design can be customised (ex. show a little wp logo instead of , etc.) Jberkel 22:33, 18 April 2024 (UTC)
OP here. i will try to read these discussions (April and October) soon, but my brain can't handle it right now.
i clicked the link after Einstein2's signed post. Does that automatically ping Einstein2? Does it automatically ping anyone/everyone who's posted in this subsection of the beer parlor? Now that others have posted, if i instead clicked the link after Sgconlaw's signed post (for example, if i wanted Sgconlaw to clarify that post, i wouldn't really be replying to anyone else), would clicking there ping the people who posted above (Quercus solaris, Fay Freak, Sgconlaw), but not the people who posted below (Jberkel and Einstein2)? i assume if i click instead, no one gets pinged?
i started this in Wiktionary's tea room (and should i go back there for this part?) because i was asking particularly about the API page, which includes the definition advanced primer ignition -- sum of parts, but with each part having its own multiple definitions, leaving the reader to guess if API refers to a more difficult introductory text on any basic concept or a substance (used to prep wood or metal for painting) catching fire, or any number of other mix-and-match combinations of definitions for each of those three initial words. So in this case, the link to Wikipedia did not distract from the definition, but more or less provided the definition. So no matter how Wiktionary decides to link to Wikipedia most of the time, i wonder if API merits a deviation from usual format. ;-) :-P
To be more accurate, you have to link to their user page. I can do something like "No, I'm not pinging anyone- why do you ask?", and as long as I sign it, it becomes a ping. The last part is what many people don't understand: it doesn't matter how you format it, if the signature isn't added in the same edit as the ping, it won't work. If you add a ping in later edits, it just makes it look like you pinged someone- no one receives the ping. The only exception is linking to the user page in an edit summary- that always works. Chuck Entz (talk) 05:28, 21 April 2024 (UTC)
indented translations: "Ancient" or "Ancient Greek"?
Common practice indents Ancient Greek and Mycenaean Greek translations under a Greek header, where the header line itself lists Modern Greek translations. Usage isn't consistent, however, on whether the lines read "Ancient Greek" and "Mycenaean Greek" or just "Ancient" and "Mycenaean". What should it be? Personally I lean towards "Ancient Greek" and "Mycenaean Greek"; I notice for example that Arabic lect translations indented under the Arabic header always spell out "Egyptian Arabic", "Moroccan Arabic", etc. rather than just "Egyptian", "Moroccan", etc. The general principle I would advocate is when the indented language is a full L2 language, use the full name of that language. This is consistent with how both Arabic and Chinese translations are handled. Benwing2 (talk) 01:07, 19 April 2024 (UTC)
when we do a search it reads the string from the template, so e.g. this search (great choice of word, i know, but i didnt want to waste time searching for something else) produces
μοτός is an Ancient translation of the word pledget ("small absorbent pad").
Instead of reading "Ancient Greek". Likewise the same type of search will turn up results like "---- is a Cyrillic translation of ----" because Serbian and some other digraphic languages have the indented forms just read "Cyrillic". If we could change this it would be good. —Soap—03:46, 19 April 2024 (UTC)
BTW I have modified my script to sort translations to automatically indent all varieties of Greek listed under the family tree at Category:Ancient Greek language under a Greek header, and to rename Ancient -> Ancient Greek and Mycenaean -> Mycenaean Greek (and similarly for Epic, Ionic, Doric, Aeolic, Boeotian, etc.). I haven't run it yet pending consensus that this is the right thing to do. Benwing2 (talk) 07:05, 19 April 2024 (UTC)
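A renaming pass of the kind described could look roughly like the following Python sketch. It is not Benwing2's script: `GREEK_VARIETIES` and `expand_nested_labels` are hypothetical names, and it assumes translation lines of the form `* Greek: ...` with nested lines of the form `*: Ancient: ...`.

```python
import re

# Short labels to expand when nested under the parent header (from the list above).
GREEK_VARIETIES = {"Ancient", "Mycenaean", "Epic", "Ionic", "Doric", "Aeolic", "Boeotian"}

NESTED_LINE = re.compile(r"^\*: ([^:]+):(.*)$")


def expand_nested_labels(wikitext, parent="Greek"):
    """Rename e.g. '*: Ancient:' to '*: Ancient Greek:' under the parent header only."""
    output = []
    under_parent = False
    for line in wikitext.splitlines():
        if line.startswith("*:"):
            match = NESTED_LINE.match(line)
            if under_parent and match and match.group(1) in GREEK_VARIETIES:
                line = f"*: {match.group(1)} {parent}:{match.group(2)}"
        elif line.startswith("*"):
            # A new top-level language line; remember whether it is the parent.
            under_parent = line.startswith(f"* {parent}:")
        output.append(line)
    return "\n".join(output)
```

Nested lines under other headers, such as a "Cyrillic" line under Serbo-Croatian, are left alone, which matches the distinction drawn in this thread between language names and script names.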
(Notifying Mahagaja, Sartma): User:Erutuon, User:-sche: do any of you have an opinion about this? I notice that the use of "Ancient Greek" rather than "Ancient" is what the translation adder generates, and it's consistent with the handling of most other indented language sets, including Sami, Kurdish, Romani, Mari, Sorbian, Nenets, Arabic, Chinese, etc. The only exceptions where something other than a language name is specified in indented lines are (a) when a script is mentioned instead of a language name (e.g. underneath Serbo-Croatian, Mongolian, Old Church Slavonic, Javanese, Malay, etc.); (b) in some cases where etym-only varieties are indented instead of full languages (e.g. varieties of North Frisian, Ossetian and Albanian); (c) in the case of Bokmål and Nynorsk. Benwing2 (talk) 02:10, 21 April 2024 (UTC)
I agree with your general principle of spelling out full language names ("Ancient Greek" is better than "Ancient"), this also helps if anyone has the presence of mind to Ctrl+F and look for "Ancient Greek" when they can't find it in alphabetic order. I am not personally a fan of nesting translations at all, but I recognize that other people like it. To me, it seems unintuitive to call a language e.g. "Whatever" in its L2, and sort that L2 after Walloon and before Zulu, but then in translations tables, sort it under A — if we're not considering Whatever to be a dialect of Apache when it comes to having actual entries, it seems unintuitive to me to be subsuming it like a dialect of Apache in translations tables, and I have sometimes thought a translation was missing from a table (only to find it upon trying to add it) as a result of this. But I recognize that other people feel the opposite way and think sorting all Apachean languages under A (etc) is the more intuitive thing. - -sche(discuss)02:53, 21 April 2024 (UTC)
@-sche Yes, I am of two minds about this for the reasons you state. I think it's especially problematic when the L2 language name doesn't include the language-set name in it. I asked about this specifically in the context of Aramaic in the Grease pit discussion that prompted this: besides lects ending in "Aramaic", there's also "Mlahsö", "Turoyo", "Classical Syriac", "Hulaulá", "Hértevin", "Koy Sanjaq Surat", "Lishana Deni", "Lishanid Noshan", "Lishán Didán", "Senaya", "Classical Mandaic" and "Mandaic". I have never heard for example of Mlahsö and would have no idea that it's nested under Aramaic instead of found under M. (OTOH I assume someone who adds a Mlahsö translation or goes looking for one will know that it's a variety of Aramaic.) Benwing2 (talk) 03:33, 21 April 2024 (UTC)
@Benwing2: Given that "Greek" is ambiguous between Ancient Greek and Modern Greek (and presumably other varieties and chronolects of Greek), might it be worth labelling Modern Greek translations as "Modern Greek" and nesting them under "Greek" like all the other varieties and chronolects of Greek? I've noticed Sarri.greek specify "Modern Greek" in several of her edit summaries, so perhaps she has an opinion regarding this. 0DF (talk) 17:22, 21 April 2024 (UTC)
(If you mean renaming the language everywhere,) On one hand, this would also solve the problem (discussed further up this page) of people not realizing "Greek" means the modern language, and so writing that this or that Coptic term derives from "Greek". On the other hand, "Greek" is clearly a modern more common name for the language than "Modern Greek", and "Modern Greek" would be a weird header to give long-obsolete terms from the early end of the time period el covers (centuries ago). On a balance, I don't think it's a good idea. (If you mean only renaming the language in translations tables, that still seems confusing, to have two names for the language in different places, and (again) to be labelling obsolete old terms as "Modern".) I think we're just stuck with some confusion. (It's certainly not the worst such confusion, compare cases where two languages are both called the same name, e.g. Riang, but one is sometimes also spelled Reang, so we call that one Reang and the other one gets exclusive use of Riang, probably confusing anyone who wants to add terms in Reang but is familiar with it being Riang and so adds the terms as Riang...) - -sche(discuss)17:55, 21 April 2024 (UTC)
@-sche: I meant only in translation tables, although I note that το Βικιλεξικό uses Νέα ελληνικά(Néa elliniká, literally “New Greek”) for its el language header. IMO, it wouldn't be nearly as confusing as you suggest, since Modern Greek translations are currently given at "Greek"; they would still be given at "Greek" if we nested "Modern Greek" there, just without the Ancient/Modern confusability. To your other point, very many modern things are obsolete, and not just in language; remember that all of Shakespeare is (Early) Modern English. 0DF (talk) 21:52, 21 April 2024 (UTC)
Language code is manually specified instead of guessed based on the year, allowing for more flexibility
Term transliteration and formatting takes advantage of Module:links, giving us automatic transliteration, |t=, and |alt= for free.
Volume, page number, and url have been added as optional parameters, allowing for more specificity
Work "presets" are defined as their own templates, making them much easier to add and customize
See Module:ko-etym for the maintenance burden posed by the previous strategy of putting everything in the module
As an example of customization, Template:User:Lunabunn/ko-attest/YB uses the 세종한글고전 database to automatically link to the relevant page or volume; this kind of thing would have been impossible or very difficult under the previous model
It does not automatically categorize the term as native Korean or derived/inherited/... from the source language
Especially for editors,
Work title/year/... formatting is handled by the module, making it much easier to cite works without dedicated templates
Only the language code is required with both the term and work being optional
This reduces the burden of finding a first attestation when you only know the Middle Korean form (from e.g. a dictionary)
Incomplete invocations such as these can be tracked using Module:debug/track
We can finally treat derivation separately from attestation. It is often the case that the attested form is not a direct ancestor of the entry form, and we can make this clear.
Given below are some example etymologies for words that are currently problematic:
{{ko-attest|ko-ear|부럽다|...}}, probably from an earlier {{inh|ko|okm|nocat=1|-}} {{com|okm|nocat=1|블다|-어ᇦ-|pos2=adjectivizer}}.
Equivalent to now-obsolete {{dbt|ko|notext=1|븗다}}, first attested as {{ko-attest/...|okm|h=none|븗다|...}} from {{com|okm|nocat=1|블다|-ㅸ-|pos2=adjectivizer}}.
{{ko-attest/...|섧다|...}}.
The origin of this particular form is {{unc|ko|nocap=1}}. By analogy with words like {{m|ko|즐겁다}}, one may reconstruct {{com|okm|nocat=1||-어ᇦ-|alt1=*셜-|pos2=adjectivizer}}, but the existence of such a verb is dubious. Instead, it may simply be the result of reanalysis by analogy.
The origin of this particular form is uncertain. By analogy with words like 즐겁다(jeulgeopda), one may reconstruct *셜(*syel) + 어ᇦ(-eW-, adjectivizer), but the existence of such a verb is dubious. Instead, it may simply be the result of reanalysis by analogy.
This topic DOES intend to:
Propose bringing this new template into main space for us to start using
Encourage discussion about the benefits and shortcomings of this new template
Acknowledge the need of separating attestation and etymology, hopefully providing a foundation for seamless transition into structured etymologies in the future
@Lunabunn: The work on providing references is great. I am not familiar with sources and how to use them but it seems someone needs, at least to make some effort and find the entry. My original objection to simply remove {{ko-etym-native}} was only when no replacement is offered. Of course, the more detailed the reference is, the better! Thanks.
Most sources are going to lack English translations, unfortunately, but at least providing the original image as I have with YB seems to be a good addition.
As for the 신증유합 example, (as with most other "high-profile" texts,) scans are available online. The issue with linking to individual pages may be hosting, as many scans are not directly linkable as are Sejong DB's; Chom.kwoy might be able to shed some light here.
In addition, I will note that the |url= parameter is optional & all URL-related work is done in the work template. See Template:User:Lunabunn/ko-attest/YB. For works where we cannot get per-page links, we always have the flexibility to link instead to the entire PDF (or not link at all). Lunabunn (talk) 06:45, 19 April 2024 (UTC)
@Lunabunn: You have just LIED that "As per the discussion at Wiktionary:Beer parlour/2024/March#Template:ko-etym-native without parameters is pointless and misleading". It was pointed out that it is being used to record that the term feels 'native', a feeling that is not rigidly tied to the true etymology. Unfortunately, there has been no agreement that I have seen on a method of recording that. --RichardW57m (talk) 15:47, 19 April 2024 (UTC)
Like I said, I never called for ko-nat to be deprecated, nor did I claim that we reached a consensus. All I said was that I have created a new template as per the discussion---which is true---explicitly noting that further action will require more discussion. Your comment is thus explicitly outside of the scope of this thread.
Please do not resort to childish attacks such as calling me a "LIAR" solely based on your unfortunate assumption that everyone who doesn't personally agree with you must be malicious and deceitful. Lunabunn (talk) 16:13, 19 April 2024 (UTC)
That being said, for additional context that I have already provided in the original thread:
ko-nat doesn't print a "Of native Korean origin" message unless it is invoked incompletely without the attestation parameters. In such cases, this template would not be used anyway (as there is no attestation to speak of).
The only other difference is categorization, which editors can add manually (or, for the time being, even using ko-nat alongside this template). There is absolutely no difference from the user/reader's point of view. This template merely provides nicer formatting and more flexibility for editors. Lunabunn (talk) 16:20, 19 April 2024 (UTC)
@Lunabunn: I called you a liar because you were lying. As you have now noticed, the discussion did not conclude that it was pointless. If you quote unestablished assertions as accepted when they were challenged, expect to be called a liar. --RichardW57 (talk) 20:17, 1 May 2024 (UTC)
@RichardW57 I explicitly said it was NOT pointless (as thus far discussed) and it will NOT be deprecated. I am truly sorry that you do not seem to want to read what I write, but please do not take that as an excuse to stoop to unfounded ad hominem, especially about an editing community and a language that makes up zero of your 18K+ edits here on Wiktionary.
Further, until you can contribute valid, constructive, and relevant feedback, please understand my refusing to engage. Lunabunn (talk) 00:02, 2 May 2024 (UTC)
@Lunabunn I've just reread this discussion. I have difficulty distinguishing the colour of visited links from black, and I mistook part of the title for the text of the comments. I withdraw the charge of lying, and apologise for my eyesight. --RichardW57 (talk) 06:10, 2 May 2024 (UTC)
I personally really like how this template is laid out. Template:ko-etym-native was already very hard to maintain, very rigid in how it works and just generally difficult to edit, at least for me. As I already mentioned and/or agreed on the aforementioned template's beer parlour thread, it made unnecessary and uncalled-for judgments and there was no good way around it. Besides, the name was pretty misleading to a lot of editors, in my opinion. I think this template just solves most of the problems associated with it. Plus, it is called exactly what it does: it mentions Korean attestations. I'm sure we will need to discuss some stuff, mostly about picture URLs, but for now it should work. I was (and still somewhat am) of the mind that first attestations in Modern Korean's etymology section are unnecessary; however, until Middle Korean entries become sufficiently good for templates like this (which, who knows when it's gonna happen), I think it'll do just great. - Solarkoid (talk) 18:06, 19 April 2024 (UTC)
I am not an editor for KO entries, aside from minor maintenance stuff, as my Korean capabilities are very much at the "beginner" level currently. That said, from the perspective of a reader of KO entries, including occasionally the wikitext, I think this proposed template looks good, both in terms of output and wikitext. And as others have also noted, it "does what it says on the tin", unlike Template:ko-etym-native. +1 from me. 😄 ‑‑ Eiríkr Útlendi │Tala við mig22:17, 19 April 2024 (UTC)
Given that it has been two weeks and no opposition has been raised, I have gone ahead and created {{ko-attest}}. @ Korean editors, it would be much appreciated if you could try out the new template and leave any feedback in Template talk:ko-attest. Thank you! Lunabunn (talk) 22:15, 3 May 2024 (UTC)
Not something I'm an expert in, but reversions like this justify more explanation than "POV and other problems", and I see many, many similar removals by @Sarcelles. I also see Sarcelles pointing to various Wikipedia talkpages in various edit summaries, too, but these are wholly irrelevant to Wiktionary. Theknightwho (talk) 20:46, 20 April 2024 (UTC)
@Sarcelles: Wiktionary is a dictionary, so quotes are used to show usage, not verify facts. The most wrongheaded and vicious lie is fine as a quote as long as it uses the term in the spelling and with the definition being illustrated. Chuck Entz (talk) 21:14, 20 April 2024 (UTC)
Although lest anyone get the wrong impression of our policies and practices, a quote with vicious misinformation is better on the Citations page than in an entry! and if there are enough other quotations to verify the sense, may not be needed at all. In the case of linguistic families, I'm sympathetic to the point of view that if we had, for example, an old book somewhere which erroneously included Hungarian in a list of Slavic languages, "The Slavic languages include Russian, Polish, Serbian, Hungarian, Slovene, and Bulgarian.", this might not actually be useful to us as a citation of Slavic, and/or factual considerations of whether Hungarian was actually Slavic might outweigh the existence of that cite, when it came to deciding whether to label a word as "Slavic, specifically Hungarian" (to give a parallel to some of the entries being discussed here). (But if there were many such cites using a different-than-usual sense, as may be the case with some of the Low German / Niederdeutsch cites being discussed, that would merit a separate sense line, yes, like obsolete taxonomic categories, etc.) - -sche(discuss)22:03, 20 April 2024 (UTC)
It's not an issue about East Bergish, but about Ostbergisch. The English term is a protologism (failing WT:CFI), while the German term has three usages (hence passing WT:CFI).
Ostbergisch is not a universally used term, instead it's rather uncommon (cp. the label), but it exists (at least three usages, passing WT:CFI).
This stuff isn't in line with the quotations (e.g. 2021: "Ostbergisch, einer niederfränkischen Mundartgruppe"), and instead is rather defining Bergisch, another term and a totally different thing.
This should be quite self-explanatory: The entry lists different senses of the term in question, some senses being old but attested and illustrated by older usages. Some quotes were removed. At first they were removed as being "doubtful quotations", but they aren't doubtful: they can be verified with books.google.com. Then they were removed as being "ages old". But WT also covers "old" (obsolete, archaic, dated) and other (e.g. technical, uncommon, rare, offensive) usages/senses. And "old" usages/senses are usually attested with old quotations, which were provided to show the existence (cp. WT:CFI about attestation etc.).
--00:11, 21 April 2024 (UTC)
I doubt that either of the IP author or me knows the rules. How many quotations are advisable? What is done with obsolete concepts? Sarcelles (talk) 04:20, 21 April 2024 (UTC)
@Sarcelles: Well to put it bluntly, you are even supposed to have a POV, being a currently informed observer of past matters. Being a secondary source, Wiktionarians have some leeway to be creative and combine different views: notably w:WP:SYNTHESIS does not apply, which enables Wiktionary to have the most correct, balanced, perspectives on word histories, in comparison against references available. We don’t write biographies of living persons, all is in the linguistic material, every statement based on accessible corpora, right? That’s why we afford this approach. So you don’t have to start from references containing the language names in the first place but base glosses helpful to readers on your experience of usage, your general impression, which of course does not preclude additional research with the purpose of creating entries.
Clinging to the understandings of previous references does not even pay justice to the diachrony and synchrony distinction, which reflects in distinguishable styles and orders of glosses: some sort their word explanations by most common to rarest, others chronologically to give better credit to etymology which too needs to be written, others logically (as field, which otherwise is unreadable), depending on how they felt.
If you are into psychobabble (I’ve recently grown fond of): intersubjectivity, being the actual goal rather than objectivity because our human readers, as social animals or neurotypes more interested in social proof and framing rather than pure reason anyways, requires greater mentalisation capabilities in the form of the editor being able to interpret what was behind an author’s utterances when he wrote a quote. We cannot go without our personal interpretation of the information available to us, for talking to other people—whenever you are not only speaking to yourself in the shape of private language—involves the hot potato of theory of mind, which we may be more or less explicit about: Talk:skoliosexual. Fay Freak (talk) 08:55, 21 April 2024 (UTC)
I've been following the debate between Sarcelles and the Paderborn-based IP on enWP for quite some time (and also actively participated in some of the discussions). Sarcelles has been kind enough to bring to my attention that they also extend their bickering to various other Wikimedia projects, including English Wiktionary.
First of all, @Sarcelles, keep in mind that Wiktionary is about words, while Wikipedia is about things that pass the threshold of general notability. So e.g., Ostbergisch might be a contentious concept; for my part, I believe it is a useless label for an arbitrary residual artefact of rigid inclusion criteria (= Uerdingen line, Benrath line, en-Einheitsplural) for four surrounding major dialect groups (Kleverlandish, South Low Franconian, Ripuarian, Westphalian). But that's largely irrelevant for Wiktionary. The term is a neologism coined by German dialectologist Georg Cornelissen. Because of his affiliation to the LVR (Landschaftsverband Rheinland), this term has received an enormous online boost and consequently has gained some prevalence outside of Cornelissen's bubble in quite a few printed texts. So it clearly deserves an entry here.
The only things that need to be discussed here are:
Does the label "uncommon" sufficiently capture its usage?
Is the definition ("A Low Franconian variety, spoken in the German state of North Rhine-Westphalia") too much in-universe? We don't define "fish" as "A biological taxon", but with a simple descriptive definition. What about "A label for dialects spoken in Bergisches Land in the German state of North Rhine-Westphalia", maybe with direct attribution to Cornelissen?
As for the Paderborn IP entries, I can see a general problem with them being very ambitious in trying to capture all possible definitions of highly fluid terms like Niederrheinisch in German dialectology. Currently, we have four definitions in Usage notes. The second one is overly detailed and as a result wrong; Wenker and Wiesinger have used the term for quite similar concepts, but Wiesinger didn't see the Uerdingen line as the southern demarcation for Niederrheinisch, but rather the Akzentgrenze. And worst of all, the most common definition is lacking: most traditional dialects in the Niederrhein area are moribund or extinct; for most people, Niederrheinisch refers to the regional Umgangssprache of that area which is markedly different from colloquial Rheinisch (as spoken around Cologne) and Ruhrdeutsch. Bluntly speaking, I think that the IP's ambitions are occasionally not matched by sufficient familiarity with the relevant literature, leading to the odd inaccuracies and lacunae in the Usage notes of Niederrheinisch (and potentially other entries). –Austronesier (talk) 10:22, 21 April 2024 (UTC)
"for my part, I believe"
I for my part prefer NPOV; it doesn't matter what I or some other WT editors think, believe or prefer: others obviously didn't share this belief, used Ostbergisch, and the term is sufficiently attested.
"Wiesinger didn't see"
Sometimes terms are used without giving a clear definition first, so Wiesinger (1975/2017) could have the term without a clear definition.
Wiesinger (1983, in Dialektologie vol. 2, HSK 1.2) has e.g. the following (which indeed could lead to another sense):
p. 859: "Das Niederfränkische am Niederrhein Niederrheinisch oder Kleverländisch bezeichnet. Obwohl es mangels der Lautverschiebung als „niederdeutsch“ bezeichnet wird, .
p. 856 ("Karte 47.10"): Here several "Niederfränkisch-ripuarische Strukturgrenzen" are given.
In the given quote, Wiesinger (1975/2017) refers to Goossens and the Ürdingen line: "GOOSSENS erwies sich auch ihm die Ürdinger Linie als die sprachliche Hauptscheide gegen das Westfälische im Osten und das Niederrheinische im Norden."
"A label for "
That's a start of a rather bad definition like "fish: a word/term for cold-blooded vertebrate animal that lives in water". (Cp. e.g. de.WT's Einsetz-Probe.)
"And worst of all, the most common defintion is lacking: Niederrheinisch refers to the regional Umgangssprache of that area which is markedly different from colloquial Rheinisch (as spoken around Cologne) and Ruhrdeutsch."
de.WP doesn't have that sense either. Mundarten des Niederrheins", and calls the regiolect which developed from the dialect(s) "Niederrhein-Deutsch". Though, albeit no surprise, it lacks several senses which are attested in en.WT. Feel free to add this sense, but please don't forget quotes.
--21:13, 22 April 2024 (UTC)
"I for my part prefer NPOV, it doesn't matter what I or some other WT editors think, believe or prefer": Not the first part of the sentence, but the latter part of the sentence is quite a good description of the attitude of this user. What's more, this user seems to confuse WT with a talk page of German POV, German language, German manners and German quality. They are piling up arguments in favour of their views excessively. Sarcelles (talk) 05:37, 23 April 2024 (UTC)
it isn't, term is ambiguous (broader and stricter senses), one can easily find quotes for "niederrheinische Dialekt" (etc.), even when calling a speech-form dialect there can be sub-dialects (like (real example) "Der niederrheinische Dialekt theilt sich in folgende Unterdialekte") Sarcelles (talk) 08:08, 3 May 2024 (UTC)
A question to this user: Why are you clinging on those opinions, even if they are refuted? There is always a usage of the term to be found, which fits your opinions. Sarcelles (talk) 21:54, 7 May 2024 (UTC)
This is a rare appropriate reaction by you. WP:NPOV is a challenge you should acknowledge. "One swallow does not a summer make" is something I advise you to start to follow.
This vandalism page on de-Wikipedia has been closed by an administrator and resulted in a one day block for MicBy67. I have been removing or replacing maps by the already mentioned registered user of numerous user names in many projects recently. Sarcelles (talk) 11:10, 30 May 2024 (UTC)
I can only weigh in with original research and personal opinion here. In the eastern part of the Bergish area there is/used to be a local language situated on the border between southern Westphalian, which already is fairly idiosyncratic and southern (e.g. has some proto-High German features like using 'twerg' for dwarf as an example of a regular shift), and the dialects west of it. In terms of the features I know of, this eastern Bergish dialect is in a part of the dialectal continuum so, well, continuous, that it's not really possible to qualify whether it's a really Limburgish Low German or extremely Low German Limburgish. So I'm aware of East Bergish existing in that regard. Since I know it exists and I can't classify it clearly as part of anything else, I'm not inherently opposed to judging it 'existing'. But I have no scientifically valid opinion to contribute. Korn (talk) 22:32, 9 June 2024 (UTC)
Comment added in diff at 06:57, 13 June 2024 (UTC).
Hi, I added the indication of time for you. As for the out-of-process removal of senses, as in &&, please see criteria for inclusion (CFI) and requests for verification (RFV) (as already mentioned in e.g. ). Wiktionary is a descriptive dictionary, chiefly relying on quotations with usages of terms. If a term/sense is doubted, a request for verification can be created. Then either the term/sense will be attested, or it will be removed as the result of the proper process. You're welcome. --2003:DE:3717:7150:29E7:48F6:CCD9:8E7B07:24, 13 June 2024 (UTC)
As this is a cross-wiki problem, it should be discussed in other projects.
In de (standard German) Wikipedia, another user reverted two edits by 2003:de:3717:7198:d990:2ed8:ad58:a3df, being not the only contact. Revert: use discussion page possibly, WP:WAR end
Well, reverts like , , (for the last one cp. & ) speak for themselves, and here it can be seen very fast and quite easily: Simply click on the link "Ägypten auf Adherents" and one will see that it's unrelated to Egypt and Christianity (both Protestantism and Catholicism), instead there's unrelated stuff like "What is your ideal Disney prince according to your zodiac sign".
Related to the above topic on whether Ancient Greek should be indented as "Ancient" or "Ancient Greek", there are several cases where the existing practice is equivocal as to whether to indent. MediaWiki:Gadget-TranslationAdder-Data.js contains a list of languages to indent, but some others tend to be indented as well. Some questionable cases:
German: The translation adder data puts Alemannic German, Kölsch and Palatinate German indented under German but not other German varieties. (Not to mention that it puts German Low German and Dutch Low Saxon indented under Low German instead.) The stats bear this out to some extent: Alemannic German is indented 795 times vs. non-indented 127 times, whereas Bavarian is indented 33 times vs. non-indented 452 times, and Pennsylvania German is indented 32 times vs. non-indented 355 times. But I don't see why Alemannic German should be treated differently from Bavarian, Pennsylvania German, East Central German, etc. What should be done? A fuller table looks like this:
| Language | Indented | Non-indented |
|---|---|---|
| Alemannic German | 795 | 127 |
| Bavarian | 33 | 452 |
| Central Franconian | 78 | 72 |
| Cimbrian | 2 | 144 |
| East Central German | 8 | 69 |
| East Franconian | 2 | 3 |
| Kölsch | 1 | 1 |
| Luxembourgish | 2 | 3476 |
| Middle High German | 55 | 43 |
| Mòcheno | 1 | 362 |
| Old High German | 47 | 169 |
| Palatinate German | 0 | 0 |
| Pennsylvania German | 32 | 355 |
| Rhine Franconian | 49 | 45 |
| Swabian | 7 | 81 |
| Swabian German | 3 | 0 |
| Vilamovian | 0 | 408 |
Low German: The translation adder data says to indent German Low German and Dutch Low Saxon, but not Middle Low German. In reality, it's 17 indented Middle Low German vs. 31 non-indented.
Greek: The translation adder data says to put Ancient Greek and Mycenaean Greek under Greek. Presumably other ancient varieties get indented too. But what about Pontic Greek, Mariupol Greek, Tsakonian, etc.? Currently it's 7 Pontic Greek indented vs. 86 non-indented, 0 Mariupol Greek indented vs. 14 non-indented, 3 Byzantine Greek indented vs. 1 non-indented, 5 "Cappadocian" indented vs. 12 "Cappadocian Greek" non-indented.
Persian: The translation adder data says to indent Iranian Persian, Classical Persian and Dari, but not Middle Persian or Old Persian. In reality, it's 27 indented Middle Persian vs. 256 non-indented, and 15 indented Old Persian vs. 84 non-indented.
Apache: The translation adder data says to indent Western Apache, Jicarilla and Chiricahua, but not Plains Apache or Lipan (probably an oversight). In reality, it's 2 indented Lipan vs. 6 non-indented and 2 indented Plains Apache vs. 9 non-indented. Note also that Navajo is on the same level phylogenetically as the other Apache varieties but is almost certainly excluded intentionally.
Irish: The translation adder data says to indent Old Irish and Middle Irish but not Primitive Irish (almost certainly an oversight). In reality it's 5 to 5 for Primitive Irish.
Khanty, Mansi, Rusyn: All recently split. Based on other patterns, probably they should all be indented but they're not mentioned in the translation adder.
Nenets: Not mentioned in the translation adder. Should Tundra Nenets and Forest Nenets be indented? Currently it's 3 Tundra Nenets indented vs. 56 non-indented and 2 Forest Nenets indented vs. 10 non-indented.
Nahuatl: Not mentioned in the translation adder. Should the various Nahuatl varieties be indented?
Malay: Not mentioned in the translation adder. Should the various Malay varieties (Brunei Malay, Ambonese Malay, Baba Malay, Pattani Malay, North Moluccan Malay, Manado Malay, etc.) be indented?
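To make the German counts above easier to compare, here is a minimal Python sketch that turns each pair into an indentation share and sorts the varieties into rough buckets. The figures are copied from the table; the 80% threshold and the helper names `indented_share` and `classify` are illustrative assumptions, not anything the translation adder or any bot actually implements.

```python
# (indented, non_indented) counts copied from the table in this thread.
COUNTS = {
    "Alemannic German": (795, 127),
    "Bavarian": (33, 452),
    "Central Franconian": (78, 72),
    "Cimbrian": (2, 144),
    "East Central German": (8, 69),
    "East Franconian": (2, 3),
    "Kölsch": (1, 1),
    "Luxembourgish": (2, 3476),
    "Middle High German": (55, 43),
    "Mòcheno": (1, 362),
    "Old High German": (47, 169),
    "Palatinate German": (0, 0),
    "Pennsylvania German": (32, 355),
    "Rhine Franconian": (49, 45),
    "Swabian": (7, 81),
    "Swabian German": (3, 0),
    "Vilamovian": (0, 408),
}


def indented_share(indented, non_indented):
    """Fraction of occurrences that are indented, or None if there are none."""
    total = indented + non_indented
    return indented / total if total else None


def classify(indented, non_indented, threshold=0.8):
    """Bucket a variety by current practice (threshold is an arbitrary choice)."""
    share = indented_share(indented, non_indented)
    if share is None:
        return "no data"
    if share >= threshold:
        return "mostly indented"
    if share <= 1 - threshold:
        return "mostly non-indented"
    return "mixed"


if __name__ == "__main__":
    for name, (ind, non) in sorted(COUNTS.items()):
        print(f"{name}: {classify(ind, non)}")
```

With these assumptions, only the varieties in the "mixed" bucket would need a decision by discussion; the others already lean clearly one way in practice, whatever the translation adder data says.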
Indent all German except Ausbausprache Luxembourgish (we don’t indent Maltese either). Indent all Greek. I guess indent Apache due to terminology difficulties (as Aramaics). Of course chronolects of Irish. Indent all Nahuatls, I have no doubt, and Malays. Fay Freak (talk) 08:05, 21 April 2024 (UTC)
Don't indent Old (or Middle) High German IMO, I look for that under O (and M), just like Old English which we also don't indent. Don't indent Luxembourgish, which, as the proverb goes, has its own army. I am sceptical of the helpfulness of nesting Bavarian or Alemannic German, but maybe editors who add those languages can weigh in on what they would find helpful vs unhelpful. I am very ambivalent about indenting the rest. To me it seems normally unhelpful / unintuitive to take a language that we sort alphabetically when it comes to L2s, and that we indeed give its own L2s in recognition that it is not merely a dialect of something else, but then sort it like a dialect in translations tables; to me it only seems intuitive when it's a situation like Chinese where we also sort all the languages under one L2. (I am neutral about nesting Arabic, where at least all but one(?) of the nested languages have "Arabic" in the name and are probably thought of as dialects by many people.) But I recognize that other people find it more intuitive, for whatever reason, to sort L2s and translations differently. - -sche(discuss)15:09, 21 April 2024 (UTC)
Error: There is no English to indent under. Otherwise I would suggest to indent Old and Middle English under it. Fay Freak (talk) 17:59, 21 April 2024 (UTC)
Interesting point! I mentioned Old English because there's a place in the code where ang and enm have specifically been commented out of being nested... I guess at some point in the distant past someone put them under an empty "English:" header. (Anyway, I oppose nesting Middle or Old High German, and am inclined to oppose nesting the others; it is unintuitive to sort "Old Norse" under "O" but "Old High German" under "G", IMO.) - -sche(discuss)18:10, 21 April 2024 (UTC)
@-sche FYI there are two Arabic varieties (not counting Maltese) that don't have "Arabic" in their name: Hassaniya and Nubi. Nubi is a creole (although Juba Arabic is also a creole but has Arabic in its name, presumably because Juba by itself is a city), but I don't know why Hassaniya doesn't have Arabic in its name (cf. the Wikipedia article Hassaniya Arabic). Benwing2 (talk) 21:00, 21 April 2024 (UTC)
FWIW It was proposed to add "Arabic" to the name of Hassaniya at WT:RFM#Renaming_mey for consistency back in 2017, if anyone has time to make the rename. (I would suggest checking with anyone editing the language first to make sure they're still on board with a rename, but no one seems to be editing the language; our few entries seem to have been added years ago and the most recent editor to touch them was TongcyDai just helping with categorization and listing no knowledge of the language.) - -sche(discuss)21:58, 22 April 2024 (UTC)
Repeating what I wrote in #indented translations: "Ancient" or "Ancient Greek"?, given that "Greek" is ambiguous between Ancient Greek and Modern Greek (and presumably other varieties and chronolects of Greek), might it be worth labelling Modern Greek translations as "Modern Greek" and nesting them under "Greek" like all the other varieties and chronolects of Greek? Note that this would not move Modern Greek translations to a new, unintuitive location, and it would prevent the conflation that Ancient Greek and (Modern) Greek often suffer. 0DF (talk) 01:56, 22 April 2024 (UTC)
//el:'Greek/Modern', don't nest Modern Greek (Atelaes)
So evidently this idea was considered and rejected. Maybe User:-sche has the history on this. (User:Atelaes has not been active for 10 years so this decision is very old at this point and could potentially be reconsidered.) Benwing2 (talk) 02:07, 22 April 2024 (UTC)
@0DF I think we'd probably have better luck searching through Beer Parlour or Grease Pit archives and examining the changelog history of MediaWiki:Gadget-TranslationAdder-Data.js. Probably the rationale is there somewhere. BTW -sche is especially good at digging up relevant old discussions. Benwing2 (talk) 02:40, 22 April 2024 (UTC)
@Benwing2: I tried searching through the Beer parlour, the Grease pit, and the user talk pages for Atelaes and Conrad.Irwin. The only somewhat relevant thing I could find was this post by Atelaes, which states “Greek is solely a super-heading, with no actual content (the content is in its subheading under ‘Modern’)”, which suggests that Modern Greek did use to be nested under "Greek", as I propose. It seems that such nesting caused problems for a Javascript gadget of his, User:Atelaes/TargetedTranslations.js. Could that be the entire reason? Does anyone still use that gadget? 0DF (talk) 03:51, 22 April 2024 (UTC)
@0DF Well, it's been ported to the MediaWiki space under MediaWiki:Gadget-TargetedTranslations.js, so it's become an officially supported gadget. But I doubt it still has issues with this because there are tons of nested translations. It's more a case then of what we believe the right thing to do is. Benwing2 (talk) 03:57, 22 April 2024 (UTC)
@Benwing2: I did some testing with ] and ]. AFAICT, the way to “elect preferred languages” using MediaWiki:Gadget-TargetedTranslations.js is completely different from that using User:Atelaes/TargetedTranslations.js: with the latter, one had to type the language name precisely; with the former, one toggles a star icon (⭐), which takes the place of the bullet to the left of each language. As with Atelaes’ script, if I select, for example, “Kurdish” (simpliciter) in the table in Middle Ages, it doesn't show me anything; however, if I select “Northern Kurdish” therein, it shows me “Kurdish: Northern Kurdish: Serdema Navîn(ku)” (following “historical period - ”). So yes, the issue with nested translations appears to have been resolved. All that being said, if the selected language is not nested in the way the gadget expects it to be, it will fail to display the selected language's translation, even if it exists. For example, if I select “San Juan Atzingo Popoloca” in the translation table in Middle Ages, wherein it is a stand-alone unnested language, it shows me “San Juan Atzingo Popoloca: please add this translation if you can” (following the Northern Kurdish translation + “; ”); however, for the first table in ] (liquid H₂O), the gadget only shows me “Kurdish: Northern Kurdish: av(ku)f ”, even though the table also contains a translation for San Juan Atzingo Popoloca (⁴nta⁴). The reason for this is that San Juan Atzingo Popoloca is nested under “Popoloca” in that table. I deselected “San Juan Atzingo Popoloca” in Middle Ages, added *: San Juan Atzingo Popoloca: {{t-needed|poe}} nested under * Popoloca: to Dark Ages, and then reselected “San Juan Atzingo Popoloca” there; refreshing ], the gadget then showed me “Popoloca: San Juan Atzingo Popoloca: ⁴nta⁴”; refreshing ], the gadget fails to show me anything for San Juan Atzingo Popoloca. This, if anything, shows the technical value of consistency.
The translations in ] are in many places listed contrary to the way they are currently listed by the translation-adding tool. Is there a bot that enforces the prescribed way of listing translations in translation tables? If so, can it be run on ]? 0DF (talk) 17:17, 22 April 2024 (UTC)
@0DF @-sche Hmm. I have a script I've been working on to indent languages that aren't indented but where the translation adder says they should be, but it currently doesn't go the opposite direction (i.e. unindent where the translation adder says not to indent) except in a few cases. If we are serious about being consistent with our indenting, we should probably fix the translation adder to know about the various cases identified in water/translations, e.g. Amuzgo, Chinantec, Coptic, Fula, Javanese, Kashmiri, Kipchak, Ladino, Mari, Mazahua, Mazatec, Me'phaa, Mixtec, Nahuatl, Otomi, Popoloca, Popoluca (are they different?), Talysh, Teke, Tepehua, Totonac, Zapotec and Zoque. -sche would probably disagree with all these cases; regardless I think we need a general principle concerning whether to indent or not rather than doing it in an ad-hoc, half-assed fashion. Then I can do a bot run to fix all the cases needing fixing. Even better IMO, however, would be to fix the targeted translations gadget to be less picky about whether a given language is indented or not. Benwing2 (talk) 21:06, 22 April 2024 (UTC)
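To illustrate what "less picky" could mean in practice, here is a minimal sketch in Python (not the gadget's actual JavaScript, which is not reproduced in this thread). It assumes translation tables use the `* Header:` / `*: Nested language: ...` line format quoted above; the function name `find_translation` and the sample tables are hypothetical and simplified (plain text instead of templates).

```python
import re

# "* Kurdish:" is depth 0; "*: Northern Kurdish: av" is depth 1, and so on.
LINE_RE = re.compile(r"^\*(:*)\s*([^:]+):\s*(.*)$")

def find_translation(table_lines, wanted):
    """Return (header_path, text) for `wanted`, whatever its nesting depth.

    Returns None when the language does not occur in the table.
    """
    path = []  # chain of headers leading to the current line
    for line in table_lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a translation line
        depth = len(m.group(1))
        name = m.group(2).strip()
        text = m.group(3).strip()
        path = path[:depth] + [name]
        if name == wanted:
            return path, text
    return None

# Simplified stand-ins for the two tables discussed above.
water = [
    "* Kurdish:",
    "*: Northern Kurdish: av",
    "* Popoloca:",
    "*: San Juan Atzingo Popoloca: ⁴nta⁴",
]
middle_ages = [
    "* Kurdish:",
    "*: Northern Kurdish: Serdema Navîn",
    "* San Juan Atzingo Popoloca: (translation needed)",
]

for table in (water, middle_ages):
    print(find_translation(table, "San Juan Atzingo Popoloca"))
```

Because the lookup keys on the language name alone and only records the header path for display, the same selection would work whether or not a given table nests the language.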
Re "Popoloca, Popoluca (are they different?)": confusingly, yes—at least in theory. Scholars try to use -u- for Mixe-Zoque languages, -o- for Oto-Manguean languages. The Mixe-Zoque Popolucas in particular are, as the name suggests, not even all in one subfamily, let alone intelligible with one another, and they only share the name "Popoluca" (together with the Popolocas) because that was the Nahuatl pejorative for "(non-Nahuatl) gibberish". (Despite this etymology, the speakers of the languages have often gotten used to the name and find academics' proposed clearer replacements unfamiliar; Lynda Boudreault's 2018 A Grammar of Sierra Popoluca devotes some pages to this.) The various things we nest really seem to run a gamut from "related and sometimes considered varieties/dialects of one language (whether correctly/defensibly or not)" (Arabic, Chinese) to "I guess people might group these for convenience?" to "not related, but the names sure do sound similar". - -sche(discuss)21:58, 22 April 2024 (UTC)
@Benwing2, -sche: I've given this some more thought. I have come to the conclusion that the most intuitive and only consistent way to list translations is not to indent those for languages which get L2 headers, and only to indent those for languages whose terms are treated as (for want of a better word) dialects of other languages (marked by {{lb|und|sense labels}} or howsoever else). After all, if a person is used to looking for (say) Hokkien terms under the “Chinese” language header, wouldn't that person then expect to find any Hokkien translations listed under “Chinese” in a translation table? And conversely, wouldn't a person who's used to the “Middle French” language header (rather than, say, “French (Middle)”, “French, Middle”, or “French: Middle”) be likely to search for a Middle French translation under “Middle French”, as opposed to under “French”? Thoughts? 0DF (talk) 19:20, 21 May 2024 (UTC)
Although we have a lot of users in CAT:User el, I don't think we've ever had a very active Greek-speaking editor community of more than a few users (our lack of active Greek speakers has been a memorable problem when Greek questions have come up), so my guess would be that Atelaes and whoever else was actively editing Greek at the time may have briefly talked about it (potentially even somewhere now-inaccessible like IRC, where some parts of our language-code 'naming' schema were also hashed out) and just decided they personally didn't want it nested, so as Benwing says, it's probably just a case of whether we personally now want it nested. FWIW, not nesting it seems consistent with most other languages, where the modern or main language is not nested (Arabic, French, etc). I am not inclined to nest it, personally, but (pending any general decision on whether to nest vs stop nesting things in general) would defer to the few active Greek users we have now, which seems to be ... Sarri and Omnipaedista and maybe someone else? - -sche(discuss)04:08, 22 April 2024 (UTC)
@Benwing2 @Sarri.greek I'm afraid that I haven't absorbed all the arguments above and, approaching semi-retirement, I'm neutral on the subject. New Ancient Greek users unused to Wiktionary will probably look under G — on the other hand uniformity with the majority of languages might make sense. So I don't have strong feelings on the issue (as indeed about whether Greek should be called Modern Greek.) — Saltmarsh☮05:30, 22 April 2024 (UTC)
@-sche Please take a look at User:Benwing2/analyze-indented-translations-20240420-dump. I analyzed the existing occurrences of indenting. The table sorts first by the number of times a header (e.g. Chinese, Arabic, Kurdish, Serbo-Croatian) occurs with at least one indented language under it (where an indented language could be any translation with any label, including things like "Cyrillic"). Under each header is listed (alphabetically) all languages that occur under it anywhere. To the right are the counts of how many times the language occurs under the header in question and then how many times the language occurs total (indented or not, and indented under any header). I think we need to establish a general principle for whether to indent a language, and I propose the following:
Whenever a language is of the form "Qualifier Macrolanguage" e.g. "Saterland Frisian", "Iraqi Arabic", "Tundra Nenets", "Western Mari", "Ancient Greek", "Middle French", etc., it gets indented under the macrolanguage.
The main advantage is that this makes it possible to quickly compare translations from similar languages. This, I think, is fundamentally why people like indented translations.
The main disadvantage (as -sche would put it) is that it makes it harder to locate a given language's translation. (While this is true, I think it's partly negated by having a consistent policy of when we indent; our current ad-hoc situation gives us the worst of both worlds.)
If an L2 language is missing the name of the macrolanguage, but (a) logically goes under it (which should maybe exclude pidgins and creoles), and (b) does not have its own "army" (as the proverb goes), it also gets indented. Hence, Mlahsö, Turoyo, Mandaic go under Aramaic and maybe also Classical Syriac (although this could potentially be said to have its own "army"), but not Maltese under Arabic or Navajo under Apache. The criterion about having its own army is an attempt to balance interest in a language primarily for comparative purposes, which justifies indenting it (and happens more with obscure languages) vs. interest in the language for its own sake, which justifies not indenting it so it's easier to find (which happens more with more well-known and well-represented languages).
So I propose the following:
Use the table I linked to above as a starting point to compile a complete list of all macrolanguages and individual languages to indent under them.
Update the translation adder appropriately.
Run a script to indent any L2 languages not correctly indented. (I am a bit loath to automatically go the other direction because there may be unrecognized etym-only varieties nested underneath a language that we don't want to automatically unindent.)
Along with this proposal I propose that the names of L2 and etym-only languages as appearing indented under macrolanguage headers should *ALWAYS* match the actual name of the L2 or etym-only language, except for a small, well-defined set of exceptions. Possibly, for example, Bokmål and Nynorsk could be exceptions (the policy just enumerated would call for 'Norwegian Bokmål' and 'Norwegian Nynorsk'), although I must say I don't see a compelling reason to make such an exception. Benwing2 (talk) 07:40, 24 April 2024 (UTC)
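The "Qualifier Macrolanguage" rule above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the two tables are hypothetical and would really come from the compiled list the proposal describes, and the sketch assumes the macrolanguage is always the last word of the name.

```python
# Hypothetical, incomplete tables for illustration only.
MACROLANGUAGES = {"Frisian", "Arabic", "Nenets", "Mari", "Greek", "French", "Aramaic"}

# Languages whose names lack the macrolanguage word but which the proposal
# would still indent (no "army" of their own).
EXPLICIT_PARENTS = {"Mlahsö": "Aramaic", "Turoyo": "Aramaic", "Mandaic": "Aramaic"}

def parent_header(name):
    """Return the header to indent `name` under, or None for no indent."""
    if name in EXPLICIT_PARENTS:
        return EXPLICIT_PARENTS[name]
    words = name.split()
    # "Qualifier Macrolanguage": at least one qualifier word, then the
    # macrolanguage name as the final word.
    if len(words) >= 2 and words[-1] in MACROLANGUAGES:
        return words[-1]
    return None

for lang in ["Saterland Frisian", "Ancient Greek", "Greek", "Turoyo", "Maltese"]:
    print(lang, "->", parent_header(lang))
```

Maltese and Navajo stay unindented here simply because they are absent from both tables, which mirrors the idea that exceptions are decided by an explicit list rather than inferred.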
One more issue: Should we use "Latin" or "Roman" in reference to Latin-script entries in languages such as Serbo-Croatian, Uzbek and Ladino? Serbo-Croatian, which is the split-script language with by far the largest number of translations, favors "Roman" but for other languages, current usage is split. I favor "Latin" as that is the name of the script used both here and in Wikipedia (the Wikipedia article is called Latin script). Stats are as follows:
OK, another issue I discovered as I try to update my script to normalize indented lang names: What about more specific varieties of existing L2's? This came up in the context of Hadrami Arabic (which is assigned an ISO 639-3 code ayh but where we merged it into Yemeni Arabic), but the same thing applies e.g. to Yemeni Arabic (San'ani), South Levantine Arabic (Palestinian), Lebanese Arabic, etc. Presumably we don't want to canonicalize these to the L2 name because that would lose information, but how should we format the extra lect info? Benwing2 (talk) 06:53, 25 April 2024 (UTC)
@-sche No one seems to care much about this topic; I'm wondering if you have any opinions. If not I'll probably proceed with what I proposed above. Benwing2 (talk) 03:42, 5 May 2024 (UTC)
Regarding how not to lose information about a lect we merge into another lect: maybe at that point we just use qualifiers? Assuming you mean Hadhrami (we have "xhd" "Hadrami" as a full language, the extinct language spoken 800 BC – 600 AD as distinct from "Hadhrami"-the-modern-dialect), then an etym-only language like "Hadhrami Arabic: foo" could become "Yemeni Arabic: foo(Hadhrami)" or "...(Hadhrami Arabic)"? That seems like it would work as far as cleaning up existing uses without losing info. How to make that systematically addable using the translation-adder (as opposed to just cleaning up existing uses), well... in theory, in the same way it resolves "Latin" to "la", it could resolve someone typing in "Hadhrami" (or the etym-only code for Hadhrami) to the code for Yemeni Arabic + a qualifier, but getting that to work might (or might not) be a lot of work (and there doesn't seem to be much demand to add Hadhrami, so that work seems low-priority). An alternative might be to just not include support for when someone inputs those particular etym-only languages/codes, so the adder throws an error and people have to look up what code they need to use and in the process realize they need to use a qualifier, but this is obviously a lot less user friendly. - -sche(discuss)15:26, 5 May 2024 (UTC)
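A rough sketch of the "resolve to the L2 plus a qualifier" idea, in Python rather than the translation adder's own code. The mapping `MERGED_LECTS` and both function names are hypothetical; only the Hadhrami example comes from the discussion above.

```python
# Hypothetical mapping: merged lect name -> (L2 name it was merged into, qualifier).
MERGED_LECTS = {
    "Hadhrami Arabic": ("Yemeni Arabic", "Hadhrami"),
}

def resolve_lect(name):
    """Return (L2 name, qualifier); qualifier is None for ordinary languages."""
    return MERGED_LECTS.get(name, (name, None))

def format_translation(name, term):
    """Render a translation line, keeping the lect info as a qualifier."""
    l2_name, qualifier = resolve_lect(name)
    suffix = f" ({qualifier})" if qualifier else ""
    return f"{l2_name}: {term}{suffix}"

print(format_translation("Hadhrami Arabic", "foo"))
print(format_translation("Yemeni Arabic", "foo"))
```

The same table could serve both for cleaning up existing lines and for accepting the merged name as input, so no lect information is lost in either direction.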
Re "Latin" vs "Roman": historically, we had to use "Roman:" because the gadget would take "Latin:" to mean the language, and so in certain cases I don't offhand recall the exact parameters of (adding a language that would come alphabetically next to Latin in the list of present translations? and/or when adding a Latin-language translation, maybe only if there was not a Latin-language but was a Serbo-Croatian etc Latin-script "Latin:"-line in the list? we should also double-check that there's no issue adding a Serbo-Croatian Latin-script form when there is already a Latin-language translation), it would add the translation in the wrong place. Whether this is still an issue and hence a blocker to using "Latin", I don't recall; it should be possible to test. (Also, in the same way that users perennially take Ladin to be a typo and try to "fix" it (abuse filter 45), some human users might take "Latin:" to mean the language, and either try to "fix" it or be confused.) - -sche(discuss)15:26, 5 May 2024 (UTC)
I think I was able to find what you are referring to. If you add a "Latin:" translation indented under Serbo-Croatian, and then attempt to add a Latin language translation, the translation adder throws an error and says "please reformat". I'm guessing it assumes the Latin entry under Serbo-Croatian is for the Latin language, since sometimes people put languages indented under other languages; although in that case I don't know why it didn't just try to add the new Latin entry to the existing Latin: entry under Serbo-Croatian. However, this shouldn't be too hard to fix (I found where in the code it throws the error, which is in the portion that searches for where to insert), and this will come up with other cases where languages have the same names as scripts, e.g. Arabic and Hebrew. I would rather fix this than continue using a substandard workaround. The idea is to check if an indented thing is potentially a script, and if so, not assume it's a language. As for users trying to "fix" "Latin:", this seems less likely to happen than with Ladin, both because people are more familiar in general with Latin as a script name and because "Latin:" will often be paired with "Cyrillic:", which should make it clear it's not a mistake, and in addition because I think Serbo-Croatian is less likely to look like Latin than Ladin is (since Latin and Ladin are both Italic languages).
As for Had(h)rami (note, Ancient Hadrami and Hadhrami Arabic are based on the same underlying word, just romanized in different ways), I tend to agree with you that qualifiers are a good way of doing this. Under "More", the translation adder has a box for adding qualifiers, so this is probably good enough for the moment. Benwing2 (talk) 02:35, 6 May 2024 (UTC)
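The script-versus-language check described above might look something like this. It is a Python sketch, not the translation adder's code; `SCRIPT_NAMES` is a hypothetical list, and table lines are assumed to be already parsed into (depth, name) pairs.

```python
# Hypothetical list of names that can label a script when indented.
SCRIPT_NAMES = {"Latin", "Roman", "Cyrillic", "Arabic", "Hebrew"}

def is_script_label(depth, name):
    """An indented line whose name is a known script is treated as a script."""
    return depth > 0 and name in SCRIPT_NAMES

def find_language_line(entries, language):
    """Return the index of the line for `language`, ignoring script labels.

    `entries` is a list of (depth, name) pairs in table order.
    Returns None when the language has no line yet.
    """
    for index, (depth, name) in enumerate(entries):
        if name == language and not is_script_label(depth, name):
            return index
    return None

table = [(0, "Serbo-Croatian"), (1, "Cyrillic"), (1, "Latin")]
print(find_language_line(table, "Latin"))           # no Latin-language line yet
print(find_language_line([(0, "Latin")] + table, "Latin"))
```

One caveat with this heuristic: a language that really is nested and happens to share its name with a script would be skipped, so such cases would need an explicit exception.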
Latin-script Yiddish
I encounter Yiddish written in Latin script a lot (more often than I encounter it written in Hebrew script, actually, although my understanding is that overall Hebrew script is much more common?). Unlike with e.g. romanized Cyrillic, I do not get the impression that authors are using Latin script because they can't typeset Hebrew: Latin script is simply one of the scripts that people who use Yiddish use, like (and indeed to perhaps even a greater extent than) e.g. Arabic is one of the scripts people write Afrikaans in. I would therefore like to suggest that in the way we have Afrikaans entries like اِتْسْ with the definition "Arabic spelling of iets", or at the very least in the way we have Gothic entries like aiwaggeli as "Romanization of 𐌰𐌹𐍅𐌰𐌲𐌲𐌴𐌻𐌹", we should have entries for Latin-script Yiddish pointing people to the Hebrew spellings. - -sche(discuss)21:06, 21 April 2024 (UTC)
@-sche I agree and I think there should be a better way of formatting it than the way done at اِتْسْ, which uses {{form of}}, and a better way of categorizing than using sccat= (maybe {{head}} should auto-categorize terms that are in a script other than the "dominant scripts" of the language, whatever those may be?). Either a new template, something like {{script spelling of}}, or a particular way of using the existing {{spelling of}} template. Benwing2 (talk) 22:05, 21 April 2024 (UTC)
The approach of using |sccat= in {{head}} currently includes the POS in the category name, whereas the approach with {{spelling of}} does not, since it isn't available. Which do you think is better? In the case of Afrikaans, we only have 2-3 terms per POS so probably they should be combined, but I could see the opposite argument being made when there are lots of such terms. Benwing2 (talk) 01:18, 22 April 2024 (UTC)
Is {{rfv-pron}} the correct template for challenging pronunciations? It puts items in Category:Requests for references for pronunciations in Lithuanian entries, but I think references will often not be the resolution. Additionally, this is Wiktionary, not Wikipedia, so we are achieving a poor substitute for our goals when we merely select from other dictionaries. (The word that prompted this question was Lithuanian policija, where I am after the truth, not references.) --RichardW57m (talk) 09:33, 22 April 2024 (UTC)
You can reference pronunciations by means of primary sources innit. Corpus linguistics. Though I agree with the interpretation that the author of the template text was Wikipedia-infected and thus designed it otherwise. Fay Freak (talk) 16:03, 22 April 2024 (UTC)
Modify/deprecate NFCC or request re-enabling Special:Upload for all users?
Here is a situation that I'm not sure is fair to us. On one hand, we have a policy called Wiktionary:Non-free content criteria (NFCC), which describes how content, including files, may incorporate copyrighted non-free material under fair use. On the other hand, the well-known upload form, Special:Upload, is restricted to administrators here, which puts the NFCC policy somewhat in conflict with the upload configuration. So:
Do we still need NFCC without further local file uploads? As such, I guess these texts are no longer needed in NFCC (one under Policy section, and two under Enforcement):
"...including copyrighted images, audio clips, videos and other media files."
"A file with a valid non-free-use rationale for some (but not all) of the places it is used in will not be deleted. Instead, the file should be removed from the places for which it lacks a non-free-use rationale, or a suitable rationale should be added."
"If a user suspects that a file does not meet these criteria, that user should list that file on WT:RFDO. Files may be deleted after discussion."
Or can the NFCC simply be deprecated? Or:
Are there reasons to accept admin-only local uploads? Shouldn't uploading be either fully disabled or re-enabled for all users for fair-use files?
It should be amended to explicitly state that only admins can upload. We should not allow general uploads and there should almost certainly be very, very few pieces of non-free media here. —Justin (koavf)❤T☮C☺M☯04:29, 24 April 2024 (UTC)
I don't support broadening the ability to upload images. The NFCC is badly drafted imho, but I suspect nobody else really cares, nor does it actually matter. On the other hand, I'd support adding a note along the lines of what Koavf proposes. We could also say "in the extremely rare case that an image upload is required, non-admins can make a request at the grease pit". This, that and the other (talk) 09:18, 24 April 2024 (UTC)
@Ioaxxere Given the vanishingly small number of such images and the heavy restrictions on their presence, I would Oppose this until/unless we significantly relax the restrictions on non-free images. Benwing2 (talk) 06:58, 25 April 2024 (UTC)
@Benwing2: "Vanishingly small" is one way to put it — there's exactly one! (on thagomizer). It seems like the current low quality is mostly due to disinterest as opposed to heavy restrictions. The NFCC criteria are actually fairly broad. Ioaxxere (talk) 07:12, 25 April 2024 (UTC)
Comment: not sure it's a good idea unless we have enough volunteers able and willing to assess uploads and determine if they meet any NFCC, or whether they need to be deleted as being copyright infringements. There are many more experienced volunteers doing this at the Commons and the English Wikipedia, and they have to deal with many people who either don't understand or don't care about copyright or any NFCC uploading material under incorrect licences. — Sgconlaw (talk) 17:56, 25 April 2024 (UTC)
CFI for constructed languages
to note: this started as a discussion on the Wiktionary Discord server, where I asked how to include a constructed language in Wiktionary. I was told that there was currently a debate on whether these should be included at all, with contention also on languages already on mainspace, such as Ido and Volapük.
I am a tokiponist and wish to see it in Wiktionary in the future. the arguments listed against the inclusion of more conlangs and for the exclusion of the current ones seem weird to me. they are as follows:
these are not natural languages and have no native speakers, especially no native monolingual speakers. the several ways that the speakers' native language affects the conlang leads to many different fragmented styles of speech, in turn leading to the standard way of speaking being the one defined by its creator. additionally, the way that L1 and L2 speakers interact with their language is different. for this, language without native speakers should be removed.
Wiktionary's stated goal, as per WT:CFI, is to include "all words in all languages", and I don't see how these two arguments are relevant to the exclusion of these languages. on this discussion, I want to try to make a case for Toki Pona, as I'm not a speaker of any other excluded language.
first of all, even if these languages have no native speakers, they are still spoken (or have at some point in the past been spoken) by a large community. for Toki Pona, the latest census, from 2022, had more than 1400 respondents, and this counts only the speakers who responded, as many did not or were not even reached by the survey. this figure is larger than or comparable to other languages on Wiktionary, such as Ido and Interlingua. there are possibly more Toki Pona speakers, and these would want definitions for their language.
me and many contributors on the sona pona community, a Toki Pona wiki, have been collecting data on the works written in Toki Pona which were either physically published or freely licenced and available (or planned to be) on Wikisource. as of today, there are up to 125 authors, with many words having upwards of 100 independent citations. these follow the Wiktionary guidelines on durably archived works. if we were to include all of the available literature (possibly archived with the Internet Archive), these figures would certainly double in size.
these have many published books not only about the language but written in the language. this seems to be a big barrier for many languages, as book publishing is expensive and takes time (I have never published one, so I don't know).
in the Toki Pona community, there is not any superior guiding authority. the "standard" is mostly decided by what people consider to be correct and what is more widespread. some of what was written by the creator herself has since (and even at the time) been considered weird, as her books reflect only the way she speaks, while speakers have all control in the language.
a question may arise whether actually non-notable languages may be included by this change of policy. however, this would only be the first step that a conlang must take, as it must also have a population of speakers, citations spanning more than a year, possibly an ISO 639 code or something similar
@JnpoJuwan: You make some convincing points but can I ask why you'd like Toki Pona to be in mainspace? It seems like centralizing all Toki Pona-related information under Appendix:Toki Pona is the ideal approach given that Toki Pona has very few words in comparison with most languages. Ioaxxere (talk) 20:10, 25 April 2024 (UTC)
I'm generally against conlangs without native speakers as natives are what determine naturalness and correctness within a language. I think Toki Pona should remain an appendix language until things change. Vininn126 (talk) 20:24, 25 April 2024 (UTC)
Why should there be a need for "naturalness"? Conlangs aren't natural languages anyway. And correctness can be judged on another level: Is a term formed correctly (like Esperanto nouns ending in -o)? Does a construction follow the language's rules (word-endings and word-order)? --2003:DE:374E:E207:DC5C:89A9:BD02:9B52 21:32, 25 April 2024 (UTC)
I personally think all conlangs other than Esperanto should be in the Appendix, so I would oppose moving Toki Pona to mainspace (and support moving Ido, Interlingua and Volapük out of mainspace). Benwing2 (talk) 20:31, 25 April 2024 (UTC)
We need to draw a line somewhere on what we call 'language'. Obviously allowing anything is untenable, as we will be flooded by one-person conlangs. Allowing anything where we can find at least three cites for at least one word is possible, but pretty useless: Having a handful of words in the namespace for a conlang with thousands because only three resources in the language are durably attested isn't great for anyone. "Conlangs with native speakers stay, those without go" is a pretty valid line to draw. Thadh (talk) 21:45, 25 April 2024 (UTC)
of course having a small number of words for a conlang because there are only three sources would not be helpful. but why jump to conclusions like this? what is the normal process for adding a language to Wiktionary for natural languages? Juwan (talk) 22:27, 25 April 2024 (UTC)
I should note, I expressed this same opinion in Wiktionary:Beer_parlour/2023/February#Is_it_time_to_look_at_Toki_Pona_again? and 4 people agreed with me, including several of our most prolific contributors. I agree with User:Thadh that we need to take native speakers of conlangs into account, and most conlangs have very few (if any) true native speakers. I make an exception for Esperanto because it is well-known to have several thousand native speakers, which AFAIK can't be said for any other conlang. Benwing2 (talk) 21:51, 25 April 2024 (UTC)
@Theknightwho My concern about having "any speakers at all" for conlangs is that someone will then point to a survey somewhere claiming 2 native speakers in Hungary that will lead to some obscure conlang ending up in the mainspace. A lot of claims of native speakers for conlangs are exaggerated and I doubt there are any conlangs at all with monolingual native speakers, whereas all natural languages have or had monolingual native speakers. Benwing2 (talk) 22:54, 25 April 2024 (UTC)
That is not entirely true. Pidgins are natural languages which don't usually have any native speakers. And it's definitely possible - in some cases probably even likely - for a creole language to form without any monolingual native speakers. --Spenĉjo (talk) 23:27, 25 April 2024 (UTC)
There has to be some measure that can be followed. Otherwise, we'll have cases like the infamous case of a father trying to have his son become the first Klingon native speaker. AG202 (talk) 03:37, 26 April 2024 (UTC)
To review some of the arguments in that discussion:
isn't capable of expressing concepts outside its scope by design. Any translation from English into Toki Pona and back again would result in a distorted message.
This may have been thought to be the case as of a decade ago, but per @CitationsFreak's reply, it is not accurate. Speakers have discussed far-out-of-scope concepts such as non-Euclidean geometry, the mRNA vaccine, and the theory of relativity in Toki Pona. The lack of jargon doesn't preclude effective circumlocution.
These and other articles are context-rich enough that they can be translated into a natural language without an unusual level of distortion.
So you say that over a thousand people use Toki Pona. Sweet, but do they publish durably archived works? can Toki Pona entries meet our criteria for inclusion?
When I looked into this last year, I was able to find only two such works, and they were by the same author. I don’t know whether that has changed.
The spreadsheet linked in @JnpoJuwan's original post is part of an effort to address this. We can point to:
Several other published books
A printed zine that has been ongoing since 2021
Creative works by other authors, particularly poems and comics, that have been included in those sources
The initial 120 words have been used in many works that we believe meet the attestation criteria, by well over a dozen authors each. So have several words beyond that set, in spite of being contested as "unofficial" for the better part of a decade.
At least a couple dozen words ought to qualify for "clearly widespread use", being present in 100 authors' works (out of 125 authors as of writing):
And maybe the threshold should be lower; I just want to be safe for the purposes of this discussion.
Yes, most of the Toki Pona entries are currently light on citations. I and several other editors intend to address this. It would be perfectly fair to wait to mainspace Toki Pona until the entries have rigorous attestation. (Of course, that is a different discussion than the idea of barring Toki Pona from mainspace forever.) AgentMuffin4 (talk) 00:00, 26 April 2024 (UTC)
As for durably archived works, it's worth adding that Toki Pona is usually said to have about 120 to 140 words, but according to the spreadsheet there are at least 150(!) words with works from 3 or more different authors that are or can be durably archived. Probably not all of those words would qualify for attestation, because in a few cases it's articles from 3 contributors of the same magazine, which doesn't make for very independent sources. But the vast majority can probably qualify, seeing as 138 words have 10 or more authors listed, and 124 words have 25 or more authors. --Spenĉjo (talk) 15:15, 26 April 2024 (UTC)
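The independence concern can be made concrete with a small sketch in Python. Everything here is an assumption for illustration: the function name, the record layout, and the crude rule that a word needs at least three distinct authors and at least three distinct publications. The sample records are invented placeholders, not data from the spreadsheet.

```python
from collections import defaultdict

def attested_words(citations, min_sources=3):
    """Return words cited by enough distinct authors in enough distinct publications.

    `citations` is an iterable of (word, author, publication) tuples.
    This is a rough filter, not a full independence check.
    """
    authors = defaultdict(set)
    publications = defaultdict(set)
    for word, author, publication in citations:
        authors[word].add(author)
        publications[word].add(publication)
    return sorted(
        word
        for word in authors
        if len(authors[word]) >= min_sources
        and len(publications[word]) >= min_sources
    )

# Placeholder records: word_a has three authors but only one magazine,
# word_b has three authors across three separate works.
sample = [
    ("word_a", "author_1", "magazine_x"),
    ("word_a", "author_2", "magazine_x"),
    ("word_a", "author_3", "magazine_x"),
    ("word_b", "author_1", "book_1"),
    ("word_b", "author_2", "book_2"),
    ("word_b", "author_3", "zine_1"),
]
print(attested_words(sample))
```

Under this rule the three-contributors-in-one-magazine case is filtered out while words spread across separate works pass, which is the distinction drawn above.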
As for "natives are what determine naturalness and correctness within a language", this tends to be true for most natural languages, but definitely not all of them.
For example, if I'm not mistaken, in Swahili the L2 speakers vastly outnumber the L1 speakers. And even among L1 speakers, many are only first or second generation L1 speakers who learned to speak the L2 way of speaking, instead of learning the grammatically more complex traditional dialects still used among the Swahili people. I don't know for sure, but I'm fairly confident that the vast majority of Swahili content on Wiktionary describes usage and formal standards that were mostly shaped and decided by L2 speakers. This is also the case for many (if not most) creole languages.
And this is definitely also the case for Esperanto.
As a fluent Esperanto speaker and active member of the Esperanto community, in my opinion the Esperanto native speakers are made into a far bigger deal by non-Esperantists than they should be. Esperanto native speakers are mostly indistinguishable from L2 speakers who have actively used the language for a couple of years. In fact, because none of them have had an education in Esperanto, on average they even tend to be less proficient when it comes to spelling, grammar, and similar types of "formal correctness" than fluent L2 speakers, who have actively studied the language to become proficient.
What determines naturalness and correctness within Esperanto (and, in my opinion, in any language), is the active core speaker base - most influentially the published authors, the magazine editors, the teachers, the writers of textbooks and online courses, the people who spend a lot of time in discussions about how to best express certain concepts in the language, etc. Some of those people are native speakers, but the vast majority are not. People who have dedicated decades of their lives to Esperanto aren't in any meaningful way less valuable and influential than Esperanto native speakers. And if all native speakers were to magically disappear tomorrow, neither the language itself nor its notability would noticeably change.
So I emphatically disagree that "has native speakers" is a meaningful metric. It seems like a relatively easy line to draw in the sand (if you ignore dubious or exaggerated claims of native speakers of some conlangs), but it is not one that says a whole lot about the degree of activity and language proficiency of its community, the size and quality of its literature and music, the degree to which the language is owned by the speaking community (as opposed to being codified in formal standards determined by only a small number of people), or to which it is experiencing natural evolution despite its artificial origins, etc. You can have any and all of those things without native speakers, and you can also have a few native speakers without any of those, if you have one parent who is dedicated enough and knows what they're doing. --Spenĉjo (talk) 23:21, 25 April 2024 (UTC)
It is. If it has no native speakers then there is a high likelihood of it being a playground for psychiatric disorders. Those with schizoid, perfectionist, or avoidant personality traits. I formulate extra-carefully due to the multiformity of the language-capable hard cases we actually meet. It’s not funny.
Reality was too complex to control so we would have to invent peculiar judging standards to play against them. I don’t wanna play against them. I tell them to stop it and get back to learning something real, not to say useful, hopefully.
Only that a psychologist would also have to have a language special interest to finally write a paper about it; the incidence of such a thing was not high enough for it to happen, the visible distress of the maladaptive behaviour exhausted in producing paper and websites that will be read by few and probably vanish, from but free time—which you are in the right to expend but we are overqualified for. Fay Freak (talk) 00:13, 26 April 2024 (UTC)
To make absolutely sure I'm not misinterpreting:
Someone who speaks a language with no native speakers probably has a psychiatric disorder (!?);
Most constructed languages have no native speakers;
Therefore, only people with psychiatric disorders use these languages;
Quotes by people with psychiatric disorders don't count towards CFI (!?);
Therefore, these languages cannot meet CFI.
If there is some respectful interpretation here that does not involve strawmanning, ableism, pathologizing a hobby, or moving the goalposts, then I sincerely, profusely apologize for so wildly misconstruing it. AgentMuffin4 (talk) 02:09, 26 April 2024 (UTC)
Everyone is talking about the required notability to get into mainspace. It makes me wonder if this means there is currently no notability requirement to get into an appendix. If there isn't, then can I make an appendix for my conlang that only I speak? (And if there is, then what's the point?) --2007GabrielT (talk) 00:27, 26 April 2024 (UTC)
@2007GabrielT User:Fay Freak's writing is sometimes hard to understand but they are correct that you can't just make an appendix for your own conlang; before doing that, you need to make a request in the Beer parlour for this and get consensus that the conlang is notable enough to be recorded in Wiktionary. If you ignore this process and just start creating appendix entries, your entries are liable to get deleted. The issue with the mainspace is that the bar is much higher for conlangs in the mainspace, which is why there are so few of them (few enough to be counted on one hand), and many of the ones that are there were essentially grandfathered in. (There used to be more, e.g. I think Novial and Interlingue used to be in the mainspace but were moved out due to a vote.) Benwing2 (talk) 07:36, 26 April 2024 (UTC)
If a conlang is notable it should be notable. Why would it matter where on the website it is? If it's notable enough to be on Wiktionary it should be notable enough to be… on Wiktionary. 2007GabrielT (talk) 15:11, 26 April 2024 (UTC)
no, however, these are books, zines, podcasts, blogs and collections of poems and stories by multiple authors, Wikisource is simply being used as somewhere to collect and store them. Juwan (talk) 10:44, 27 April 2024 (UTC)
why does it matter that they are self-published instead of going through the expensive process of publishing with another publisher? Juwan (talk) 11:05, 27 April 2024 (UTC)
It can affect the durability. Being on Wikiquote and such is something, but we've not always accepted self-published books. Lack of an editor and the like. Vininn126 (talk) 11:07, 27 April 2024 (UTC)
As far as I can tell, it's not very well defined in our CFI. People have brought up the subject before, leading to much debate about what it means, so unfortunately, I'm not sure there is a clear definition I can give. I bring up self-publishing more as we have generally not allowed it in the past, like I said, for having a lack of editing and the like. Vininn126 (talk) 11:17, 27 April 2024 (UTC)
If lack of editing is the main issue, it seems to me that these two sources would have a good chance of passing the CFI:
lipu tenpo (2021-ongoing, ISSN 2752-4639) is a short magazine (about 16 A5 pages per issue) that as of now has published 25 issues, which have multiple editors credited for each issue. (It has a CC BY-SA 4.0 license, so its PDFs can be - and have been - durably archived on Commons, with transcription on Wikisource ongoing.)
The Toki Pona version of The Wizard of Oz by Sonja Lang (2024, ISBN 978-0-9782923-7-9) is a self-published book that was thoroughly edited by at least two of the five people credited as proofreaders (including myself).
The Toki Pona Bible Project (a CC0 project with translations checked by both proofreaders and biblical language specialists) is for the most part still very much a work in progress, but earlier this year they released a complete translation of the book of Jonah.
jan Sitata (2022) - a CC0 translation of the 1922 novel Siddhartha by Hermann Hesse. There are two people credited in Toki Pona as "searching for small mistakes and small changes". I'm not sure whether that qualifies as editing.
lipu kule (2021-ongoing) - a CC BY-SA open collaborative blog or online magazine of sorts. I'm not sure how much editing they do (and they don't have an about page explaining stuff like that), but I'm sure they at least proofread the articles for spelling and grammar.
lipu monsuta (2021-ongoing) - a CC BY-SA yearly horror anthology. Having contributed in one issue, I know that they do proofreading and some editing, but not a lot. My impression was that the editor tends to respect the author's choices unless there's something obviously wrong or badly phrased. My own novelette-length submission was looked at by many volunteers, with notable suggestions from three contributors.
Notable, but doesn't pass CFI:
jan Keta li weka! (a 2022 translation of Gerda malaperis!) was proofread and/or edited by four people, but at the moment it isn't durably archived and can't easily be archived due to copyright issues.
I should point out that there is some author/editor overlap between those different sources. For example, Sonja Lang was the author of The Wizard of Oz, one lipu kule article, and one text in lipu monsuta, and was also proofreader or editor for jan Sitata and the bonus texts in Tokipono: La lingvo de bono. But if the first two sources as well as any one of the Bible project, lipu kule or lipu monsuta would qualify, we should be able to attest most Toki Pona words with three high-quality, durably-archived, independent sources. --Spenĉjo (talk) 17:29, 27 April 2024 (UTC)
This is all still based on the idea that we'd want to include conlangs to begin with. I've stated my opinion above on what criteria should be included. Vininn126 (talk) 18:01, 27 April 2024 (UTC)
I am aware. But this thread of the discussion is about someone suggesting that "conlangs should have a decent amount of their words be attested three times in books", so this was a reply to your points about self-published works and editing in that context. --Spenĉjo (talk) 22:12, 27 April 2024 (UTC)
out of the ones listed are as follows:
Title | Short description
kili lili | short poem posted to Knight's site
toki suli pi jan Jesu | translation of a Bible posted to Knight's site
lipu lawa | contract made by Gabel and Martin as a piece of art and to prove the legitimacy of toki pona, self-published to the Toki Pona Forums
Toki Pona: The Language of Good | physical publication: original edition published by Tawhid, translated editions either by the same publisher or self-published by Lang
Toki Pona Dictionary | physical publication: self-published by Lang, proofread by several contributors
The Wonderful Wizard of Oz (Toki Pona edition) | physical publication: self-published by Lang, proofread by several contributors
Fingtam Languages | physical publication series: self-published by Fingtam I assume
lipu tenpo | zine: it is the publisher?
lipu kule | community blog: self-published??? I am not sure about the publication and proofreading process to answer
lipu monsuta | zine: same as above
utala musi | poetry and writing competition: self-published???
kalama sin | podcast: self-published???
toki soweli | translations of Beatrix Potter's illustrated books: self-published by Samys
jan Sitata | translation of Siddhartha: proofread by contributors
this mostly shows that I am not completely sure what "self-published" is supposed to be.
while I do give out this information, I do NOT want it to be used as a way to invalidate the points that I, User:Spenĉjo and User:AgentMuffin4 have made. the notion of 'native speakers is what counts' is something we want to deconstruct. Juwan (talk) 13:44, 27 April 2024 (UTC)
We do not quote other Wikimedia sites (such as Wikipedia), but we may use quotations found on them (such as quotations from books available on Wikisource).
As was explained above. In other words, if you can find a citation from a source that passes CFI which is recorded on Wikisource, the fact that it's on Wikisource doesn't make it invalid. We need to state this to prevent the inevitable misunderstandings about "no copying from other wikis" ... we're not copying from Wikisource, we're copying from a third-party work that's being featured on Wikisource. But note that Wikisource does not have the same stringent criteria we do ... they follow their rules. So merely being on Wikisource doesn't mean something passes CFI here. —Soap— 11:35, 27 April 2024 (UTC)
Auto-protection of highly visible templates/modules
I just made a bot script that can do this - perhaps we should do this, given that the occasional spates of template vandalism we get (like earlier today) are highly disruptive. — SURJECTION / T / C / L / 17:30, 25 April 2024 (UTC)
Weak support - I think SCORE_TO_PROTECT is currently too low. 1000 is a mere 500 entries in mainspace; there are plenty of language-specific templates that are used in more entries than that, but still need to be edited semi-regularly. Lunabunn (talk) 19:54, 26 April 2024 (UTC)
@Lunabunn Semi-protection is a pretty low bar; AFAIK it just means you have to have an autoconfirmed account, which happens to all accounts after a certain (relatively low) number of edits. Benwing2 (talk) 20:01, 26 April 2024 (UTC)
@Surjection @Benwing2 Ah, upon only skimming through the script I was under the (mistaken) impression that it would lock the page. Semi-protection should be fine; thank you for the clarification! Strong support. Lunabunn (talk) 20:21, 26 April 2024 (UTC)
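For readers following the threshold discussion, here is a minimal sketch of how a score-based check of this kind might look. The bot script itself is not shown in this thread, so everything except the name SCORE_TO_PROTECT is hypothetical; in particular, the weighting (a mainspace transclusion counting double) is only an assumption inferred from the remark that a score of 1000 corresponds to 500 mainspace entries.

```python
from dataclasses import dataclass

# Threshold named in the discussion; 1000 is the value being debated above.
SCORE_TO_PROTECT = 1000

# Assumed weights: the thread only implies that a mainspace use counts double.
MAINSPACE_WEIGHT = 2
OTHER_WEIGHT = 1


@dataclass
class TemplateUsage:
    """Hypothetical record of how often a template or module is transcluded."""
    name: str
    mainspace_uses: int
    other_uses: int = 0


def visibility_score(usage: TemplateUsage) -> int:
    """Weighted transclusion count used to decide whether a page is 'highly visible'."""
    return usage.mainspace_uses * MAINSPACE_WEIGHT + usage.other_uses * OTHER_WEIGHT


def needs_semi_protection(usage: TemplateUsage, threshold: int = SCORE_TO_PROTECT) -> bool:
    """True when the score reaches the threshold, i.e. the page should be semi-protected."""
    return visibility_score(usage) >= threshold
```

Raising the threshold, as suggested above, would simply mean passing a larger value for threshold; language-specific templates with a few hundred uses would then stay editable by everyone.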
I am writing to you to let you know the voting period for the Universal Code of Conduct Coordinating Committee (U4C) is open now through May 9, 2024. Read the information on the voting page on Meta-wiki to learn more about voting and voter eligibility.
The Universal Code of Conduct Coordinating Committee (U4C) is a global group dedicated to providing an equitable and consistent implementation of the UCoC. Community members were invited to submit their applications for the U4C. For more information and the responsibilities of the U4C, please review the U4C Charter.
Please share this message with members of your community so they can participate as well.
User:-sche Sorry to keep pinging you but I'm guessing you may know what's up here. Category:Western Panjabi language (code pnb) says all lemmas should be placed under Punjabi (code pa), and it's apparently been that way since at least 2020 (when User:Kutchkutch asked the same question in the Beer parlour), but we still have an L2 entry for Western Panjabi. WT:LT says nothing about Punjabi. I would like to clean this up properly because we still have a bunch of Western Panjabi categories as well as 366 translation entries for "Western Panjabi" (only 5 of which are nested under Punjabi) and 4 translation entries for "Western Punjabi" (3 of which are nested under Punjabi). Can we eliminate code pnb and agree on the spelling "Punjabi" instead of "Panjabi"? Benwing2 (talk) 00:05, 26 April 2024 (UTC)
I believe this is another case where people didn't get around to properly finishing a language merger. Dijan added the notice that "Western Panjabi" should just be "Punjabi" back in March of 2013, but apparently didn't actually remove the code (T:pnb wasn't deleted till November 2013, around the time the code was moved to Module:languages, where it's been ever since). I believe you can just finish the merger. Pinging @عُثمان who is AFAICT our only recently-active natively Punjabi-speaking editor. (Wikipedia sez of Western Panjabi "Its validity as a genetic grouping is not certain. The terms "Lahnda" and "Western Punjabi" are exonyms employed by linguists, and are not used by the speakers themselves.") - -sche (discuss) 03:20, 26 April 2024 (UTC)
@-sche @Benwing2 Yes I would agree this is an unfinished merger and that there is no reason to treat these separately. I also agree "Punjabi" is a preferable spelling as it is still the spelling used officially in India and Pakistan even though it is the more archaic one.
There are maintenance issues inherent in maintaining information about languages which are written in multiple writing systems. I focus most of my editing on Wikidata lexicographical data and long-term I don't see much future in maintaining separate entries for the same words in a different writing system. For example, it should be possible to get all the definitions, references, spellings, and dialectal forms of inflections from a single data entity like this: https://www.wikidata.org/wiki/Lexeme:L686283
(Technically there is nothing stopping anyone from implementing this on Wiktionary now, I would just personally like to focus my own efforts on the Punjabi-language Wiktionaries first.) عُثمان (talk) 14:05, 26 April 2024 (UTC)
@-sche @عُثمان I see. I didn't realize that Punjabi can be written in either Gurmukhi or Shahmukhi, and that information is duplicated across the two scripts in a haphazard fashion. IMO the information needs to be in one of the scripts, and propagated to the other either using a {{tcl}}-like solution or through soft redirects. What do you think is the best way of doing this? Which script should be the canonical source of information? (I know this probably has political ramifications but I don't see any way around this, except to do something like put the information in neither script but in the Latin script. A {{tcl}}-like solution would make it appear, to the user at least, that both scripts are equal, although it's a bit trickier to implement than soft redirects. The only other alternative I can think of is to haphazardly choose one or the other script as the source of information on a per-word basis, a bit like what is done for English when there are British vs. US spelling differences; but this seems very messy to me.) Also, are there cases of lexical differences across the two scripts, or is it reasonable to have the same information displayed in all cases for both scripts (with labels to distinguish India vs. Pakistan uses)? Benwing2 (talk) 04:07, 27 April 2024 (UTC)
One other thing is that there are parallel inflection templates like {{pa-noun-f-c}} and {{pnb-noun-f-c}} that claim to be respectively the "Punjabi" and "Western Panjabi" versions of the same underlying templates but are really just the Gurmukhi and Shahmukhi equivalents. I propose to rename them e.g. {{pa-Guru-noun-f-c}} and {{pa-Arab-noun-f-c}} to reflect their actual purposes. Benwing2 (talk) 04:40, 27 April 2024 (UTC)
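The proposed renaming could be sketched as a small helper for a bot run. This is only an illustration under stated assumptions: the function name and the prefix table are hypothetical, and it assumes every pa-... template is Gurmukhi and every pnb-... template is Shahmukhi, as described above.

```python
import re

# Old language-code prefix -> proposed prefix with an explicit script code.
PREFIX_MAP = {"pa": "pa-Guru", "pnb": "pa-Arab"}

# Match "pa-..." or "pnb-..." names that do not already carry a script code.
_PREFIX_RE = re.compile(r"^(pnb|pa)-(?!Guru-|Arab-)(.+)$")


def proposed_name(template: str) -> str:
    """Return the proposed new template name, or the name unchanged if it does not apply."""
    match = _PREFIX_RE.match(template)
    if match is None:
        return template
    return f"{PREFIX_MAP[match.group(1)]}-{match.group(2)}"
```

Applying the helper twice leaves already-renamed templates untouched, which makes a bot run safe to repeat.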
@Benwing2 There are no lexical differences across the scripts and every word form can be represented in either script. A majority of speakers live in Pakistan, but an issue with treating one over the other as canonical comes up with homographs. The words حال (from Arabic via Persian حال) and ہال (from English hall) are both spelled ਹਾਲ in Gurmukhi, while the words ਤੂੰ and ਤੋਂ are both توں in Shahmukhi.
The inflection tables are overdue for a complete overhaul, particularly the verb ones, and the way verbs are lemmatized on enwiktionary is poorly suited to the language as it uses a form which doesn't exist for all verbs.
Regarding the language codes: Wikimedia has been using pa and pnb for the two scripts for a very long time and the respective Wiktionary projects use these codes. It doesn't matter too much, but pnb is more consistent with other Wikimedia projects. عُثمان (talk) 05:11, 27 April 2024 (UTC)
@عُثمان You've brought up several issues. My thoughts:
If we use a {{tcl}}-type solution and are consistent in treating one script as canonical, we can avoid the issue of homographs by using etym ID's of some sort (either Wikidata ID's or English descriptive words) for the words that become homographs in the canonical script but are separated in the other script. It's also possible to use one script as canonical except in the cases where that script has homographs and the other doesn't. As for using Wikidata lexicographical data, I don't know what that would entail. Can you explain how Wikidata works? From what I've seen, Wikidata does not do as good a job as English Wiktionary at representing lexicographical information, and the information structures are extremely different and likely incompatible (for example, AFAIK no one at the English Wiktionary has ever been consulted by Wikidata on how to best represent lexicographical information). I also think it would be difficult for contributors to use; they'd have to understand the Wikidata information structure as well as Wiktionary's structure, and overall I would rather keep all the information in Wiktionary itself if at all possible. However, maybe I'm wrong here.
As for pa vs. pnb, if we're trying to retire pnb at English Wiktionary in favor of pa, I don't really think it would be good to continue to use pa vs. pnb as a proxy for the different scripts when we already have script codes (Guru and (pa-)Arab) to directly represent the differences.
As for the way verbs are lemmatized being wrong, can you explain more? I am happy to write a bot script to move the terms as necessary, but I don't know Punjabi very well so I'd need a little help.
As for overhauling verbs, note that a couple of years ago I overhauled Hindi verbs and wrote a proper module to implement them. Maybe this module could serve as a basis for Punjabi? How different are Hindi and Punjabi verbs?
As for pa vs. pnb, if we're trying to retire pnb at English Wiktionary in favor of pa, I don't really think it would be good to continue to use pa vs. pnb as a proxy for the different scripts when we already have script codes (Guru and (pa-)Arab) to directly represent the differences. – I agree. It would be confusing for users, and it's just better to use the ISO-1 'pa' code. نعم البدل (talk) 15:49, 29 April 2024 (UTC)
@عُثمان: The inflection tables are overdue for a complete overhaul – I agree but your suggestion is really not feasible. There is no possible way to 1. attest every inflection, 2. record it at Wikidata for every single Punjabi lemma. نعم البدل (talk) 15:45, 29 April 2024 (UTC)
@نعم البدل I don't think it is possible to cover every verb and inflection, but at least the most common verbs and their inflections. The main technical hurdle I see is figuring out how to allow the table to expand as forms are added. I intend to write some documentation on the grammatical features I've been using on Wikidata in the near future. عُثمان (talk) 15:56, 29 April 2024 (UTC)
Which script should be the canonical source of information? I know this probably has political ramifications but I don't see any way around this.
Gurmukhi is used officially in India and the script is usually unambiguous, while AFAIK Shahmukhi has no official status in Pakistan and the script can be ambiguous. There are more speakers of Punjabi in Pakistan where the script is Shahmukhi.
How different are Hindi and Punjabi verbs?
Technically Hindi/Urdu and Punjabi are in different subgroups of Indo-Aryan, so one would expect there to be considerable differences. Although they are not mutually intelligible, a speaker of Hindi/Urdu who is not competent in Punjabi can understand quite a lot if certain aspects are understood.
/-t-/ in Hindi/Urdu becomes /-d-/ in Punjabi
मिलता (mil-t-ā) → ਮਿਲਦਾ (mil-d-ā)
Vowel-final verb stems have extra -n- appended to them
खाता (khā-t-ā) → ਖਾੰਦਾ (khā-n-d-ā)
The infinitive marker can either be -ਣਾ (-ṇā) or -ਨਾ (-nā). There is a rule to determine which one it should be based on the last sound of the verb stem.
The masculine progressive marker रहा (rahā) in Hindi/Urdu becomes ਰਿਹਾ (rihā).
I don't really think it would be good to continue to use pa vs. pnb as a proxy for the different scripts when we already have script codes
I agree.
@عُثمان As for using Wikidata lexicographical data, I don't know what that would entail. … I also think it would be difficult for contributors to use
(a) we should consider using Gurmukhi as the canonical script simply because it's likely to be more unambiguous than Shahmukhi, which (AFAIK) omits many vowel markings; number of speakers and official vs. non-official status seem like extra-linguistic variables that shouldn't play a role here; I will wait for more input though;
(b) the verbs are structurally similar enough that we could probably adapt the Hindi verb module to Punjabi without too much difficulty;
(c) we should use pa-Guru and pa-Arab; I will go ahead and implement that soon (NOTE: in the longer run this can be auto-determined by looking at the pagename or specified stem or lemma);
(d) we need more info on how Wikidata can help us;
(e) we need more info on why the current choice of form to lexicalize verbs on is incorrect and needs changing (per User:عُثمان).
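Point (c) mentions that the script could eventually be auto-determined from the pagename. A minimal sketch of that idea, assuming it is enough to check which Unicode block the characters fall into (the function name is hypothetical; the block ranges are the standard Gurmukhi, Arabic, and Arabic Supplement ranges):

```python
# Unicode block ranges (inclusive).
GURMUKHI_BLOCK = (0x0A00, 0x0A7F)
ARABIC_BLOCKS = ((0x0600, 0x06FF), (0x0750, 0x077F))  # Arabic, Arabic Supplement


def detect_script(pagename: str):
    """Return 'pa-Guru', 'pa-Arab', or None if the pagename is mixed or in neither script."""
    saw_guru = False
    saw_arab = False
    for ch in pagename:
        cp = ord(ch)
        if GURMUKHI_BLOCK[0] <= cp <= GURMUKHI_BLOCK[1]:
            saw_guru = True
        elif any(lo <= cp <= hi for lo, hi in ARABIC_BLOCKS):
            saw_arab = True
    if saw_guru and not saw_arab:
        return "pa-Guru"
    if saw_arab and not saw_guru:
        return "pa-Arab"
    return None
```

The Arabic Supplement range is included because the Shahmukhi letter ݨ lies outside the basic Arabic block.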
@Kutchkutch @Benwing2 I am happy to offer a more detailed answer to these questions when I have some time as there are a lot of details to cover. With all due respect however, there are some incorrect assumptions here before I get to that.
Gurmukhi is ambiguous in a way Shahmukhi is not. A common example is حال (from Arabic) and ہال (from English "hall"). These are both spelled ਹਾਲ in Gurmukhi. Gurmukhi was developed and standardized relatively recently historically, and the earliest texts recognizable as modern Punjabi are transliterations from Shahmukhi into Gurmukhi. About a third of the Punjabi lexicon is Perso-Arabic in origin and includes a large number of homophones which can only be distinguished in their Perso-Arabic spellings, and this is a large part of why there will likely always be two writing systems.
I would not consider Hindi/Urdu and Punjabi to be separate subgroups of Indo-Aryan. Together, Sindhi, Punjabi, and Hindi/Urdu form a continuum, and have the same underlying grammar with some significant differences in syntax and phonology. They would have been one language up until some point after the 11th century (relatively recently).
Hindi/Urdu has lost a large number of distinctions between verbal forms which have been retained in Sindhi and Punjabi. عُثمان (talk) 12:08, 28 April 2024 (UTC)
@Benwing2 @Kutchkutch The user interface stops me from entering longer messages, I may have to change a setting here.
Punjabi verbs do not have infinitives. There are gerunds in -ṇ or -n, and potential participles in -ṇā or -nā. In eastern dialects, the direct case forms of gerunds add ā but this is still dropped in the oblique case (this is a preservation of the neuter gender ending of verbal nouns in Sanskrit). In Hindi/Urdu the gerund and potential participles are identical and many grammars may treat them as the same and call them the "infinitive". Grammatically there are still two distinct forms however and for Punjabi, I use the oblique form of the gerund as the lemma for verbs in line with Salahuddin's Waddi Punjabi Lughat dictionary. (Which I would consider the highest quality dictionary of the language to date.)
There are separate participle forms which use -t- and -d- in Punjabi. Some dialects like Malwai still use both while others have lost the -t- forms. Pothohari uses -n- instead of -d- as a result of assimilation of the preceding nasal consonant. (-nd- can change to -d- or -n- and alternations originating from this cluster exist in all dialects, with no particular pattern. It is just a general feature of the phonology of the language.)
Punjabi verbs in both Gurmukhi and Shahmukhi are not spelled as pronounced. They use morphophonemic spellings which intentionally do not represent all the ways the stem pronunciation changes before suffixes as these changes happen automatically for speakers. Most common verbs also have suppletive participle forms which Hindi/Urdu has lost.
@عُثمان The issue I'm calling out with Shahmukhi is the vowel representation, which appears to have lots of ambiguities in it; please correct me if I'm wrong. I'm aware there are extra consonants in Shahmukhi that represent sounds distinguished in Arabic, so some use of distinguishing ID's will be necessary regardless. As for the differences you're calling out, they don't look like major impediments to adapting the Hindi verb module, which is written in a fairly general fashion (and in its turn was adapted from something like the Spanish verb module, where Hindi and Spanish differ far more than Hindi and Punjabi). Benwing2 (talk) 14:50, 28 April 2024 (UTC)
@Benwing2 The short version of the advantages of using Wikidata is that adapting another module would not be necessary–the forms already exist with grammatical features linked to them and these can be rendered in a table. If you take a look at the way I've modeled verbs in Sindhi, Hindi, and Punjabi, I could help explain any specific details of the forms (at the bottom, with L####-F## IDs). There are differences from the Hindi verb module as well.
Vowels can be represented with diacritics in Shahmukhi where necessary. In both systems, the actual vowel may often be represented as a sequence of vowels in order to distinguish spellings. ਆਉਣ آوݨ would be ਔਣ اَوݨ if it were spelled with the vowels as pronounced, but the spelling reflects the vowel used in the word stem instead of the one specific to the word form. The Gurmukhi diacritic ੱ is borrowed from the Arabic script as well. It is not like Devanagari in that it has been adapted to match the Arabic script in some ways. عُثمان (talk) 15:39, 28 April 2024 (UTC)
@عُثمان OK, you are proposing having no inflection module at all and instead pulling all the inflectional data from Wikidata. That definitely won't work. In that situation, there *IS* an inflectional module, it's just that it lives somewhere in an offline script we have no control over, and uploads inflections at a time of its own choosing for a set of lexemes of its own choosing. We need our own module where we can generate the inflections of *ANY* noun, verb or adjective whenever desired, and update it ourselves when there are issues. Benwing2 (talk) 06:57, 30 April 2024 (UTC)
@Benwing2 Well, Punjabi is not a language for which inflections follow the same patterns for any given part of speech. The word چاچا (paternal uncle) has a vocative case inflection but the word نانا (maternal grandfather) does not. (It would be considered rude for cultural reasons.)
If you are able to document these details and implement them in a module on English Wiktionary, I would encourage that. I am not asking or expecting anything here, I am just offering my perspective since I was pinged in this discussion. I am contributing this information to Wikidata, and intend to prioritize utilizing that information on the Punjabi Wiktionaries as there are already more editors on English Wiktionary. After all, the quality of the Punjabi material on English Wiktionary can only improve if there are more resources which are in Punjabi elsewhere, and the best sources and references are not in English. In order to develop multilingual solutions to these issues, I do find it necessary to utilize multilingual projects like Wikidata and Wikifunctions. عُثمان (talk) 12:23, 30 April 2024 (UTC)
The word چاچا (paternal uncle) has a vocative case inflection but the word نانا (maternal grandfather) does not - seems incorrect. Word0151 (talk) 12:39, 30 April 2024 (UTC)
@عُثمان Yes, pretty much all the inflection modules currently support various sorts of optional inflections, and it would not be hard to implement that in a Punjabi declension module. Benwing2 (talk) 08:15, 1 May 2024 (UTC)
(a) we should consider using Gurmukhi as the canonical script simply because it's likely to be more unambiguous than Shahmukhi, which (AFAIK) omits many vowel markings; number of speakers and official vs. non-official status seem like extra-linguistic variables that shouldn't play a role here; I will wait for more input though – I strongly disagree with this, both scripts are ambiguous in their own way, just as both offer clarity in their own way, and both can be codified. نعم البدل (talk) 15:53, 29 April 2024 (UTC)
@نعم البدل OK, can you supply a proposed solution then? As I mentioned above, there are political considerations here, and it is no surprise to me that Indians don't like canonicalizing on Shahmukhi while Pakistanis don't like canonicalizing on Gurmukhi. But we need a technical solution. It should be one of the following:
The simplest and cleanest is to use one script as the canonical source in all cases, and use etymid's whenever the canonical script merges the spelling of two lexemes that are separated in spelling in the other script. I suggested Gurmukhi because I believe it will result in fewer cases where etymid's need to be used; but I may be wrong here. We could certainly do an investigation to see which one will result in fewer etymid's.
The next simplest is to use one script as the canonical source except in cases where that script merges the spelling of two lexemes that are separated in spelling in the other script, in which case the other script assumes canonicality for those particular lexemes. This is definitely possible but may be a bit confusing because it won't always be obvious where the canonical version is located.
The next possibility is to literally split the difference; canonicalize the equivalent of A-M in one script and N-Z in the other. This is IMO the kind of compromise made to placate warring factions that ends up pleasing no one, but it's definitely relatively straightforward to implement and not too hard to determine where the canonical version of any given word should go.
The next possibility is to canonicalize randomly according to "whoever gets there first", a bit similarly to how things are done in English Wiktionary terms with distinct spellings. IMO this would be a disaster, would really end up pleasing no one, and could result in endless back-and-forth arguments about moving the canonical source.
The final possibility is what we have, which is to have no canonicalization and instead haphazardly duplicate the info across both scripts. This is by far the worst of all the suggestions; although it's also what we do for Serbo-Croatian, at least the Serbo-Croatian entries are mostly more or less in sync, while even a cursory glance at the Punjabi Gurmukhi and Shahmukhi entries shows they're wildly out of sync.
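The investigation mentioned under the first option (which script would need fewer etymid's) could be sketched roughly as follows. This is only an illustration: the function name is hypothetical, the input is assumed to be a list of (Gurmukhi, Shahmukhi) spelling pairs, one per lexeme, and the sample pairs are just the homographs cited earlier in this thread plus one unambiguous word.

```python
from collections import defaultdict


def etymids_needed(lexemes, canonical):
    """Count lexemes that would need an etymid if one script were canonical.

    lexemes: iterable of (gurmukhi, shahmukhi) spelling pairs.
    canonical: 0 to treat Gurmukhi as canonical, 1 for Shahmukhi.
    """
    groups = defaultdict(set)
    for pair in lexemes:
        # Group the other script's spellings under the canonical spelling.
        groups[pair[canonical]].add(pair[1 - canonical])
    # IDs are needed only where one canonical spelling covers several
    # spellings that the other script keeps apart.
    return sum(len(others) for others in groups.values() if len(others) > 1)


# Examples taken from the discussion above.
HAAL_GURU = "ਹਾਲ"
TOON_ARAB = "توں"
SAMPLE = [
    (HAAL_GURU, "حال"),   # from Arabic via Persian
    (HAAL_GURU, "ہال"),   # from English "hall"
    ("ਤੂੰ", TOON_ARAB),
    ("ਤੋਂ", TOON_ARAB),
    ("ਰਾਜ", "راج"),       # no homograph in either script
]
```

Running etymids_needed(SAMPLE, 0) and etymids_needed(SAMPLE, 1) on a full word list, rather than this toy sample, would give the comparison asked for above.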
@Benwing2: I understand your concern. What I don't get is why there can only be one canonical script, what would that achieve? Module:pa-Arab-translit/sandbox requires some changing which I was going to request User:Sameerhameedy to do soon, but otherwise Shahmukhi can be unambiguous with vocalised text. Although, I'm not suggesting that Shahmukhi should be the only canonical script. I obviously don't have much knowledge about how the website operates, as much as you do, but could you clarify why both scripts can't be used?
I do have a proposal, but I just want to make sure it ticks as many boxes as it can before I explain it, but it involves using both scripts, not just one as the default script. نعم البدل (talk) 16:42, 30 April 2024 (UTC)
@نعم البدل The basic organizing principle here is to avoid duplication. What I want to avoid is having information on a given lemma duplicated in two places, because then the two places invariably get out of sync (as has indeed happened with Punjabi). Can you outline how your proposal achieves this? Benwing2 (talk) 21:29, 30 April 2024 (UTC)
@Benwing2: On second thought, I'm really not sure whether it's even possible to pick a single script, even if you look past the politics behind it.
The ambiguity in Gurmukhi lies with the consonant nuqta (i.e. راج / ਰਾਜ (rāj) vs راز / ਰਾਜ਼ (rāz)), as well as with Perso-Arabic vocabulary (ض ز ظ ذ (ẓ z z̤ ẕ) etc.) – that's not an issue for Indian Punjabis, since for them it's normal and they commonly mix j and z, but it's a huge difference for Pakistanis.
The ambiguity in Shahmukhi lies with the vowels and typically only occurs with inflections or conjugations of lemmas. However, there can be potentially 3 different readings for one word: چَین (cain, “peace of mind”) / چین (cen, “chain”) / چِین (cīn, “China”).
My preference would be Shahmukhi, just because I believe that consonants shouldn't be ambiguous as they can change the entire context. That's less of a case with vowels. The only exception to that would be the letter ݨ (ṇ), since it's non-standard in Shahmukhi. People can usually interpret ambiguous vowels but consonants can be confusing. Plus, if Gurmukhi was the canonical script, then other languages couldn't be included (i.e. سُکّھ (sukkh) would include both Urdu and Punjabi, but ਸੁੱਖ (sukkha) (or lemmas in the Gurmukhi script) would only be populated by Punjabi).
In any case, whatever the script, let's say Shahmukhi, we would need to redesign the Punjabi lemmas in a way which is friendly for both scripts, and I'm thinking of a complex design which is similar to Japanese/Chinese/Thai lemmas. Instead of a headword, a different template which can accommodate Punjabi much more accurately (for instance transliterations of both scripts). I will try some designs and see how they look. نعم البدل (talk) 23:17, 30 April 2024 (UTC)
@نعم البدل If we pick (e.g.) Shahmukhi as the canonical place, what would happen in cases like چین is that there would be three Etymology N sections, and the different Gurmukhi spellings would select the correct section either by including the vowel diacritics in the {{tcl}} invocation or by using |id=. So for example {{tcl|pa|چَین}}, {{tcl|pa|چین}} and {{tcl|pa|چِین}}, or alternatively {{tcl|pa|چین|id=peace of mind}}, {{tcl|pa|چین|id=chain}}, {{tcl|pa|چین|id=China}}. For these purposes the fact that the middle one has no vowel markings is a bit annoying but not insurmountable. If we pick Gurmukhi, it would be similar, but in the other direction. I'm not sure what your point 3 about other languages being included refers to; the {{tcl}} call takes a language code so it wouldn't be affected in either direction by having other languages share the same page. As for redesigning the headwords, I don't think that's necessary; each script's headword would just list the inflections relevant to that script, and would specify the other spelling as one of the inflections (similarly to how corresponding lemmas are handled in Hindi and Urdu). Same goes for the Declension or Conjugation sections; they would just list the inflections in the script matching the headword. Benwing2 (talk) 23:49, 30 April 2024 (UTC)
I should add, instead of using {{tcl}} we'd probably actually use a Punjabi-specific {{pa-tcl}} template, since the transclusion logic will have to be a bit different than what the generic template does. Benwing2 (talk) 23:51, 30 April 2024 (UTC)
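The section-selection idea discussed above (pick one of several Etymology sections either by the vocalised spelling or by an explicit |id=) could be sketched roughly as follows. This is only an illustration in Python, not how {{tcl}} or the proposed {{pa-tcl}} is actually implemented; the `SECTIONS` data and the names `strip_diacritics` and `select_section` are hypothetical.

```python
import unicodedata

# Hypothetical canonical page with three senses sharing one bare spelling.
# "\u0686\u06cc\u0646" is the unvocalised Shahmukhi spelling (page title);
# "\u064e" (zabar) and "\u0650" (zer) are the short-vowel diacritics.
SECTIONS = [
    {"id": "peace of mind", "vocalised": "\u0686\u064e\u06cc\u0646", "gloss": "peace of mind"},
    {"id": "chain", "vocalised": "\u0686\u06cc\u0646", "gloss": "chain"},
    {"id": "China", "vocalised": "\u0686\u0650\u06cc\u0646", "gloss": "China"},
]


def strip_diacritics(text):
    """Drop combining marks to recover the page title from a vocalised form."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def select_section(spelling, section_id=None):
    """Select a section by explicit id if given, else by exact vocalised spelling."""
    for section in SECTIONS:
        if section_id is not None:
            if section["id"] == section_id:
                return section
        elif section["vocalised"] == spelling:
            return section
    return None


# Selecting by diacritics, then by id on the bare spelling.
print(select_section("\u0686\u0650\u06cc\u0646")["gloss"])
print(select_section("\u0686\u06cc\u0646", section_id="peace of mind")["gloss"])
```

Note that the unvocalised middle spelling can only be told apart from the page title by an id, which is the annoyance mentioned above.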
@عُثمان I'm almost clueless when it comes to Wikidata, but is it possible to somehow link Wikidata to this by adding the definitions of the Punjabi lemmas to Wikidata itself, and then importing the definitions into the lemmas of the separate scripts? For instance, let's say گَڈّی (gaḍḍī) and ਗੱਡੀ (gaḍḍī) are both linked to Wikidata item number '203879'. That item number would be linked to the headword, and subsequently the definitions expand, and whatever definition gets appended on Wiktionary gets appended to Wikidata, and vice versa?
That way multiple readings won't matter, the definitions would be the same and you can have multiple/different readings for both scripts. It would almost be like embedding or iframing the definitions from Wikidata but to Wiktionary for both scripts. نعم البدل (talk) 23:57, 30 April 2024 (UTC)
@نعم البدل Unfortunately it doesn't work that way. Lexical information doesn't get automatically transferred from Wiktionary to Wikidata. That would have to be done either manually or by some sort of offline bot script. Using Wikidata in this way would greatly complicate the whole process and I don't think it would be an improvement. Note that {{tcl}} doesn't use Wikidata, except that sometimes Wikidata ID's are used as the |id= parameter (which is otherwise arbitrary and can be anything you want, really, as long as it's unique). Benwing2 (talk) 00:52, 1 May 2024 (UTC)
Regarding English Wiktionary and Punjabi lemmas, a relatively uncontroversial idea would be to pull quotes from the "gloss quote" property which are sourced from Maya Singh's Punjabi–English dictionary, which is public domain and was compiled before partition and includes Gurmukhi spellings for a number of words now mainly only used in Pakistan. For example, https://www.wikidata.org/wiki/Lexeme:L1012820 has a quote "An owner, a proprietor" which would work as a definition on English Wiktionary.
Regarding your concerns @Benwing2 for what it is worth, users on English Wikimedia projects in general have been more resistant to utilizing Wikidata, but various non-English projects have been implementing integrations with it more or less successfully. It ultimately depends on what contributors are willing to do. For Punjabi, there is not going to be any solution which is easy or convenient for everybody and نعم knows well that Punjabi contributors are going to have disagreements with the way things are done no matter what. عُثمان (talk) 11:36, 1 May 2024 (UTC)
@عُثمان I took a look at the code that integrates the Bengali Wiktionary with Wikidata. Did you write this code? If not, who did? The main question is, how is the information at Lexeme:L301993 getting there? Currently I see no benefit to having such information in Wikidata instead of in the English Wiktionary directly; IMO it's an extra level of indirection that is likely to reduce our editor contribution as it's harder to input than directly into Wiktionary. However, I could change my mind if I have a clearer picture of the benefits. Benwing2 (talk) 21:50, 1 May 2024 (UTC)
@Benwing2: It would definitely make it harder for normal Wiktionary users to contribute to Punjabi lemmas, plus there are some other issues we'd have to discuss regarding Punjabi with @عُثمان, but if we want to achieve consistency with Punjabi lemmas, then there will have to be a centralised domain with the definitions that the respective scripts can retrieve from, and I don't really see how that could be achieved without something like Wikidata. نعم البدل (talk) 17:51, 2 May 2024 (UTC)
@نعم البدل It's not necessary to use Wikidata. Wiktionary code can read directly from any page in Wiktionary, so it's easy to read the definitions directly from wherever in Wiktionary we put them. The problem with Wikidata is that it uses a totally different structure from Wiktionary for its definitions, and so users would have to learn that structure only for Punjabi, whereas all other languages use the normal Wiktionary structure. Benwing2 (talk) 19:04, 2 May 2024 (UTC)
@Benwing2 The primary advantage of utilizing Wikidata (and part of the project’s purpose) is that it can be used by any Wiktionary project and not rely on the data being copied between them, and on other Wikimedia projects. Using it would be supplementary, and would not prevent users from also contributing to Wiktionary projects directly.
The code on Bengali Wiktionary was written by User:Mahir256 and uses a Lua module to access the Wikidata API.
I have had a similar discussion with one of the active Swahili contributors—I do think this approach has advantages for some languages more than others, and it may become more apparent where it is helpful should contributors to information on other language also express interest in this.
This discussion overall has been encouraging though, and the other suggestions here would definitely be improvements. عُثمان (talk) 13:19, 5 May 2024 (UTC)
@Benwing2 I'm of the opinion that the last option, messy as it may be, is the best. Granting the lemma status to one script over the other will be controversial. If Shahmukhi script is to be made the main script, it'll discourage Indian Punjabi speakers from contributing. The out of sync issue happens with Hindi/Urdu as well, but we have managed that for years. It isn't perfect, but then again, what is? -- 𝘗𝘶𝘭𝘪𝘮𝘢𝘪𝘺𝘪 (𝘵𝘢𝘭𝘬) 07:24, 2 May 2024 (UTC)
@Pulimaiyi IMO the status quo of messy duplication is significantly worse than having a clean dictionary with possibly some terms missing because some Indian Punjabi speakers may be discouraged from contributing. I don't actually believe in any case there will be as much of this as may be feared. The situation with Hindi and Urdu isn't quite the same because (a) there are a lot more contributors, so the extra effort of keeping the two lemmas in sync might actually get done, (b) there are (or may be) cases where a Hindi term actually has different meanings from its corresponding Urdu term, or where a Hindi term just doesn't exist in Urdu or vice-versa. (Possibly the latter happens in Punjabi as well but I imagine less.) As I mentioned above, if we e.g. choose to lemmatize on Shahmukhi for the shared terms, we can still choose to lemmatize on Gurmukhi for the India-only terms. But if you actually take a look at the current situation for Punjabi, the status quo is really bad; most entries are wildly out of sync, which wrongly implies that there are lexical differences between the paired terms. This results in a simply *WRONG* dictionary, which IMO is a lot worse than having fewer but correct entries (or even having no entries at all). Benwing2 (talk) 07:46, 2 May 2024 (UTC)
@Pulimaiyi Note that there is yet another option to appease the political sensitivities of Indian and Pakistani contributors, which is to lemmatize at the Roman transliteration. We can choose a transliteration that captures all the information inherent in both scripts, and lemmatize there; that way neither group will feel that they are slighted by having the lemma stored using the other group's script. If this is what it takes to get agreement among Indian and Pakistani contributors, I am fine with it. Benwing2 (talk) 07:55, 2 May 2024 (UTC)
Why can't the Shahmukhi and Gurmukhi scripts both be included in the same entry? The entry would then appear in the search box irrespective of the choice of script. I guess it is technically not possible in the MediaWiki software. Word0151 (talk) 08:57, 2 May 2024 (UTC)
@Word0151, the point being which script would they appear under. For instance should the definitions of 'car' appear under ਗੱਡੀ or گڈی? One script would have to be favoured over the other. نعم البدل (talk) 17:46, 2 May 2024 (UTC)
@Benwing2: I really don't want to disagree with the few options there are, but I vehemently oppose lemmatizing at the Roman transliteration, based on the fact it's Roman, alone, and you'll find, hopefully, that other editors are of the same opinion. The transliteration itself would be something that becomes a point of contention, considering Shahmukhi and Gurmukhi are both completely different scripts. We should be encouraging native scripts, regardless of the issue of multiple scripts, not start promoting Roman. نعم البدل (talk) 17:44, 2 May 2024 (UTC)
@نعم البدل IMO this is not promoting using Latin/Roman script for Punjabi but just finding a technical solution that will avoid political sensitivities. All definitions and inflections will still show under the respective Shahmukhi and Gurmukhi pages; only the editors will see the difference, not the users. But I am not wedded to this solution; any technically feasible solution is fine with me, and I don't have any particular preference as I'm neither Indian nor Pakistani. Benwing2 (talk) 19:10, 2 May 2024 (UTC)
@Benwing2: Perhaps I'm misinterpreting it? I'm under the impression that your proposal would turn the Punjabi language into the Pali language, where all definitions would come under the Roman terms, and it would appear under the respective scripts as "Gurmukhi/Shahmukhi form of ...", is that so? I mean, if it means that we can synchronise the Shahmukhi/Gurmukhi lemmas, then it's not the end of the world and I would agree to it, but it would have to be at the bottom of the list when it comes to preferences. It does make sense: "neither yours, neither mine". نعم البدل (talk) 19:40, 2 May 2024 (UTC)
@نعم البدل What I'm proposing is to put the canonical text under the Latin/Roman script, but then use {{pa-tcl}} to *auto-copy* the definitions under the Shahmukhi and Gurmukhi script headers, so it would appear to the user as if the text were directly on that page. It's similar to what's done with Wikidata on the Punjabi Wiktionary except that the canonical source of the text is on English Wiktionary (under the Latin script header) rather than in Wikidata. It wouldn't have anything like a "Gurmukhi/Shahmukhi form of ..." (a soft redirect), as is done for Pali. Benwing2 (talk) 19:51, 2 May 2024 (UTC)
@Benwing2: That's favourable, and it would probably be the most neutral way imo. You would have to discuss the Roman/Transliteration scheme, but other than that, it's a great way to sync Punjabi lemmas' definitions. Would TCL work both ways (i.e. would users be able to append or remove definitions from the Gurmukhi/Shahmukhi and it would sync it with the Roman/opposite script)? نعم البدل (talk) 02:07, 4 May 2024 (UTC)
@نعم البدل Unfortunately, no. All the definitions would have to be placed under the Roman script entry, and would be auto-copied into the Gurmukhi and Shahmukhi entries. If we were to choose Gurmukhi or Shahmukhi as the canonical source of definitions, it would be similar; all the definitions would have to be placed under that script, and they would be auto-copied to the other script. As for what transliteration scheme to choose, the most important thing is that it captures all the distinctions made in both scripts. In other words, if terms A and B are written differently in either script, they need to be written differently in the transliteration. Other than that, the exact details aren't so important but it should be as logical as possible for native speakers to use, and should ideally match as much as possible the actual transliteration scheme displayed to users when transliterating Shahmukhi and Gurmukhi entries (otherwise it will be confusing for editors). Benwing2 (talk) 02:52, 4 May 2024 (UTC)
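The requirement stated above (terms written differently in either script must also be written differently in the transliteration) can be checked mechanically. The following Python sketch is only an illustration under assumed data; `find_collisions` and the sample entries are hypothetical and no particular romanisation scheme is implied.

```python
from collections import defaultdict


def find_collisions(entries):
    """Return romanisations that are shared by entries differing in either script.

    entries: iterable of (romanisation, shahmukhi, gurmukhi) tuples.
    """
    seen = defaultdict(set)
    for roman, shahmukhi, gurmukhi in entries:
        seen[roman].add((shahmukhi, gurmukhi))
    # A romanisation is lossy if it stands for more than one script pair.
    return {roman: pairs for roman, pairs in seen.items() if len(pairs) > 1}


# A scheme that keeps j and z apart has no collisions for this pair.
distinct = [("rāj", "راج", "ਰਾਜ"), ("rāz", "راز", "ਰਾਜ਼")]
print(find_collisions(distinct))

# A hypothetical lossy scheme that merges z into j would be rejected.
lossy = [("raj", "راج", "ਰਾਜ"), ("raj", "راز", "ਰਾਜ਼")]
print(sorted(find_collisions(lossy)))
```

Run over the full set of paired entries, a check like this would flag any place where the chosen scheme loses a distinction that one of the native scripts makes.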
The issue I'm calling out with Shahmukhi is the vowel representation, which appears to have lots of ambiguities in it
Yes, I was referring to vowel representation rather than consonants when I said Gurmukhi is…usually unambiguous…Shahmukhi…can be ambiguous. For a module (or a person not used to reading Shahmukhi), it may be unclear what the short vowels are in a word without the diacritics, which are usually omitted. For the sake of comparison, Gurmukhi could be considered to be comparable to the Devanagari script for Hindi, while Shahmukhi is almost the same as the script used for Urdu.
See this paper:
Title: Orthographic characteristics speed Hindi word naming but slow Urdu naming: evidence from Hindi/Urdu biliterates
Abstract: Two primed naming experiments tested the orthographic depth hypothesis in skilled biliterate readers of Hindi and Urdu…Hindi is a highly transparent script, whereas Urdu is more opaque.
The orthographic shallowness of Hindi appears to encourage reliance on systematic grapheme to phoneme conversion even among highly skilled, adult readers. By contrast, skilled readers of Urdu exhibit little influence of form-based and phonological primes…Further, both experiments in the current study furnish evidence for a processing cost levied by Urdu orthography—despite being native readers of Urdu, our participants were slower and less accurate at responding to words presented in Urdu than the same words in Hindi script. This pattern is attributable to the greater graphemic complexity of Urdu orthography, and is reminiscent of a similar finding reported for Arabic (Eviatar & Ibrahim, 2004; Ibrahim et al., 2002), whose orthography forms the basis for Urdu script.
@Kutchkutch Both Punjabi and Urdu dictionaries use diacritics in headwords. However, short vowel distinctions do not differentiate inflectional suffixes in Urdu while they do in Punjabi. دیاں and دِیاں have different readings in Punjabi but would be read the same way in Urdu. عُثمان (talk) 11:36, 29 April 2024 (UTC)
My 2c, regarding which script to "canonicalize": the fact that more than twice as many people use Shahmukhi than use Gurmukhi (and hence, we may expect that most people editing Punjabi entries will be using Shahmukhi) means that choosing to canonicalize Gurmukhi and just {{tcl}} things onto Shahmukhi entries would be unhelpful for a greater number of editors than the reverse (putting content at Shahmukhi and {{tcl}}ing it onto Gurmukhi). How many words are ambiguous in Shahmukhi, vs how many are ambiguous in Gurmukhi? Are we only talking about, say, 1% of words in the language, or are we talking about 25%? Unless the amount of ambiguity is both large and lopsided (say, if 25% of words are ambiguous in Shahmukhi and only 1% are ambiguous in Gurmukhi), I would not consider that to even come close to outweighing the benefit of "canonicalizing" the script that more than a supermajority (close to 71% if my math is right) of speakers use and hence would be most likely to edit. That's my 2c. I do like the idea of transcluding the definitions from whichever form we "canonicalize" to the other form (and I would support doing this for Serbo-Croatian, too, if anyone wants to start a poll about that later...). - -sche(discuss)18:43, 30 April 2024 (UTC)
@-sche Thank you for your opinion. What you say indeed makes sense; my concern has been that Shahmukhi is in fact significantly more ambiguous than Gurmukhi, but I don't actually know. If we're able to make use of Shahmukhi short vowel diacritics, potentially this will resolve the issue of ambiguity. As for Serbo-Croatian, by all means yes we should use a {{tcl}}-type solution. I think in the case of Serbo-Croatian, (1) we need to use the ijekavian standard because in general you can map ijekavian -> ekavian but not the other way around; (2) the two scripts map nearly one-to-one exactly but Cyrillic may be slightly less ambiguous because it has single letters њ and љ corresponding to the Latin digraphs nj and lj, which (when combined with the ijekavian standard) means you can unambiguously map Cyrillic -> Latin but not the other way around. For these reasons I'd support canonicalizing on Cyrillic ijekavian, although using your number-of-speakers argument we should maybe prefer Latin because Croats and Bosniaks use it exclusively while Serbians use both (and according to Wikipedia, view the Latin script as more neutral, so the use of the Latin script is increasing while the use of Cyrillic is decreasing). Note that there is precedent for choosing what's technically better vs. number-of-speakers better with Chinese, where we consistently lemmatize at the traditional form and soft-redirect simplified forms to their traditional form despite the number-of-speakers argument going overwhelmingly in the other direction. Benwing2 (talk) 20:54, 30 April 2024 (UTC)
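The one-directional mapping mentioned above for Serbo-Croatian can be illustrated with a small Python sketch. It covers lowercase letters only, the function name `cyrillic_to_latin` is hypothetical, and the example words were chosen for illustration rather than taken from the discussion.

```python
# Lowercase Serbian Cyrillic to Latin; each Cyrillic letter has exactly one output.
# "\u0458" is the Cyrillic letter je, written as an escape so it is not
# confused with Latin j.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "\u0458": "j", "к": "k", "л": "l",
    "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c",
    "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(word):
    """Map a lowercase Cyrillic word to Latin, letter by letter."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in word)


# The single letter nje and the sequence en + je both come out as "nj",
# so the Latin spelling alone cannot tell them apart.
print(cyrillic_to_latin("коњ"))                    # single letter nje
print(cyrillic_to_latin("ин\u0458екци\u0458а"))    # en followed by je
```

Going from Latin back to Cyrillic would need a word list or other extra information to decide whether a given "nj" or "lj" is a digraph or two separate letters, which is the asymmetry behind the argument for Cyrillic as the canonical form.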
(1) we need to use the ijekavian standard because in general you can map ijekavian -> ekavian but not the other way around – If you apply this to Punjabi, then I am positive that we would have to use Shahmukhi and not Gurmukhi as the canonical script, since you can transliterate the vowels from Shahmukhi to Gurmukhi and clear any ambiguity, but consonants which use the Arabic letters in Perso-Arab loanwords in the Shahmukhi script cannot be interpreted based on the Gurmukhi lemma alone. نعم البدل (talk) 17:58, 2 May 2024 (UTC)
@نعم البدل After reading the previous points that you have made, I do agree that both scripts (even Gurmukhi) have their own shortcomings. Also, if I understand this analogy to Serbo-Croatian correctly, I can see how Shahmukhi and not Gurmukhi may need to be the canonical script if a single native script has to be chosen (instead of the romanisation proposed above). The diacritics would have to be compulsory, and the manner in which the diacritics are used would have to be clear. Kutchkutch (talk) 06:17, 4 May 2024 (UTC)
shikshapatri shloka
Well it should be noted what kind of absurd linguistic claims this particular user holds and the kind of edits he makes. نعم البدل also seems to edit Punjabi entries, but both of them can be a little biased. Word0151 (talk) 17:11, 28 April 2024 (UTC)
@Word0151 There is nothing absurd about the edit you linked, and I am not sure what bias(es) you think I have that would affect the issue at hand. I would be happy to discuss any particular issues or claims but in this case tobacco was introduced to the Indian subcontinent after the end of Sanskrit, and that fact is not related to Punjabi at all. عُثمان (talk) 19:20, 28 April 2024 (UTC)
@Word0151 For this, I would request a custom language code for “Shikshapatri Bhashya” to represent the Gujarati-mixed Sanskrit register used in this work. There are multiple Sanskrit or Sanskritized liturgical registers which each have their own vocabulary and word senses. The New Indo Aryan words which appear in the Sahaskriti of the Adi Granth would not necessarily have the same meanings or forms as those which appear in the Shikshapatri. Again however, not related to the topic of this thread at all. عُثمان (talk) 21:11, 28 April 2024 (UTC)
We have two pages listing Wiktionary languages: WT:LOL (which lists all L2 languages) and WT:LOL/S (which lists all etym-only languages plus various L2 languages meeting certain "special" criteria). When I encounter a particular lect and don't know whether it's an L2 or etym-only language, it's annoying to have to search in two places. I propose expanding WT:LOL to include both L2 and etym-only languages. This should not significantly impact the size of WT:LOL because there are 8,218 L2 languages and only 548 etym-only languages currently. Benwing2 (talk) 00:10, 26 April 2024 (UTC)
@Thadh @Vininn126 Yes I was thinking of putting them under a separate header and including a column for the parent language (I think that's what Vininn was requesting?). Benwing2 (talk) 19:30, 26 April 2024 (UTC)
Something along those lines, just to make it clear what it is an etymology language of (I hope that syntax makes sense...). Vininn126 (talk) 19:32, 26 April 2024 (UTC)
People often put things like archaic, figurative, colloquial and the like in qualifiers using {{q}}. Earlier I proposed adding an {{lq}} template that took a language code and processed labels just like {{lb}}, but without categorizing. I'm coming round instead to the idea that what we should do is this:
Add a language code to {{a}} and make it work like {{lq}}, which won't then be needed (i.e. process its parameters like labels, but don't categorize);
make {{q}} process its parameters like labels, but only for lang-independent labels, and don't categorize. Currently all {{q}} does is display its stuff in italics surrounded by parens, but with this change it would auto-link terms like archaic, figurative and colloquial to the appropriate glossary entry, and would canonicalize certain labels unless preceded by an ! (which forces a label to display as-is, but still allows it to be linked appropriately). For example, {{q|hapax}} would display as (hapax legomenon) and be linked to the glossary, {{q|historic}} would display as (historical) with a glossary link, etc. It would also auto-recognize special parameters like _ and ;, just like {{lb}} does.
Thoughts?
If people think we still need a simple "just italicize and add parens" template, we could use {{i}} for that purpose; currently {{i}} is an alias of {{q}} but they could be split, so that {{q}} does label processing and {{i}} doesn't. My instinct is this isn't necessary, but it is a possibility if people think it may be needed in certain cases. Benwing2 (talk) 07:49, 26 April 2024 (UTC)
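The label processing proposed above for {{q}} (canonicalize known labels, link them to the glossary, and let a leading ! force the text to display as-is while still linking) might look roughly like this Python sketch. It is an assumption-laden illustration, not the actual label module: the `ALIASES` and `GLOSSARY` tables, the function names and the link format are hypothetical, and italics and the special parameters _ and ; are left out.

```python
# Hypothetical label data for illustration only.
ALIASES = {"hapax": "hapax legomenon", "historic": "historical"}
GLOSSARY = {"hapax legomenon", "historical", "archaic", "figurative", "colloquial"}


def process_label(raw):
    """Canonicalize one label and link it to the glossary when it is known."""
    force = raw.startswith("!")          # "!" means: display exactly as typed
    typed = raw[1:] if force else raw
    target = ALIASES.get(typed, typed)   # canonical form, used as the link target
    display = typed if force else target
    if target in GLOSSARY:
        return f"[[Appendix:Glossary#{target}|{display}]]"
    return display                       # unknown labels pass through unlinked


def render_qualifier(*labels):
    """Join processed labels inside parentheses, as a qualifier would display."""
    return "(" + ", ".join(process_label(label) for label in labels) + ")"


print(render_qualifier("hapax"))
print(render_qualifier("!historic"))
print(render_qualifier("regional"))
```

Unknown text passes through unchanged, so in this sketch a qualifier that is not a recognized label would still display as it does today.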
This would be nice, and I agree better than a new template. Some people might be upset about the functionality change - either {{i}} could be made to take on the old functions, or I suppose the new ones. Not sure what the downsides are. But I think synching these things up would still be good. Vininn126 (talk) 07:55, 26 April 2024 (UTC)
The point of a qualifier is to point you towards an entry where the term will already be explained. Saving the one click you need in any case for all the other information that is on that page is just not worth it in my opinion. I would probably be fine with a tooltip, but links in my opinion are too distracting. Thadh (talk) 11:39, 26 April 2024 (UTC)
In the past, {{q}} was used to just present the text without parsing or expanding it. If you need to prop a door open, a rock will do just fine- no need to determine its exact location to the micrometer, the ambient temperature or the direction of the forces acting on it. I see that it now uses a module, which seems silly. I'm sure that the 121 instances of the template at "a" are contributing to the fact that it keeps drifting in and out of CAT:E. I would prefer to revert this template to its previous dumb state, and create a separate lua-powered version for more specialized things. Given its history, I would worry about retroactively changing the display in thousands of entries in subtle ways that would require manual checking to spot, as well as tripping up contributors who have been using this template for eons and have no reason to check the documentation to discover that it's been changed. Don't get me wrong- there's definitely a place for the Swiss-Army-knife approach- but we should have a few exceptions set aside, just in case. I don't think anyone wants a template used on every single character, with a separate data submodule for each codepoint, but we should be careful to avoid drifting too far in that direction, anyway. Chuck Entz (talk) 18:10, 26 April 2024 (UTC)
Revisiting the Deleter role proposal
I'd like to revisit the proposal that was voted down in Wiktionary:Votes/2021-12/Deleter role. Clearly whatever we have right now when it comes to the timely deletion of entries found at Category:Candidates_for_speedy_deletion is not working, if there are words there from last year that have still not been deleted. We shouldn't have 209 pages currently listed there. It makes the enforcement of rules and policies like WT:DEROGATORY, RFV, & RFD useless if users know that their entries won't be deleted in the end. I've had to start resorting to blanking entries so that the unverified information at the very least won't be there to stay. As such, I do think that it'd be helpful for a deleter role, so that these entries can be dealt with in a timely manner, barring increased activity from current admin. AG202 (talk) 14:29, 26 April 2024 (UTC)
@AG202: On some of the pages that you marked for speedy deletion, was there ever an RFV discussion? From what I understand WT:DEROGATORY only means that the RFV discussions should be shorter, not that any derogatory term can be marked for deletion without discussion.
On the more general topic: The issues as outlined in the original vote didn't disappear. If anything, most if not all of the people who would ever qualify for the "deleter role" should be made admins. Thadh (talk) 14:39, 26 April 2024 (UTC)
@Thadh: No, they do not need to be sent to RFV, provided that it's been within 2 weeks of entry creation. Per WT:DEROGATORY, I marked the entries, such as 13%, with the derogatory template before the 2 weeks was up. However, cites were not added before the deadline, so the entry was marked for speedy deletion. This is how the policy was understood to work at the vote, and how it's been implemented, until the recent lag in deletions. AG202 (talk) 14:55, 26 April 2024 (UTC)
@AG202: So basically nobody sees the entry anywhere except the creator and the one who marks it for deletion? That seems a bit overly strict, that just means IPs can't create derogatory terms without immediately adding quotes basically. Thadh (talk) 17:14, 26 April 2024 (UTC)
The wording at WT:DEROGATORY reads to me as AG202 explained it, and the discussion at the vote also said a motivation for the policy was to avoid cluttering RFD or RFV or wasting editor time and attention. What the policy is seems pretty straightforward to me (whether it is 'too strict' or not is a different matter, but admins should simply carry out the policy unless it is revised through consensus). As for a deleter role, I don't think there should be one, as I don't like the idea of giving special deletion powers to editors who wouldn't be suitable admins.--Urszag (talk) 17:24, 26 April 2024 (UTC)
The wording per Wiktionary:Votes/pl-2022-06/Attestation criteria for derogatory terms reads to me that the quotes need to exist and only if the term is nominated for being of dubious language permeation (criteria accepted after Wiktionary:Votes/pl-2022-01/Handling of citations that do not meet our current definition of permanently archived) then deletion is sped up by a deadline rather than the timeframes afforded to less offensive words, so offensive IPs are deterred to create their words in the first place because from motivational psychology we know rewards or punishment need to follow with not too great a time lag, whereas taking space is what certain kinds of trolls desire. You can’t point the finger at your mutt pooping your carpet and scold him a “bad dog” one month after the offence.
Editor time and attention must be “wasted” either way to some degree, and the terms are actually more likely to be, and more legitimately, deleted by an admin if you posted a banns ascertaining him of the matter. Legitimacy comes through participation, through procedure, you know it. Fay Freak (talk) 21:25, 26 April 2024 (UTC)
I mean, that's exactly what we talked about at the vote. We had a problem with IPs creating derogatory terms and them cluttering RFV/RFD. This was created to help limit that. AG202 (talk) 17:38, 26 April 2024 (UTC)
To me it seems odd to have a group of users who are simultaneously trusted enough to delete pages but not trusted enough to become admins. Do you have yourself in mind for the role? In any case, I agree that we need more admins patrolling Category:Candidates for speedy deletion. Ioaxxere (talk) 19:19, 26 April 2024 (UTC)
Hasn't the WMF explicitly said that they will absolutely not allow any non-admins to view deleted pages, for security reasons, on any project? Did that only apply to Wikipedia? Or am I just mis-remembering altogether? This was so long ago that I'm not sure I could find what I read. —Soap— 12:32, 27 April 2024 (UTC)
I found this from 2008, though it may have been a refusal to open access to deleted pages for editors as a whole rather than for some specific group. —Soap— 12:35, 27 April 2024 (UTC)
I honestly think we need to split up the admin role based on what tasks they do. Some might be better at dealing with vandalism than banning, for example. I support this proposal. CitationsFreak (talk) 01:44, 28 April 2024 (UTC)
I personally am very reluctant to delete things in CAT:CSD without careful checking, especially where I don't speak the language involved. Cleaning out the category is tiresome for this reason. Moreover, occasionally pages will be marked for speedy deletion by users who have a poor track record, and I am reluctant to act on these requests.
I think the solution is actually to rename the category to "Candidates for imminent deletion". Then we can maybe have a separate speedy deletion category for incontrovertible rubbish - pages that could never be legitimate - like promotional user pages from non-contributors, entries consisting of just keyboard smashes, etc. This, that and the other (talk) 10:13, 28 April 2024 (UTC)
Oh, there's another class of tricky cases when processing CSDs, exemplified by sàng-kèe-ḿ--á. A user creates a legitimate-looking entry, and then a different user comes along (in this case, all of 10 minutes later) and marks it with {{d}}. I am very reluctant to delete such an entry without hearing the entry creator's side of the story. This, that and the other (talk) 10:31, 28 April 2024 (UTC)
I had changed the Kyrgyz transliteration module Module:ky-translit to a simpler transliteration system that uses some of the Common Turkic Alphabet's letters differently. The reason is to declutter the text from excessive diacritics and simplify the alphabet.
I am currently working on the Module:ky-IPA to help avoid confusion that might've been caused by my change. And nobody else seems to be working on the Kyrgyz language as of right now, so I am not interfering in anyone's work. Bababashqort (talk) 05:53, 27 April 2024 (UTC)
Hello! The previous ky-translit transliterated words in accordance with the Common Turkic Alphabet; however, this made some words look very unfamiliar. For example, жылдыз was rendered as cıldız, while the more convenient and simple variant would be jıldız. Sure, that probably needs some extra learning (namely, the use of the letter C for Ч), and that's why I made the IPA module: to have both the translit and the IPA reading, and to help readers familiarize themselves with the new translit system. I think that such a system is much more convenient for Kyrgyz.
This caused no problems so far, especially considering that the correspondence chart is available at Wiktionary:Kyrgyz transliteration, and that my chart hasn't been edited since December 2023 without being reverted back to my version, so I wasn't interfering in anyone's work.
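To make the difference under discussion concrete, here is a minimal Python sketch of a character-by-character transliteration with two alternative tables. It is not the code of Module:ky-translit; the names `CTA_STYLE`, `SIMPLIFIED` and `transliterate` are hypothetical, and the tables cover only the letters of the example word жылдыз as rendered in the comment above.

```python
# Hypothetical, partial tables: only the letters needed for the example word.
# Previous behaviour as described above (Common Turkic Alphabet style): ж -> c
CTA_STYLE = {"ж": "c", "ы": "ı", "л": "l", "д": "d", "з": "z"}

# Changed behaviour as described above: ж -> j, everything else unchanged.
SIMPLIFIED = {**CTA_STYLE, "ж": "j"}


def transliterate(text, table):
    """Replace each character via the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("жылдыз", CTA_STYLE))   # cıldız
print(transliterate("жылдыз", SIMPLIFIED))  # jıldız
```

The only point illustrated is that the two systems differ in a single mapping for this word; the full correspondence chart is the one at Wiktionary:Kyrgyz transliteration.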
@Bababashqort: Nothing convenient and simple in your example. j being /d͡ʒ/ is specific to English. It could easily be mistaken for /ʒ/, if not /j/. Some Arabic transcription systems use ǰ, j with háček, which should be unambiguous and dovetails with /t͡s/ being č. A mere c is annoying (could be /t͡s/) and currently limited to Indo-Aryan languages; can be the voiced pharyngeal fricative as in Somali, and the ejective palatoalveolar affricate /tʃʼ/ in Oromo. It is also contradictory to claim that adherence to the Common Turkic Alphabet makes words look “unfamiliar”. You weren’t interfering because nobody did Kyrgyz; when somebody does occasionally then he expects transcription to be as known from other Turkic languages. Fay Freak (talk) 17:49, 1 May 2024 (UTC)
That's precisely why I also made an IPA module, but if the collective decision is against it, I am ok with a revert of my changes. Bababashqort (talk) 09:28, 3 May 2024 (UTC)
@Benwing2, RcAlex36 Are the English language terms Hong Kong and Xianggang alternative forms or synonyms? I can go either way. But I think in the majority opinion on Wiktionary, it's likely near impossible to say they are alternative forms. So if that's true, then let's review recent edits concerning Waichow/Huizhou and Yeungkong/Yangjiang. What happened there with the edits from WingerBot? If Xianggang and Hong Kong are synonyms, why aren't Waichow and Huizhou synonyms? Check my logic and extend it to all other relevant cases. Thanks! --Geographyinitiative (talk) 14:34, 27 April 2024 (UTC)
I now realize this is perhaps too complex an issue for many editors to fully grasp ("it's all Chinese to me"). I'll go ahead and start repairing the accidental damage from the WingerBot edits as I come across it, and just let me know if you disagree with anything I do. I think this may not have been intentional, so I'll just try to correct things as I see them. Here are the corrections of Waichow/Huizhou-- Correction and Yeungkong/Yangjiang-- Correction. Dairen is not an alternative form of Dalian, it is a synonym. Xianggang is not an alternative form of Hong Kong, it is a synonym. Waichow is not an alternative form of Huizhou, it is a synonym. Yeungkong is not an alternative form of Yangjiang, it is a synonym. Amnok is not an alternative form of Yalu, it is a synonym. --Geographyinitiative (talk) 17:27, 27 April 2024 (UTC)(Modified)
@Geographyinitiative My mistake, I made some assumptions that turned out to be false in some cases. All of these were cases where the bot's only contribution was to push the changes I made manually (that's what "manually assisted" means). I can point you to all the edits I made in that particular run, if it would help. Benwing2 (talk) 18:10, 27 April 2024 (UTC)
I apologize. I am very sensitive on these things because I feel that Wikipedia and Wiktionary have had disastrous coverage of this area of terminology over the past 20 years. I am also constantly afraid to be banned because of my bad personality. Thank you for your kind help. Geographyinitiative (talk) 18:18, 27 April 2024 (UTC)
For Wade-Giles (in some cases it looks like I changed to use {{alt}} without changing the header from ==Synonyms== to ==Alternative forms==, oops): Malan, Wenzhou, Anning, Long County. I can't find where I changed Dairen, Xianggang or Amnok to Alternative forms, maybe they were there before? Benwing2 (talk) 18:26, 27 April 2024 (UTC)
problem adding translations into Mandarin
I've been encountering a problem when entering translations into Mandarin (of English entries). It comes up with this error message: "Please use a valid script code. Available script codes for this language are Hants, Latn, Bopo." But if I put in "Hants" as the script code it says "Please use a valid script code(e.g. fa-Arab, Deva, Polyt)". Does anyone have any idea what's going on here? ---> Tooironic (talk) 21:55, 27 April 2024 (UTC)
Currently, the only internal distinctions we make in Nivkh are geographic, with varieties put into "Sakhalin" and "Amur" groups, which is an oversimplification and genetically inaccurate. Per Fortescue (2016), the Eastern Sakhalin and Southern Sakhalin varieties are not mutually intelligible with the Amur variety; per Gruzdeva (1998), the Northern Sakhalin and Amur varieties are most closely related, with the Eastern Sakhalin variety mutually unintelligible with the Amur variety and the Southern Sakhalin variety even further removed from all three. At the very least, Nighvng (Southern Sakhalin, Eastern Sakhalin) should be split off from Nivkh (leaving Northern Sakhalin and Amur), which is the classification that Glottolog uses. -saph 🍏16:58, 28 April 2024 (UTC)
Language splits don't need votes; a simple poll here will be enough. Unless there are any counterarguments, this can be resolved within a few weeks. Thadh (talk) 17:57, 28 April 2024 (UTC)
I have no objection to splitting into Nighvng and Nivkh. I notice some lemmas are labeled only as "Sakhalin" (see CAT:Sakhalin Nivkh) and others aren't labeled. Someone should go through the lemmas and label them all as one or more of "Amur", "North Sakhalin", "East Sakhalin" and "South Sakhalin"; after that, it should be fairly easy to split them, the same way we split Khanty and Mansi. Benwing2 (talk) 23:34, 30 April 2024 (UTC)
I support this again. Not sure about the name though, as Nighvng doesn't really see much, if any, use in English, though it is the endonym for East Sakhalin Nivkhs (idk about South Sakhalin, which is no longer spoken).
One issue with labeling varieties I've come across is that the Taksami 1983 dictionary isn't explicit about it. He includes the following note: "При заглавных словах амурского диалекта через запятую без учета алфавита приводятся соответствующие по значениям слова восточносахалинского диалекта и других городов нивхского языка." (essentially meaning "along with the headwords of the Amur dialect, words with corresponding meanings from the East Sakhalin dialect and other cities are given, separated by a comma and without taking alphabet into consideration".) So in any given entry it’s not clear if the word(s) after the comma is/are from the East Sakhalin variety or from "other cities where Nivkh is spoken". In such cases I always marked those words as just "Sakhalin". (Example: вурдь, вурд, выйзд – the first is clearly Amur, the second is almost certainly East Sakhalin, but the third? who knows – thus simply "Sakhalin".)
I also have two dictionaries by Gashilova that I know are East Sakhalin, and a few other materials where I should be able to determine which variety they are. But ultimately I don't know what to do with the Taksami data, as I don't want to assign forms to "Nighvng" if they're actually from North or West Sakhalin. Dylanvt (talk) 22:28, 10 May 2024 (UTC)
Nighvng is the name used by Gruzdeva.
Could we corroborate the Taksami words by looking at other dictionaries to see where they're placed? -saph 🍏22:32, 10 May 2024 (UTC)
That's exactly what I wanted to say, too. But maybe it's not a bad idea to start using it too (and probably update the Wikipedia article). The issue, though, is that although the idioms are mutually unintelligible, Nivkhs consider themselves one people (correct me if I'm wrong), so it does bring some degree of confusion. Is it dealable? Seems like a question of habit. Kaarkemhveel (talk) 07:56, 11 May 2024 (UTC)
Incorrect hyphenation patterns - help requested from editors
I've written some Python scripts to check that the hyphenation patterns expressed in {{hyph}} match the actual word (a·b·c matches abc, a·b·x does not). This is often not the case because of misspellings or copy&paste errors. (There's also some ambiguity on how to use this template, which I'll raise separately.)
Of course there are complications because words can legitimately change during hyphenation (for example, "ck" became "k·k" under old German rules and some languages appear to lose (or even gain?) diacritics). I've tried to create hyphenation rules for some languages but improvements and additions are welcome.
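As a rough illustration of the check described above (not tbm's actual scripts, which are not shown here), a minimal Python sketch could look like the following. The separator `·`, the `RULES` table and the function name `hyphenation_matches` are assumptions made for this example; the only language rule included is the old German "ck" → "k·k" case mentioned above, and the sample words are illustrative.

```python
import unicodedata

SEP = "·"  # separator as written in the examples above (an assumption)

# Hypothetical per-language rules: (form in the hyphenation, form in the word).
RULES = {
    "de": [("k" + SEP + "k", "ck")],  # old German rule: "ck" hyphenated as "k·k"
}


def _nfc(text):
    """Normalise so precomposed and combining diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


def hyphenation_matches(word, pattern, lang=None):
    """Return True if the hyphenation pattern spells out the word.

    The plain pattern is tried first; language rules are only a fallback,
    so a legitimate "k·k" (e.g. at a compound boundary) is not rewritten.
    """
    candidates = [pattern]
    for hyphenated, plain in RULES.get(lang, []):
        candidates.append(pattern.replace(hyphenated, plain))
    target = _nfc(word)
    return any(_nfc(c.replace(SEP, "")) == target for c in candidates)


print(hyphenation_matches("abc", "a·b·c"))                # True
print(hyphenation_matches("abc", "a·b·x"))                # False
print(hyphenation_matches("Zucker", "Zuk·ker", lang="de"))  # True
```

Languages that lose or gain diacritics during hyphenation would need further rules; those are left out because the thread does not spell them out.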
In addition to {{hyph}}, there's {{pl-p}}, {{fi-p}}, {{es-pr}}, {{it-pr}} and probably others I missed (pointers welcome).
I've created page User:Tbm/QA/Hyphenation listing discrepancies for some languages. I'd appreciate if editors of the listed languages could take a look and fix any genuine problems. If some words listed there are correct, please leave a message on my talk page so I can fix my scripts.
Thank you! (And thanks to Vininn126, Surjection and Tollef Salemann for already helping with Polish, Finnish and Norwegian, respectively). tbm (talk) 07:00, 29 April 2024 (UTC)
I've gone ahead and fixed all the Dutch entries, except qowed since I have no official reference as to how it should be hyphenated.
I'm not sure what the right solution is (although the last two seem best to me). I see possible options:
Retain whitespace or break at whitespace (the last two options)
Introduce "_" to mark whitespace explicitly
Don't list hyphenations for such words. I mentioned this issue on Discord once and someone thought such phrases should not have hyphenation information at all -- only the individual words should. But this doesn't work for e.g. Herceg Novi because there's no entry for "Herceg".
2) Case-mismatch: sometimes the case doesn't match, especially the first letter of the hyphenation is lower case when the entry starts with an upper case. Is this okay or should that be fixed? (I fixed some of those and it's easy enough to do.)
3) Dash: what about words that contain dashes?
Some hyphenate at the dash (which seems most correct to me) e.g. dix-huitième and קרעבס־עסער(krebs-eser)
Some replace dash with space (why?) e.g. Flandre-Occidentale (the space is not shown)
I hope we can discuss and reach consensus on how to use {{hyph}} in these cases and update the documentation and entries accordingly. tbm (talk) 07:28, 29 April 2024 (UTC)
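To help sort a discrepancy list into the categories raised above (whitespace, case mismatch, dash), a small Python sketch along these lines could be used. It is only an illustration under stated assumptions: the separator `·`, the category names and the function `classify_mismatch` are made up for this example, and the sample strings are placeholders rather than real entries.

```python
SEP = "·"  # separator as written in the examples above (an assumption)


def _strip_joiners(text):
    """Remove spaces and hyphens so only the letters are compared."""
    return text.replace("-", "").replace(" ", "")


def classify_mismatch(entry, pattern):
    """Classify how a hyphenation pattern differs from the entry name."""
    joined = pattern.replace(SEP, "")
    if joined == entry:
        return "ok"
    if joined.casefold() == entry.casefold():
        return "case"        # e.g. lower-case pattern for a capitalised entry
    if joined.replace(" ", "") == entry.replace(" ", ""):
        return "whitespace"  # space in the entry dropped or broken at
    if _strip_joiners(joined) == _strip_joiners(entry):
        return "dash"        # dash dropped, or replaced by a space
    return "other"           # likely a misspelling or copy-and-paste error


for entry, pattern in [("abc", "a·b·c"), ("Abc", "a·b·c"),
                       ("ab cd", "ab·cd"), ("ab-cd", "ab cd"),
                       ("abc", "a·b·x")]:
    print(entry, pattern, classify_mismatch(entry, pattern))
```

Separating these classes from genuine errors would let whatever convention is agreed for {{hyph}} be applied to each class in bulk.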
First of all, one should be careful about whether the intention was hyphenation (orthographical) or syllabification (phonological). I don't think syllabification should exist for multiword terms, nor hyphenation. Case should match the entry name. Dashes are in a weird space for me and I wonder what others think. Vininn126 (talk) 07:36, 29 April 2024 (UTC)
@Vininn126 I agree with you about not syllabifying or hyphenating multiword terms. Same goes for adding rhymes. When I wrote Module:es-pronunc, I made it so that rhymes aren't added to (a) multiword terms, (b) hyphenated terms, (c) prefixes, (d) unstressed suffixes, (e) words without vowels (e.g. single letters). For syllabification/hyphenation of hyphenated terms, it's a bit less clear, esp. when one of the parts isn't a full word (e.g. Austro-Hungarian), and in some languages, hyphens serve purposes unrelated to their standard use of joining words into a compound (e.g. in some Philippine languages, a hyphen following a consonant and preceding a vowel indicates a glottal stop rather than a component boundary). Benwing2 (talk) 23:16, 30 April 2024 (UTC)
@Benwing2 I think @Surjection would agree about rhymes on multiterm entries, and even go so far as to say there shouldn't be pronunciation on them generally. I can see the argument; I think there's more value for languages with a high degree of morphology, personally. Vininn126 (talk) 05:08, 1 May 2024 (UTC)
@Vininn126 I find multiword pronunciations useful for various reasons, e.g. there are unstressed words whose pronunciation wouldn't be obvious from their single-word pronunciation, and there are various sandhi phenomena in certain languages such as liaison, elision, enjambment and schwa deletion in French; syntactic gemination in Italian and Finnish; clitic stress-stealing in various Slavic languages; initial mutations in all modern Celtic languages; etc. Some of these phenomena are quite complex and hard for people to work out even if they have an exhaustive description of them. Even in English there are weird cases like the pronunciation of "Spanish teacher", where the meaning shifts depending where the stress is (a "Spanish teacher" is someone who teaches Spanish, while a "Spanish teacher" is a teacher who happens to be Spanish), and the pronunciation of high school, which (in my dialect) sounds like /ˈhɐɪs.kul/ referring to a type of school but /ˈhɑɪ.ˈskul/ in its SOP meaning of "school that is high (e.g. in altitude)". Benwing2 (talk) 05:25, 1 May 2024 (UTC)
I'd agree on that point, as well. I also think even some predictable interword processes could sometimes be useful. Again probably a discussion for a different thread - but it seems generally syllabification/hyphenation is off the table. Vininn126 (talk) 05:27, 1 May 2024 (UTC)
Re the suggestion of not hyphenating multiword terms: I assume that is only in cases where the individual words (and their hyphenations) exist in the language and thus can be looked up, whereas we would still need to give hyphenation when that is not the case? For example, if enjo and kosai don't exist in English except in the phrase English enjo kosai, then the hyphenation would need to be given there. In such a case, I would just keep the space like tattie cake; this seems like the clearest approach to me, compared to velika nužda. - -sche(discuss)18:40, 7 May 2024 (UTC)
The code in Module:es-pronunc (which I wrote a while ago, and more recently borrowed for use in Module:tl-pronunciation) does actually auto-hyphenate words with spaces in them, and does it like tattie cake. What it doesn't do is generate any rhymes for multiword terms, although User:Ysrael214 and I have agreed on displaying rhymes for multiword terms in Tagalog (based on the last word), but not categorizing them. Benwing2 (talk) 21:42, 7 May 2024 (UTC)