Why did you revert my edits? The words are written in different scripts (Latin, Cyrillic, and one Inuit). At least be consistent and add each of them on every single page. And as for "aza", I only removed duplicates. What you just did makes no sense at all. Shumkichi (talk) 08:46, 2 January 2023 (UTC)
@Shumkichi Because that is irrelevant. We put things that look similar there, regardless of whether they’re the same script. I then reverted all the ones where you removed anything. Theknightwho (talk) 12:00, 2 January 2023 (UTC)
Macedonian Ќ ќ, Ѓ ѓ etc.
Hello Theknightwho! Many words in the Macedonian entries (in the headword lines, in declension tables, in derived and related terms, etc.) are messed up: they link to wrong or nonexistent words. I assume this is related to the problem pointed out above, about the Serbo-Croatian "ć". In Macedonian, Ќќ /c/ and Ѓѓ /ɟ/ are separate letters, different from Кк /k/ and Гг /g/, and the diacritic shouldn't be stripped away. It shouldn't be stripped away from Ѐѐ (сѐ, нѐ vs. се, не) and Ѝѝ (ѝ vs. и; see ◌̀#Macedonian) either. The diacritic mark should be stripped away only in the accented letters: А́ а́ Е́ е́ И́ и́ О́ о́ У́ у́ Л́ л́ Р́ р́; see ◌́#Macedonian. Gorec (talk) 11:55, 6 January 2023 (UTC)
Hello @Theknightwho. I want to report another problem, which I think is related to the above. When we add Macedonian translations that contain accented letters (А́ а́ Е́ е́ И́ и́ О́ о́ У́ у́ Л́ л́ Р́ р́), the autogenerated edit summaries that appear in the revision history or in contributions always show redlinks, even if those entries already exist. For instance, there is an entry забележителен, but in the autogenerated edit summary "забележи́телен" is shown as a redlink (t+mk:забележи́телен (Assisted)). Can this be fixed somehow? Thank you. --Gorec (talk) 20:49, 23 January 2023 (UTC)
@Горец Thanks for this. It's definitely a related issue, but I think it's a bug in the translation plugin (which is something I don't know anything about). I would put something on the WT:Grease Pit. Theknightwho (talk) 20:52, 23 January 2023 (UTC)
Number list errors in Korean
Could you please take a look at 일곱째 and the other similar pages in CAT:E? The only other person who has recently edited a module invoked on these entries is Benwing2, but their edits were to a function related to Korean usage examples, whereas yours were to entry name normalization code, which seems much more likely to be relevant. Maybe the code is trying to compare NFC-normalized Korean to NFD-normalized Korean, or similar. 70.172.194.25 18:58, 6 January 2023 (UTC)
Since I saw you have recently been editing the expensive translation pages, I decided to run the script I wrote to check for a very specific kind of error: the use of {{tt}} or {{tt+}} outside of {{multitrans}}, which leaves ⦃⦃unreadable code like this¦¦¦¦¦¦¦¦¦⦄⦄ in the visible page output. The script detected problems on the following pages:
Good catch! Not sure what happened with go, but love was down to me forgetting to include the first checktrans section. The savings with these are immense - just look at water/translations. Theoretically, we could put one invoke at the top of the page (taking no arguments), put the whole rest of the page inside <nowiki>…</nowiki>, and then process the whole page from a single call. There would be downsides to that (e.g. one error breaking everything), but it might be something to consider for very large pages. The current lite templates are a huge faff for tiny gain. Theknightwho (talk) 23:34, 12 January 2023 (UTC)
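A minimal sketch of that single-invoke idea, assuming the whole page body arrives as one <nowiki>…</nowiki> argument (this mirrors how {{multitrans}} unstrips its content; the module shape is illustrative, not the actual implementation):
<pre>
local p = {}

function p.show(frame)
    -- Recover the raw wikitext from the <nowiki>…</nowiki> strip marker,
    -- then expand the whole page in a single preprocess call.
    local body = mw.text.unstripNoWiki(frame.args[1])
    return frame:preprocess(body)
end

return p
</pre>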
I'm not sure. Given {{head}} is probably the major culprit (meaning there's little use in doing this for each language separately), it would be a problem for sure. Theknightwho (talk) 23:47, 12 January 2023 (UTC)
In this minimal example, it seems like wrapping ==Headers== in a template that just spits out its input makes the links go away. I think that's probably a particularly bad thing on pages that are really large, because those are the ones where the links are most useful, vs. having to deal with the full page's wikicode when you just want to make a minor correction. It could still be worth it overall as a last resort if the alternative is having visible output break. 70.172.194.25 23:59, 12 January 2023 (UTC)
Yes, you're right. I've realised it's possible to use Erutuon's {{multitrans-nowiki}} this way, and have changed the sandbox to a (somewhat mangled) version of ⽉. It's removed all the edit links there as well. The way around this might be a JavaScript gadget, though. Theknightwho (talk) 00:03, 13 January 2023 (UTC)
Such a gadget would have to not only add the links when action=view, but also figure out when action=edit which section is supposed to be the one being edited (using some ad-hoc URL parameter from the links), remove everything before and after that section from the edit window, and then upon saving add everything before and after back to the saved page content. And deal with "show changes" / "preview" too. It would be doable but it seems likely to be inelegant/brittle. 70.172.194.25 00:13, 13 January 2023 (UTC)
One way to do it might be to use the Labeled Section Transclusion extension (which we should have installed, as it's the way Wikisource works). The model would be as follows (a rough sketch in Lua follows the list):
Put the source code for the very large page on a subpage. Everything must be nowiki-fied, except for the <section> tags and the headings we want to have edit buttons.
Use <section> tags to divide up the subpage as appropriate. A section for each language, and one for each etymology (if more than one) is a reasonable compromise.
A single invoke on the main page processes all of the transcluded sections in one go.
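A rough sketch of step 3 under those assumptions, using the Labeled Section Transclusion parser function; the subpage name and section labels here are invented, and the nowiki-unstripping step is glossed over:
<pre>
local p = {}

function p.main(frame)
    local out = {}
    for _, section in ipairs{ "English", "Japanese" } do
        -- {{#lst:}} transcludes one labelled section of the subpage;
        -- expanding it here keeps everything inside a single invoke.
        out[#out + 1] = frame:preprocess(
            "{{#lst:water/translations/source|" .. section .. "}}")
    end
    return table.concat(out, "\n")
end

return p
</pre>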
That's creative. It might solve the link problem, but from how you described it to me, a pretty technical user, I anticipate a significant cost of extra confusion for average editors, possibly to the extent that just losing the links would be better (and there may be other complications I can't think of, as I'm not intricately familiar with that extension). IDK. 70.172.194.25 00:39, 13 January 2023 (UTC)
Yes, I agree. It's inelegant at best. I did try forcing the parser to process a template before it processed the argument (i.e. wikitext) being passed to that template, and a small part of me still thinks it might be possible with some very careful manoeuvring. What I tried was:
Passing the wikitext we want to preprocess as a sole parameter to a template (as with {{multitrans}}).
Setting the main invoke as the sole parameter name of the template (i.e. {{template|{{#invoke:}}=...}}). This results in the parameter name being the desired output.
Have a second invoke within the template itself which outputs the only key in frame:getParent().args.
The reason for doing it this convoluted way was to see if the parser would process parameter names before the parameters themselves, which seemed to have an outside chance of letting me dynamically deploy strip markers before the argument itself got processed, but I couldn't seem to get it to work. Ideally, this would be a way to let us use nowiki tags without placing them on the page (thereby reducing confusion), but it would of course be less efficient than doing the whole page all in one go. Theknightwho (talk) 01:00, 13 January 2023 (UTC)
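A sketch of that second invoke; reading the parameter name rather than its value is the whole trick:
<pre>
local p = {}

-- The desired output was smuggled in as the *name* of the template's
-- sole parameter, so return the only key in the parent frame's args.
function p.key(frame)
    for name in pairs(frame:getParent().args) do
        return name
    end
end

return p
</pre>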
Hiding vandalistic user names
For future reference: it doesn't do much good to hide a vandal's user name if it's still there in the rollback message. You can't hide the edit if it's the current one - but you only have to hide the edit summary. I prefer to hide as much as possible of that kind of vandalism so they have nothing to show for their efforts - it's like painting over graffiti. Chuck Entz (talk) 05:49, 16 January 2023 (UTC)
I know you characterized my preferred style as "pointless" but I thought I would offer an explanation for the reasoning behind it without protesting the changes you made. I'm open to better understanding how I can adhere to a style that's consistent with the expectations of the wiktionary community. Just let me know.
You're a right foul git. Why are you reverting my edits without explanation? I gave a quotation from the Oxford Dictionary, whilst you gave no reason or evidence. Qiu Ennan (talk) 10:52, 25 January 2023 (UTC)
Because per cent is also dated in the UK, and as a native speaker of British English who lives in the UK, I feel like I'm in a reasonably good position to say that. I also cannot find that quote in the OED. Are you sure that's where you got it? Theknightwho (talk) 10:59, 25 January 2023 (UTC)
Per cent is not dated in the UK: it's used by the BBC, amongst many other sources. Despite being a native British English speaker (purportedly), you are also prone to mistakes, and so you should still provide a source instead of relying on anecdote, especially when removing other people's contributions. Also, look at the dictionary: per cent is the first form, whilst percent is labelled US. https://www.google.com/search?client=firefox-b-d&q=per+cent
I have looked at “the dictionary”. Have you seen this one? Note that it’s also a British dictionary. And no - your quote is from StackExchange, as the link on that page doesn’t actually work.
(Historical note: I blocked "Qiu Ennan" because he seems to speak some made-up Anglish and has fought previously against real everyday native English speakers like myself and Theknightwho, and against the truth of how the language is spoken. This harms the project and is just really fricking annoying.) Equinox ◑ 02:34, 28 January 2023 (UTC)
I think you wrongly reverted my edit on muff diver.
The South African and West African usages have the same meaning. Igbo/Nigerian scammers operate around the world, including in South Africa; however, Mugu is an Igbo word. Explain at once why you have separated obviously identical definitions without providing an edit summary. Beaneater00 (talk) 22:09, 27 January 2023 (UTC)
Because in South Africa, it's an alternative form of moegoe, which comes from Afrikaans (and before that, the etymology is uncertain). It's not at all clear that we can simply merge them together, especially given that moegoe doesn't only mean "fool", as the definition is a little more complex than that. There's no harm in keeping them as they are, as it shows additional nuance. Theknightwho (talk) 22:15, 27 January 2023 (UTC)
Who says that moegoe comes from Afrikaans? The other language listed is an urban cant. The Afrikaans usage could just as well reflect Cape Coloured or urban Black African use. This 'moegoe' does not exist on af.wiktionary; I searched long and hard for an Afrikaans dictionary online. Do you know anyone who speaks Afrikaans? The only things I found on the web for this «moegoe» were verbatim copies of your Wiktionary entry. w:André Brink, who you cite, was a linguistic 'reformer' and anti-Apartheid writer. Perhaps he injected Bantu vocabulary into Afrikaans and it's not an indigenous Afrikaans word that evolved separately, with a separate range of meaning. The definitions which you have cited are apparently garnered from his book quote. Why do you have so many comments on your page from today and yesterday about your contentious edits? It is not apparent to me that you have done anything more than read the entry that already exists and take it as divine word. Beaneater00 (talk) 23:20, 27 January 2023 (UTC)
I've already told you where to go if you want to dispute the etymology. If you do that without some kind of consensus, I'll just revert you and stop you from editing the page, because you'll be edit warring. Theknightwho (talk) 04:28, 28 January 2023 (UTC)
you gotta use sandboxes
Hi. I see yet another error, 'Lua error in Module:data_consistency_check at line 807: attempt to index local 'frame' (a nil value)'. I've asked you before to use sandbox modules, you really need to use them even though they take more work than modifying the production modules directly. Benwing2 (talk) 19:44, 1 February 2023 (UTC)
@Benwing2 I naively assumed the data consistency check could only be called in one way. In any event, we don't want the new check to show on most pages, as it's really verbose. Theknightwho (talk) 19:56, 1 February 2023 (UTC)
OK but you should still be using sandbox modules, which you seem resistant to doing. Essentially you need to copy the module itself to a userspace module, along with any calling modules, then test in the userspace module, and then push all relevant modules to production at the same time. User:Erutuon has a different way of doing this that may be more clever. Benwing2 (talk) 20:20, 1 February 2023 (UTC)
@Benwing2 What I'm struggling with is predicting where errors will crop up. I spent about an hour writing that new function, and it worked on the page I expected it to (and which I wrongly assumed used the only way of calling it). In other cases, I've previewed changes on several pages where I expected any errors would manifest before pushing something, before realising the error happens on (say) 1% of pages. It's frustrating, because I'm not sure that a sandbox would solve that as I don't just push things blindly, but I'm not sure of the best way to test things like this short of having a functioning mirror, as often it's just not possible to know where the errors will happen. Theknightwho (talk) 20:28, 1 February 2023 (UTC)
I see. Some suggestions: (1) Program defensively if possible; e.g. in this case, it's possible to fetch the current frame using mw.getCurrentFrame(), so you don't need to pass it in. (2) You can often work out all the places that call a module using Special:WhatLinksHere and being selective with the 'Namespace' dropdown. (3) If all else fails, you can always search through the dump file; I've done that several times when I need to change a function I know is called from various places. The dump file is about 1G compressed and it takes about 13 minutes to search through it entirely using a Python script on my (rather old) MacBook Pro (if you're interested in my scripts, let me know; they are in but this repository is huge and needs cleanup, which I can do). You can also extract out just the Module space code, which is a lot smaller, and search through that. Benwing2 (talk) 20:44, 1 February 2023 (UTC)
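Suggestion (1) might look like this; the wrapper name is hypothetical:
<pre>
local function preprocess(text, frame)
    -- Fall back to the current frame if the caller didn't pass one in.
    frame = frame or mw.getCurrentFrame()
    return frame:preprocess(text)
end
</pre>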
@Benwing2 Thanks. I had the same idea re the frame, as all I really needed was to make sure preprocessing happened. That's a very good idea re the dump file - I'll check that out.
I've considered setting up a testpage which would ideally throw errors if any other page could throw one. That might be quite tricky to do, but it shouldn't be too hard to set up something that covers all the usual bases in one place. Theknightwho (talk) 20:57, 1 February 2023 (UTC)
Hi. I am not active on the English Wiktionary; I am active mostly on the German Wiktionary and very little on the Czech Wiktionary, where I recently had a short discussion with Dan Polansky. I'm not his friend, and I don't want to be his advocate or stand up for him. First of all, I would like to clarify which side is right: whether the blocks are correct and reasonable and Dan Polansky deserves them, or whether they are exaggerated and unfair.
As a reason for the last block you stated: Continued engaging in personal attacks, despite numerous prior warnings and blocks. This sounds rather general, so I would like to politely ask you if you could provide diffs that show his personal attacks leading to the current block and also at least some of the numerous prior warnings he was given. Thank you very much. Amsavatar (talk) 22:34, 1 February 2023 (UTC)
@Amsavatar Hello. Just to give you an idea of the problem we’ve had with Dan, I think it would be good for you to look at his talk page here. Note the numerous sections where he makes personal attacks against other users (, , , , , ), including the one which he made today () which I blocked him for. Calling unwitting users “semi-intelligent aliens”, albeit amusing, is not acceptable, and is likely to drive contributors away given how frequently he does this sort of thing. Some of the other personal attacks, however, are considerably more serious, and display a level of hostility that is fundamentally at odds with a collaborative project.
Please also note his extensive block log, which includes a (later reduced) indefinite block which refers to this discussion, during which Dan receives the block and which gives a little background. The discussion was continued here. The numerous attacks on his talk page have all been made since he returned, and I have given him blocks of doubling length (1 week, 2 weeks etc) on the basis that he knows very well what the problem is, and simply refuses to acknowledge that his behaviour is a problem. He has been told hundreds (if not thousands) of times to cut it out.
If you insist, I will be happy to get you as many diffs as you want, but I’m going to tag @Chuck Entz, @Surjection, @Benwing2, @Vininn126, @-sche who can all attest to his obstructionism, rudeness, inability to participate in good faith and general detriment to the health of the project. Like me, they are all administrators (with Chuck and Surjection also being bureaucrats). Theknightwho (talk) 00:10, 2 February 2023 (UTC)
IMO Dan is in a class of his own. Along with what is mentioned above, Dan refuses both to cooperate with others and acknowledge even slightly the problematic nature of his edits. He also has a lot of energy and will engage in endless arguments, swamping pages like WT:Beer Parlour. I believe blocking him is warranted and I don't see any likeliness of him improving over time, esp. as he has been a very-long-time contributor and has been problematic for the entire time. Benwing2 (talk) 00:26, 2 February 2023 (UTC)
"including the one which he made today () which I blocked him for"
That does not contain anything that rises to a personal attack, however. His sputtering on his talk page is many things, but bannable it is not. A ban of one month for that is evidently excessive. ←₰-→ Lingo Bingo Dingo (talk) 18:52, 2 February 2023 (UTC)
@Lingo Bingo Dingo In the wider context of everything else, it's a continuation of exactly the same thing. I have merely been doubling the length of the block, as recommended by Equinox in the discussion on -sche's talkpage. Every single time Dan has returned, he has immediately started engaging in exactly the same behaviour as before. Since his return on 30 January, he:
Continued to push to discount people's votes when he doesn't like them:.
It's all just a continuation of the exact same behaviour he was engaging in before. In addition to those, there are plenty of new edits which, when taken together, show that his attitude has not changed one bit. Before his return, I also came across this thread on the Czech Wiktionary which had some quite revealing comments (which I've used Google Translate on, so they're a bit squiffy in places):
In the English Wiktionary, there are countless liars and liars, as well as bullies; it's no honey. They are a disgrace to Anglo-Saxon culture.
If it seems insane, it's probably because it is insane, and the administrators of English Wiktionary appear to be a bunch of incompetent morons. In addition, they appear to be a bunch of liars, liars, people unfit to keep correct corporate or official records, fraudsters, etc. Well, again, it's human, all too human. Better than Hitler, Stalin, Mao, Hegel, and other allegorical vehicles.
That the number of powerful people displaying grossly objectionable behavior has increased on the English Wiktionary in recent months and years is another matter; there were always quite a few bad people, but they didn't have that much power.
Can you seriously argue that those seem like the statements of someone who is going to work positively and productively with other users here? Theknightwho (talk) 19:35, 2 February 2023 (UTC)
Hey, I noticed that the {{sic}} template is not displayed properly in entries (see the quote in up the creek for example). I suspect it is caused by one of your recent edits in important modules. Could you please take a look at it? (Sorry if I'm wrong and there is another reason for the error.) Thanks, Einstein2 (talk) 23:54, 2 February 2023 (UTC)
There are cases where using {{inh}} without specifying a term does not add the and corresponding request category, e.g. {{inh|en|enm||*ar}}. I'm not sure if it's just when {{{4}}} / {{{alt}}} is given but {{{3}}} isn't, or if there are more complicated rules. If we were talking about {{der}} / {{bor}} then there would be things like {{der|ro|sla}} (specifying a language family avoids the term request), but that wouldn't apply to {{inh}} AFAICT. 70.172.194.25 23:15, 3 February 2023 (UTC)
Good point. I'm currently about to publish a big memory-saving edit on dar, and this came up re the Middle Persian borrowing of Tat dar. It seems as though, if {{{tr}}} is given, that also changes things, as it requests the native script instead (which is something I should have double-checked). I think this should all be possible to account for with the following logic (sketched in Lua below):
If {{{3}}} is not given, then check if {{{4}}} has been.
If yes, do nothing.
If not, check if {{{tr}}} is also given.
If yes, display , demand {{{parentlangname}}} and categorise as a native script request.
If not, display , demand {{{parentlangname}}} and categorise as a term request.
There may be yet more rules (which I will check), but that should at least cover the current use-case and the most common one. We'll deal with the {{der}} and {{bor}} issue when we get to it, I think, as dealing with language families obviously makes things a little trickier.
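The same branching expressed in Lua for clarity (the lite template itself would use parser functions; the category names are illustrative, not the real ones):
<pre>
local function request_category(term, alt, tr)
    if (term and term ~= "") or (alt and alt ~= "") then
        return nil                        -- something to display: no request
    elseif tr and tr ~= "" then
        return "native script requests"   -- translit only: ask for the script
    else
        return "term requests"            -- nothing given: ask for the term
    end
end
</pre>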
I've incorporated the new change into {{m-lite}} instead, which also covers various other edge cases. It gives the wrong output if a transliteration is given for an empty term in a language that uses the Latin script, but that's an extreme edge case. Plus, {{m}} and {{inh}} behave differently under those circumstances as well, which doesn't seem quite right.
Slightly confusingly, the argument parentlangname works differently in {{inh-lite}} and {{m-lite}}: in {{inh-lite}}, it works as mentioned above. In {{m-lite}}, langname fills that purpose already. As such, parentlangname is a boolean that changes the error message accordingly. Theknightwho (talk) 01:33, 4 February 2023 (UTC)
As you’ve probably spotted, I refactored this in a way that avoids duplicating the lists of lang codes in each module. I only encountered one language family in use - sla - but I’m not satisfied with the bodge I did to get it working properly. I think I’m going to have to add the family codes as the second-last branch of the main switch, which unfortunately means duplicating those ones specifically. However, this should make the code easier to read. Plus, I’ll update the consistency checker to account for it.
Oh, and another thing: I’ll see if it’s feasible to use a similar method to incorporate proto-languages. That way, we can drop the normal modules altogether from some of the large pages, which should help. Theknightwho (talk) 16:14, 4 February 2023 (UTC)
Thanks for implementing that. I had no idea there was a way to check the first character of a string using ParserFunctions. 70.172.194.25 11:11, 6 February 2023 (UTC)
It was a faff, but much cleverer people than me managed to work out how to do it before we had Lua, so I took some old revisions of the Wikipedia string functions and used those as a starting point. The fact that memory usage scales logarithmically with each subsequent call made me suspect that removing all calls into Lua link templates would cause a big drop in memory usage, and thankfully it seems that was correct! The really tricky bit was actually removing the asterisk for the link.
I've decided against using it, as the effect on performance is intolerable for very little benefit, unfortunately. I'll delete the pages shortly. Theknightwho (talk) 23:49, 19 February 2023 (UTC)
I didn't check this edit thoroughly, but the Slovene link to eden should have the diacritic stripped, and if there's one error like that then there could be more. 70.172.194.25 10:47, 6 February 2023 (UTC)
What about making a template that we could subst to generate the proper invocation to m-lite? I'm thinking something like {{subst:m-lite-generator|ru|ино́й}} => {{m-lite|ru|sc=Cyrl|иной|ино́й|tr=inój}}. Might not be worth it. Just a potential idea. 70.172.194.25 10:52, 6 February 2023 (UTC)
Thanks. That subst idea sounds like a good time-saver.
I've been thinking of setting up templates similar to langname-lite which will let us do this stuff semi-automatically. It'll be clunky (as each term will need its own entry in the switch table), but it would mean we could take advantage of a data consistency check. That way, if the primary transliteration/entryname/sortkey function changes, any that need changing will get flagged up as well. Plus, they'll only need to be changed in one place. Theknightwho (talk) 11:00, 6 February 2023 (UTC)
I've thought of the possibility of transliterations (etc.) getting out of sync too, but I assumed it would be a minor issue, assuming the old transliteration was still "good enough". For entry names that would be a much more significant problem as the link wouldn't work at all. Checking these things would be a good idea. That said, IDK about putting everything in another big lookup table. I probably would've gone for a solution like scanning every page using {{m-lite}} and checking them using an external script. The table would have the benefits of reducing wikitext clutter and duplication, however, so I'm not opposed to it. It just seems like it might increase the effort needed to convert a page to using lite templates. 70.172.194.25 11:20, 6 February 2023 (UTC)
Another consideration is that so far we've been replacing things like {{m|la|faciō}} by {{m-lite|la|facio|faciō}}, but also replacing {{m|la|facio|facere}} by {{m-lite|la|facio|facere}}, so the entry name is not always a normalization of the display name. And similarly, some transliterations could potentially be manual overrides over the defaults, etc. 70.172.194.25 11:31, 6 February 2023 (UTC)
I think with sortkeys it's probably a must, because I don't think we want weird stuff like private use characters to be in "public facing" markup. You're right about it potentially being more faff, though. I think with overrides, that could be dealt with by having some kind of override exit point that allows manual specification without causing a data consistency issue. I'll have a think about how to do it.
I have a feeling that scanning every page is likely to cause Module:data consistency check to start throwing memory errors, as apparently that's one of the main reasons the Chinese modules are such memory hogs (though I don't know quite how many pages are involved at any given time). Theknightwho (talk) 11:33, 6 February 2023 (UTC)
Maybe it would help to make m-lite/new just return the original m template call when the argument contains a link. It would be possible to make a smarter implementation but not sure it's worth it. 70.172.194.25 01:57, 20 February 2023 (UTC)
Thanks. It's probably possible to just subdivide the string with a gmatch or something, though the wikitext would start to get very messy. It's also "destructive", in that it's non-trivial to convert it back again if we don't need the lite templates anymore. Theknightwho (talk) 02:04, 20 February 2023 (UTC)
This is an unfortunate side-effect of the changes I made to Module:languages to escape formatting characters. I'll update Module:lite-new to resolve them into their normal forms. Not sure why it wasn't displaying correctly, though - possibly some kind of double-escape effect going on. Theknightwho (talk) 03:13, 20 February 2023 (UTC)
"I have a feeling that scanning every page is likely to cause Module:data consistency check to start throwing memory errors, as apparently that's one of the main reasons the Chinese modules are such memory hogs (though I don't know quite how many pages are involved at any given time)."
My idea was more along the lines of implementing a Lua module to check one page for correctness, and then using an off-wiki Python (or $FAVORITE_LANGUAGE) script to check the output for every page that uses these templates. Or maybe the whole thing could even be done in $FAVORITE_LANGUAGE. That seems in some ways a much easier solution than giving every single term "its own entry in switch table", but the switch table does have other benefits. And either way we'd have to handle cases of intentional entry name/transliteration overrides. 70.172.194.25 02:13, 20 February 2023 (UTC)
Removal of Unsupportedpage sorter
OK, didn't know you were working on the template. I was about to change it back, but I'll just leave it for now as it will eventually be correct. --Christoffre (talk) 07:50, 14 February 2023 (UTC)
There are still entries flooding in to CAT:E from an absentminded error you made earlier, but there's also something else at work: entries that use {{ja-r}} either directly or through list templates are throwing an error with the message "Lua error in Module:ja-ruby at line 508: Separator "%" in the kanji and kana strings do not match" (see 々#Usage_notes_2 for a typical example).
When I checked the transclusion list of one of them, I saw that a couple of your edits to the Module:links family had to do with "risky characters", and it looks like the separator "%" is one of those risky characters. I haven't figured out whether this is the cause, but I figure you can connect the dots a lot faster than I can, so here it is, for whatever it's worth. Chuck Entz (talk) 02:24, 18 February 2023 (UTC)
@Chuck Entz That should now be fixed. These sorts of errors are (sadly) to be expected with the changes I'm making, as they're generally caused by nonstandard/problematic uses of these core module functions. I do try to screen for them first, but it can be very difficult to know what horrors will come out of the woodwork.
In this case, % probably shouldn't have been chosen to be used like this, because it's used in URL codes (e.g. wiki/%26 will take you to the page for &). That causes problems if rubytext ever needs to be used with numbers, for instance. Theknightwho (talk) 03:10, 18 February 2023 (UTC)
Language-specific module handling
I don’t know what you are doing, but it has created bare errors: “Lua error in Module:languages at line 159: bad argument #2 to 'gsub' (string/function/table expected)”, e.g. at دوسر and אריסא. Fay Freak (talk) 21:28, 18 February 2023 (UTC)
e.g.:
{{R:du Cange|verbum}} → verbum in Charles du Fresne du Cange’s Glossarium Mediæ et Infimæ Latinitatis (augmented edition with additions by D. P. Carpenterius, Adelungius and others, edited by Léopold Favre, 1883–1887)
@Atitarev It's because I updated Module:languages so that we can store the list of scripts for a language as a string instead of a table if it only has one; Ukrainian only has Cyrl listed, where (e.g.) Russian has Cyrl and Brai, so wasn't affected. I did this because modules shouldn't be reading from Module:languages/data2 etc. directly, but instead should be getting the list from Module:languages, which will take this issue into account. The translation adder seems to be a special case, but I'm looking into it.
In terms of practical impact, though, this shouldn't really cause any problems: the script isn't actually specified in the translation template that gets added unless the language doesn't have it listed (which means it wouldn't be affected by this issue anyway). Theknightwho (talk) 16:04, 23 February 2023 (UTC)
Thanks, it seems to work now. I didn't express myself clearly before. It wasn't "weird"; it was sort of broken. I had to manually remove "C" from "Script code:", otherwise it was giving a "Please use a valid script code" error. Anatoli T. (обсудить/вклад) 21:50, 23 February 2023 (UTC)
Hi. I use getLanguageData() in Module:User:MewBot to fetch data on all languages. Between yesterday and today this broke; now it hits the 10-second max running time after it's processed between 1100 and 1150 languages when before it got through all 8173 languages. This breaks various bot scripts such as the one I just wrote to remove redundant Chinese translations (which needs to fetch info on languages so it can convert lect translations from 'zh' to the appropriate code). Do you know why this might have happened? Did you add significant amounts of info to each language that would have resulted in this? Thanks! Benwing2 (talk) 20:02, 25 February 2023 (UTC)
@Benwing2 I’ll take a look when I get home. I certainly haven’t added lots of info to each language, but I did make the change that mul and und have every script - which does get pulled through if you access Module:languages/data/all as MewBot does. Does the issue still happen if you exclude those two? If that solves the problem, we can work out a way to handle scripts with those two langcodes more efficiently (e.g. I’ve already put special logic into findBestScript to avoid this issue). Theknightwho (talk) 20:21, 25 February 2023 (UTC)
Hey. Re: "го́йда-хуёйда". I couldn't think of anything similar in English then. Now I recall. It's like someone childishly says "coffee- fuckoffee" or something, LOL. Anatoli T. (обсудить/вклад) 01:12, 1 March 2023 (UTC)
Hello, it seems that Nôm characters listed within headwords do not show up anymore in the entries. I'm not familiar enough with these, so I'm not sure if it's due to your edits at {{vi-noun}} (and other Vietnamese headword templates) or something else entirely, so pardon me if my assumption missed the mark. PhanAnh123 (talk) 11:23, 4 March 2023 (UTC)
Thanks for doing something to fix the tone markings in the Wu romanization. Hope other dialects like Wenzhounese can get added soon.
Right now I don't like how Shanghainese is the only supported dialect since it's not even considered the "purest" Wu (and that would be Suzhounese). WiktionariThrowaway (talk) 19:30, 6 March 2023 (UTC)
It would be good to add more lects - I agree. Right now, I've been focusing on getting automatic transliterations working for other lects, which was the main impetus for improving the way tones are handled with Wu: e.g. 吳/吴. More generally, I've been trying to better integrate the modules for the Chinese lects with Wiktionary's core modules, which should make adding new lects more straightforward. Theknightwho (talk) 19:41, 6 March 2023 (UTC)
Thanks for fixing this up. I don't know what to do with the transcription, though. Looks like we'll need to enter it manually, but I don't see an option for that. kwami (talk) 06:46, 14 March 2023 (UTC)
@Kwamikagami There is no current support in the Ukrainian modules for manual translit. This could potentially be added but it might be a lot of work. Probably better is to modify the Ukrainian transliteration code to recognize the ₚ char (which just displays as a box on my Mac in Chrome; we probably need a font addition to the CSS to make sure this shows up correctly). Benwing2 (talk) 02:34, 15 March 2023 (UTC)
hyphens remaining at the end of lines when displaying suffixes across lines
Hello, Theknightwho. I see you edited Module:links recently. I wonder if it would be possible to make the links displayed by {{l}} and {{m}} no-wrap, as it looks rather clumsy (in the case of suffixes) when a hyphen remains at the end of a line on its own and the rest of the suffix goes to the next line. (An example, see near the bottom right corner.) I've been thinking about replacing hyphens with non-breaking hyphens for the display, but they cannot be copied as real hyphens. Also, I'm aware that a nowrap would affect spaces as well (and word-internal hyphens should be handled too), so it's not ideal either. Maybe an uncopiable nonbreaking space could be inserted after a leading hyphen? Do you have some idea? Adam78 (talk) 15:05, 14 March 2023 (UTC)
Hi, I see a substantial and increasing number of one-char Chinese entries in CAT:E again, just curious if you've made changes that result in this? Benwing2 (talk) 02:36, 15 March 2023 (UTC)
The memory errors in CAT:E increased by ~ 12 in the last 30-60 minutes. I suspect this: I think we need to think more carefully about how to make changes without ever-increasing memory usage. Benwing2 (talk) 08:14, 16 March 2023 (UTC)
@Benwing2 I suspect you're right. I added those as a precautionary measure, as I don't know if anyone is pushing raw category/file links through any of the substitution methods (I'd hope not, but it's possible). It's not necessary if things are done via Module:links, though, as it's already set to ignore these four at a much earlier stage if it detects them as embedded links. See Module:links#L-209: they just get returned as-is. By extension, this means everything from {{also}} to {{lang}} is covered as well, as they all work via the link module at some stage. Do you think I can probably remove these? Theknightwho (talk) 08:21, 16 March 2023 (UTC)
I suspect they can be just removed; but if you're adding those just as a precautionary measure, you might want to remove them but in their place add some temporary template tracking code to see if they're actually doing anything. Within a few hours the Special:WhatLinksHere tracking "category" will have entries in it if the code is actually doing something, and then you can correct the callers as appropriate. Benwing2 (talk) 08:35, 16 March 2023 (UTC)
@Benwing2 Seems like something’s pushing raw category links through, though I’m unsure what as of yet: . That shouldn’t be happening, really, as it’s obviously pointless. Theknightwho (talk) 16:02, 16 March 2023 (UTC)
Each substitution method in Module:languages does temporary substitutions to protect formatting twice: once before preprocessing, and once straight after the substitutions/just before postprocessing, in case any formatting/weird stuff gets added by a module. Everything then gets put back at the end. A couple of transliteration modules were adding categories by concatenating them, which meant these were being picked up in the second round. I’ve adjusted those modules so their categories get dealt with properly.
When there are embedded links, Module:links still runs the unlinked text through makeDisplayText, which is done by iterating over all the gaps between the links. This is to pick up things like false palochkas, character escapes and whatever. However, I was only checking for piped links, as I didn’t think there was any way for an unpiped link to reach that stage. What I’d missed was that - very occasionally - there will be unpiped category links that get fed directly into the link template, which basically just get ignored for the reason I explained earlier. However, it meant they were incorrectly being treated as unlinked text at this stage. I’ve updated Module:links accordingly. I’ve not been able to figure out a pattern that perfectly matches piped and unpiped links in all contexts, but I’m confident the two separate patterns I came up with are 100% accurate to the parser (and are an improvement over what we had before). Theknightwho (talk) 17:07, 16 March 2023 (UTC)
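For illustration, a pair of patterns in that spirit (not the exact ones used in Module:links):
<pre>
local unpiped = "%[%[([^%[%]|]+)%]%]"            -- [[target]]
local piped   = "%[%[([^%[%]|]+)|([^%[%]]*)%]%]" -- [[target|display]]

local text = "[[water]] and [[aqua|Latin aqua]]"
for target, display in text:gmatch(piped) do
    mw.log(target, display)   -- logs: aqua    Latin aqua
end
</pre>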
Great, sounds good. When matching links I've always used separate patterns for piped and unpiped links; User:Erutuon sometimes uses a single pattern maybe using %f but it may not be 100% accurate. Benwing2 (talk) 17:12, 16 March 2023 (UTC)
@Benwing2 @Erutuon Given we were already being more permissive with our embedded links than standard wikitext anyway, I decided to make it as permissive as possible (since I noticed we do actually use "illegal" links like ] in headword templates already). As an extreme example, it can now cope with {{l|mul|] ]]}} to output ]. The only thing you can't do is use ] (though you can get round the display text problem with nowiki tags). Theknightwho (talk) 20:31, 16 March 2023 (UTC)
@Benwing2 I did actually do that originally, but there are massive performance issues unless it's either declared in the function itself or cloned, which is a pain as three functions use it. The reason I opted for a separate module was to save it being loaded unnecessarily. Theknightwho (talk) 20:43, 16 March 2023 (UTC)
@Benwing2 In all honesty, I think we should probably look at sorting out the Chinese modules. They're seriously bloated.
I did develop a bunch of string functions which are in Module:string utilities, which are designed to use the string library wherever possible - only using ustring as a last resort. It varies, but they definitely do help, so we could start using those in more places. Theknightwho (talk) 20:49, 16 March 2023 (UTC)
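The general shape of that string-first approach, as a minimal sketch:
<pre>
local function upper(s)
    if s:find("[\128-\255]") then
        return mw.ustring.upper(s)   -- non-ASCII: Unicode-aware, slower
    end
    return s:upper()                 -- pure ASCII: byte-level fast path
end
</pre>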
I agree and I think we should seriously look into my earlier suggestion of encoding the T<->S tables more efficiently as they have to be a big part of it. Benwing2 (talk) 20:51, 16 March 2023 (UTC)
@Benwing2 They are definitely a contributing factor, but there are quite literally thousands of Chinese data modules in the subpages of Module:zh. I've noticed T:zh-pron, T:zh-dial and similar using 8-15MB each in some cases, and I'm convinced it's due to this. By comparison, the traditional/simplified tables are small fry.
A major concern I have is that the Chinese modules make very little use of the main infrastructure, or at best it's piecemeal, which means there's a lot of duplication going on. I think we should probably tackle Module:zh-pron and its submodules Module:cmn-pron, Module:yue-pron etc. first, as it's the main Chinese module. I've done a few things, but they probably need a total rewrite. Theknightwho (talk) 20:58, 16 March 2023 (UTC)
@Benwing2 I think one cause of the spike yesterday was this diff, which was fair enough as there was a minor bug in a local function of Module:string utilities that tries to convert ustring-only patterns into string-compatible ones: ? after a multibyte character wasn’t being converted properly, which affected a few gsubs. Since patching that and restoring it, the number of entries in CAT:E has decreased. Theknightwho (talk) 14:16, 17 March 2023 (UTC)
The Hebrew-specific modules haven't been changed recently enough to cause this. I suspect it may have something to do with delimiters in the parameters being messed with by another module. Chuck Entz (talk) 15:14, 15 March 2023 (UTC)
Those are fixed, but there's a separate issue that seems to be correlated somehow with |pausalwv= (though there are a couple of exceptions with the same error but without that parameter and a few with the parameter but no error). It happened after @Erutuon edited Module:he-common, but I have no idea if that caused it. Chuck Entz (talk) 15:17, 16 March 2023 (UTC)
@Benwing2: Thanks for fixing it. Embarrassing mistake. It happened because I had logged around that statement and then deleted the whole block of text. — Eru·tuon 18:33, 16 March 2023 (UTC)
Hey: I saw this diff and I have been trying to copy it elsewhere. I wanted to say that there is some functionality of zh-l that is lost in bor en cmn: look at Diaoyutai & Fengqiu (manually adding character forms) and also the * functionality in zh-l (where no traditional or simplified are displayed when you put the other into zh-l). I don't know if you're working in this area or not, but I assume you are/were. If I make no sense, nevermind. --Geographyinitiative (talk) 22:06, 17 March 2023 (UTC)
@Geographyinitiative These can both be solved by the same thing, actually. You can use // to give manual forms: {{l|cmn|釣魚臺//釣魚台//钓鱼台}} gives 釣魚臺/釣魚台/钓鱼台(Diàoyútái). Because (1) manual forms override automatic ones and (2) empty forms aren't shown, you can put // at the end to stop automatic simplification: {{l|cmn|詞典//}} gives 詞典(cídiǎn), whereas {{l|cmn|詞典}} gives 詞典/词典(cídiǎn). On the other hand, * at the start is used for reconstructions, as with other languages. Theknightwho (talk) 22:14, 17 March 2023 (UTC)
Hi. Should we use IPA for this? I'm speaking of tone specifically. There was general consensus a couple years ago that Chinese should be in IPA just like any other languages, but we had one major contributor (who was adding a lot of valuable material) who threatened to quit Wiktionary if we switched to IPA. kwami (talk) 02:31, 18 March 2023 (UTC)
Hi, I see that you're updating ryu-r to be in line with ja-r. Would you mind creating {{ryu-r/multi}} and {{ryu-r/args}}, similar to {{ja-r/multi}} and {{ja-r/args}}? This would be immensely helpful for some of the pages that are very close to the Lua memory limit and have quite a number of transclusions of ryu-r, such as 人 and 月. – Wpi31 (talk) 09:21, 23 March 2023 (UTC)
Hi, I see CAT:E has ballooned again with memory errors. We really need to optimize Module:languages (especially) and Module:links to avoid doing unnecessary work. I think this needs to be your top priority currently. You should add checks in various places to see if there are any chars that could potentially cause issues, and do nothing if not. Currently it seems we're doing a whole lot of unnecessary splitting, processing and rejoining. I also see you're calling mw.clone() on Module:languages/data/patterns in various places; that uses extra memory, is it really necessary? Benwing2 (talk) 22:04, 23 March 2023 (UTC)
@Benwing2 This happened after I implemented the fix for the Korean transliteration issue, unfortunately, but I've been too tired to deal.
I remember that the pattern cloning was necessary because there was a massive, unexplained slowdown without it, but I can't seem to replicate it now. Looking back, I have a feeling it was probably down to the mw.ustring.gmatch bug.
OK thanks, please do take care of yourself and get some sleep! (You mentioned working in law, and I know at least in the US some law firms are notorious for working their employees to death.) Benwing2 (talk) 22:31, 23 March 2023 (UTC)
Hi, there are still 28 or so terms in CAT:E. We really need to bring them down. I can help you work on optimization, but I can't do it alone as I don't understand some of the specifics of the code you've written. Benwing2 (talk) 04:29, 26 March 2023 (UTC)
@Benwing2 Hiya - sorry, I forgot about this yesterday. I've dealt with most of them, but the two Han characters are proving quite tricky. I'll see what I can do. Theknightwho (talk) 20:20, 26 March 2023 (UTC)
Thanks. How did you bring them down? What I'm suggesting is looking for opportunities to avoid unnecessary work in Module:languages, not simply using ever more *-lite templates. IMO doing this is not hard and very important. I would do it myself but I don't know which sorts of optimizations of this nature are safe. Benwing2 (talk) 20:22, 26 March 2023 (UTC)
@Benwing2 Mostly with {{tt}}, but ~5 needed lite templates. I agree that we need a longer term solution, but I was trying to get these out of the category in the short-term, as it may take a bit of work to touch up Module:languages (as there's quite a lot to go through, and I'm not happy with the overall structure right now). In the immediate term, I think it's okay to keep adding more lite templates as and where necessary, as we can always remove them once the memory issues are a bit less pressing. Theknightwho (talk) 20:27, 26 March 2023 (UTC)
The optimizations I'm suggesting will not require major rewriting of Module:languages and will need to be done even once rewritten so IMO you may as well look into them now as they will save a lot of time worrying about adding lite templates and such. What I'm suggesting is similar to what I already do when parsing <...> inline modifiers; first check to see if there's a less-than sign anywhere, and if not, avoid loading Module:parse utilities and fall back to simpler code. In your case, check e.g. for brackets, apostrophes and other things that might require you to split the string into parts and process each part individually, and fall back to simple code that doesn't do any splitting. Essentially you optimize for the most common/simple case and avoid invoking (and ideally even loading) the more complex code. Benwing2 (talk) 20:33, 26 March 2023 (UTC)
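A hypothetical wrapper showing that pattern; the function name and call site below are stand-ins rather than the real code:
<pre>
local function maybe_parse(text)
    -- find with plain=true is a cheap scan with no pattern compilation.
    if not text:find("<", 1, true) then
        return text   -- common case: no inline modifiers, bail out early
    end
    -- Heavy path: only now load and run the full parser.
    return require("Module:parse utilities").parse_inline_modifiers(text)
end
</pre>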
@Benwing2 I've made a few changes along these lines, which seems to have helped. I've had to keep one instance of mw.clone in doTempSubstitutions: depending on the params, certain additional patterns get inserted into the table of patterns that gets iterated over. If you don't clone the pattern table, these seem to get inserted into the version of the table sitting in package.loaded, which means they're still there the next time the pattern table gets loaded. On the other hand, using mw.loadData isn't an option, as we need to insert the extra patterns; plus it seems to cause a memory increase anyway. Theknightwho (talk) 00:35, 27 March 2023 (UTC)
Also just as an FYI, I've used the U+100000-1FFFFD range for the PUA substitution characters, because we can take advantage of that in capture patterns. e.g. "\244*" will always match a character in that range, but it also means that patterns like "^*" are usable as well (for a string of PUA chars at the start of a string). The only false positives would be non-UTF-8 compliant, and they get caught earlier in the process. Theknightwho (talk) 00:49, 27 March 2023 (UTC)
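A quick illustration of why that lead byte is a safe marker (using Scribunto's mw.ustring):
<pre>
-- Every code point from U+100000 upwards is a four-byte UTF-8 sequence
-- with lead byte 0xF4, and no lower code point uses that lead byte.
local first = mw.ustring.char(0x100000)   -- "\244\128\128\128"
assert(first:byte(1) == 0xF4)
-- Matches one whole substitution character at the byte level:
local one_pua = "\244[\128-\191][\128-\191][\128-\191]"
</pre>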
Over the years, we've had several cases of people erroneously creating entries with PUA characters in the entry names. I hope that kind of thing won't lead to weird and hard-to-diagnose errors before we spot such entries and delete them. Chuck Entz (talk) 01:05, 27 March 2023 (UTC)
@Chuck Entz They won't cause errors - those characters simply won't go through a lot of the text processing that other characters will. Given they're PUA characters, it doesn't really make much sense for them to be processed anyway, so it doesn't really matter. It still works, though, if you really want to do it: e.g. {{l|en|}}, {{l|en|''''''}}, {{l|en|w:''''''}} becomes , , and so on. I can put a pre-check on it to stop people doing this, though, as there's a chance the text might get mangled (but again, as they're PUA characters, people shouldn't be making these in the first place, so what's getting "mangled" is actually totally meaningless). Still no errors, though. Theknightwho (talk) 01:15, 27 March 2023 (UTC)
Thanks. Yeah sometimes mw.loadData() increases memory usage; I ran into that with the {{place}} data, where separating out the functions and tables and using mw.loadData() on the remainder increased memory on a test page with about 60 invocations of {{place}} from I think 25 to 29M. I think the issue is the wrapping of tables, which adds a bunch of overhead, which is only made up for if you load it a whole bunch of times (apparently 60 wasn't enough). As for PUA chars in entries, I wouldn't worry too much about them esp. if the pre-check for them will add memory. Benwing2 (talk) 01:26, 27 March 2023 (UTC)
I'm not sure it's the number of uses, as doTempSubstitutions gets used (tens of) thousands of times on large pages. Who knows.
@Chuck Entz Thanks - that happened after I updated the lite templates to escape * at the start of translits, as it was sometimes wrongly causing a new list to start. I just got rid of the lite template here, as the page is well below the limit. Tangut translit is intensive (as it's all in a character database), but the page is still only at 46MB.
For the record, I have been trying to bring down 一 and 茶 for the past few days but to no success – 一 is now not in CAT:E presumably due to changes with the Korean templates as mentioned, but there is too much stuff in the descendants of 茶, which I think a multidesc template would help. – Wpi31 (talk) 05:51, 27 March 2023 (UTC)
Hi, there are a bunch of errors now in CAT:E. Looks like they are due to a malformed regex coming from one of the serialization modules. Benwing2 (talk) 04:41, 29 March 2023 (UTC)
I think you need to rethink your edits relative to this code. For instance, Philistine is a real language of unknown affinities, and has 11 lemmas. Many of them are questionable, but we can't rfv them- because your edits have messed up the processing of "und-phi", which is an exception code based on "und", not "und" itself.
Also, "und" has been used as a placeholder when the correct language code is unknown. Most of the time this is a bad idea, but there are some cases where it's at least arguable. For instance, while inheritance by "und" is clearly wrong, borrowing doesn't require any relationship between the two languages, so borrowing into a language with an unknown code can certainly happen. On a practical level, these "und" descendants have been added by very knowledgable people to deal with problems they couldn't solve. They aren't going to be easy to clean up. Chuck Entz (talk) 16:00, 1 April 2023 (UTC)
After spending a good part of my day on module errors, it looks like the problem with Philistine has nothing to do with the "und" part. Instead, it's due to it having nil instead of a family code. There are a number of languages like this, mostly in the Americas but also in Africa or the ancient Near East. What they have in common is extremely limited attestation due to having died out before a decent corpus could be collected or produced. With such a limited corpus, it's often impossible to determine whether a given language is related to others, or even whether it's an isolate. I've spot-checked a number of these, and they all throw the same error in the {{rfv}} template, due to the module's checking the family in order to decide which rfv forum to send it to. The simplest solution would be to just assume that anything with a null family code goes to RFVN, and skip the rest. We really do need to have {{rfv}} working for these, since unknowns tend to attract crackpots or overconfident idiots who are only too happy to fill in the blanks. See for instance the trouble we've been having with Illyrian (never mind that we're pretty sure it's Indo-European), though Philistine itself is a pretty good illustration, too. Most of the Philistine entries were added and/or edited either by BedrockPerson and socks or by ShlomoKatsav, who probably knows better by now. Interestingly, most of ShlomoKatsav's entries were deleted after an RFV initiated by what seems in retrospect to be a BedrockPerson IP sock. But I digress...
Also, the two Proto-Turkic entries in CAT:E seem to be the same thing going in the other direction: they're only attested in one very old manuscript covering a multitude of languages, so it's apparently impossible to tell exactly what language they are, but there's enough data to figure out which branch they belong to. Chuck Entz (talk) 02:43, 2 April 2023 (UTC)
@Chuck Entz You make some very good points, which I hadn't considered - I've removed the ban.
This is something that came up due to the problem of substrates, where (at least internally) they're treated as etym-only variants of und, but with the language family of whatever their parent is; what distinguishes them is that their parents are families, not regular languages. Before the major update to etym-only languages that just happened, they were being caught by the ban on using family codes in descendant templates, as the template would grab the parent, see that it's a family, and throw an error. However, after the update, they'd be getting processed as und instead - and therefore working.
To be honest, I suspect the old logic was simply an oversight, as substrates are hardly used. If we allow und for descendants, then there's no reason we shouldn't allow substrate descendants either. Theknightwho (talk) 16:05, 2 April 2023 (UTC)
@Chuck Entz I've fixed the family bug. The underlying issue was that the language's inFamily method first checks if the language is in family X. If not, it then checks if the language's family is in family X (and if not, checks that family's family, etc.). It's that second step which was throwing the error, because you can't check what family nil is in. I've changed it so that it immediately returns false if a language doesn't have a family; the practical result being that Module:request-forum will see that these terms aren't Italic, so sends them to WT:RFVN. Theknightwho (talk) 23:05, 2 April 2023 (UTC)
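(For anyone following along, here's a minimal sketch of the guard being described. The method names follow the usual Module:languages conventions, but the body is illustrative rather than the actual module code.)
 local function inFamily(self, code)
     -- The object might itself be the family we're looking for.
     if self:getCode() == code then
         return true
     end
     local family = self:getFamily()
     -- Poorly-attested languages may have no family at all; without this
     -- guard, the recursive step below would index nil and throw an error.
     if not family then
         return false
     end
     -- Otherwise walk up the tree: the family, then its family, and so on.
     return family:getCode() == code or inFamily(family, code)
 end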
Incorrect Middle Chinese final for some characters
Latest comment: 1 year ago · 2 comments · 2 people in discussion
I saw you had recent edits to Module:ltc-pron/data so I wonder if you know where the following bug is coming from.
The final in the Middle Chinese table (4th row) for certain rimes loads the wrong rime character but the right number.
Examples where you can see this include 溢 (showing 眞 instead of 質) and 沃 (showing 冬 instead of 沃).
It seems the items in the "data.fin_conv" variable in the module are the ones affected.
@Zywxn: these are in fact correct and not a bug: 質 is the entering tone equivalent of 眞, and likewise for 冬/沃 and the other ones in data.fin_conv. The entering tone variant is phonologically the same as its non-entering tone equivalent, except that it ends in a stop and carries the entering tone.
For the purpose of the module, saying that a character has 冬 rime and entering tone is sufficient for telling the reader that it has 沃 rime. – Wpi31 (talk) 19:25, 2 April 2023 (UTC)
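(Illustratively, the correspondence being described amounts to something like the following. The pairs are the ones from this thread, and the table name merely echoes data.fin_conv rather than reproducing it.)
 -- Non-entering rimes and their entering-tone counterparts.
 local entering_of = {
     ["眞"] = "質",
     ["冬"] = "沃",
 }
 -- "冬 rime + entering tone" is displayed to the reader as the 沃 rime.
 local function display_rime(rime, is_entering_tone)
     return is_entering_tone and entering_of[rime] or rime
 end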
errors with ancestral_to_parent
Latest comment: 1 year ago · 4 comments · 2 people in discussion
@Benwing2 Thanks - it was a metatable issue, where the parent now caches _type during the creation of an etym-only language due to running :hasType(), but that was interfering with the etym-only language's own :hasType method, by giving a false positive when it checked for a cached _type table. Theknightwho (talk) 04:12, 3 April 2023 (UTC)
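(A self-contained sketch of that kind of false positive, with illustrative names rather than the real module code: a lazy cache on the parent leaks into the child through __index.)
 local Language = {}
 Language.__index = Language
 function Language:hasType(t)
     -- Lazily compute and cache the set of types on first call.
     if not self._type then
         self._type = { language = true, full = true }
     end
     return self._type[t] == true
 end
 local parent = setmetatable({}, Language)
 parent:hasType("full") -- caches parent._type
 -- An etym-only child that inherits from the parent object:
 local child = setmetatable({}, { __index = parent })
 -- child._type resolves to the parent's cached table via __index, so the
 -- child skips its own initialisation and reports the parent's types.
 print(child:hasType("full")) --> true (false positive)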
This is definitely a non-obvious interaction; can you add a comment by your change in ? Otherwise it won't be at all obvious why this is being done. Benwing2 (talk) 04:17, 3 April 2023 (UTC)
Latest comment: 1 year ago · 3 comments · 2 people in discussion
My bot created a bunch of these, which is an indication that the category code isn't working quite right. I've stopped it from creating more but the category code needs fixing here. Benwing2 (talk) 05:50, 4 April 2023 (UTC)
Latest comment: 1 year ago · 7 comments · 4 people in discussion
Hi again, see CAT:E. I purged the category, which eliminated the Old Iranian errors, but the Middle Iranian ones remain. I assume these categories shouldn't exist at all; "Middle Iranian" and "Old Iranian" are not languages at all, but families (and not even clades at that). However, this is a sore subject for User:Vahagn Petrosyan, who insists that these "languages" should exist for ease in porting over Armenian etymologies, where it's apparently difficult to figure out which of several candidate Old and Middle Iranian languages are the right donors for various borrowed terms, and so the linguists working in this area lazily put "Middle Iranian" and "Old Iranian". I have difficulty understanding how this can possibly work, as the phonologies of different Old and Middle Iranian languages are radically different, but go figure. Benwing2 (talk) 06:31, 5 April 2023 (UTC)
@Benwing2 @Vahagn Petrosyan So Old and Middle Iranian are the only two etymology-only families that we have. Unlike with substrates, where they behave like variants of und, that doesn’t seem appropriate here. I can fix them up, but the question still remains whether we should even have something like this at all. Theknightwho (talk) 10:54, 5 April 2023 (UTC)
I have come to the conclusion that despite some claims, Old Iranian borrowings in Armenian probably do not exist. The main reason for having separate "Middle Iranian" and "Old Iranian" has disappeared for me. I don't mind if the codes are deleted. You can bot-replace {{der|xcl|MIr.}}, {{der|xcl|OIr.}} with "Middle {{der|xcl|ira}}", "Old {{der|xcl|ira}}". Vahag (talk) 11:58, 5 April 2023 (UTC)
I find it okay too, even though it is certain that both Old and Middle Iranian borrowings exist in Aramaic. Not every language conceptualization with a name needs to have a code, if it is not even a language per se and etymological storytelling does not require it either. It is perfectly okay if Sarri.Greek has written “Hellenistic Koine Greek” in various places, because that’s what she got to know, while language codes may not do anything here. I used “Middle Iranian” because of analogy and Systemzwang more than for any inherent reason. Fay Freak (talk) 14:29, 5 April 2023 (UTC)
@Vahagn Petrosyan @Fay Freak From a technical perspective this is easy to fix - I agree it isn’t a true family, but it feels a bit crap to remove it entirely. This feels like the sort of thing we could keep for the purpose of categorisation, but nothing more.
That being said, I don’t know much about Indo-Iranian languages - I’m just commenting on the fact that there’s no technical limitation. Theknightwho (talk) 14:34, 5 April 2023 (UTC)
Latest comment: 1 year ago · 2 comments · 2 people in discussion
If it helps any, these are completely unrelated to the Middle Persian issue. There seems to be something about Module:ka-form of in those entries that's not setting up the parameters for Module:links the way the latter is expecting since you changed the code. I checked the transclusion list for one of the entries: none of the Georgian-specific modules or Georgian-specific data in data-modules has been changed recently except for some changes to Module:Geor-translit on March 30. Chuck Entz (talk) 21:43, 7 April 2023 (UTC)
@Chuck Entz I know what’s causing the problem, as it’s down to Module:links destructively modifying the input. Normally this doesn’t matter, but it does sometimes matter in loops when the same data container gets used over and over. This came up with a Pali module as well.
It’s possible to solve by doing what’s called a “shallow copy” (where you create a copy of the container that still contains the original members, versus a deep copy that copies everything inside as well). That way, any modifications to the container don’t affect the original one. However, this has a (very minor) memory impact, and also seems to break modules that are accidentally passing the input data as a global variable. I need to catch all of those first (such as the one I just fixed in Module:interproject). The memory impact is negligible, but will probably require some additional saving measures on extremely vulnerable pages like go. Unfortunately, there’s no real way around that impact, as the destructive modification is the most efficient way to implement multiple forms for links.
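(To illustrate the distinction for anyone reading: a hand-rolled helper below, purely as a sketch; the point is that the nested tables are still shared.)
 -- Shallow copy: a new outer table whose values still alias the originals.
 local function shallow_copy(t)
     local copy = {}
     for k, v in pairs(t) do
         copy[k] = v
     end
     return copy
 end
 local data = { term = "foo", genders = { "m" } }
 local safe = shallow_copy(data)
 safe.term = "bar"     -- top-level write: the original is untouched
 safe.genders[1] = "f" -- nested write: still visible through `data`!
 print(data.term)       --> foo
 print(data.genders[1]) --> f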
Latest comment: 1 year ago · 2 comments · 2 people in discussion
I'm not sure exactly how to fix these, but I think the problem stems from assuming that a |tr= parameter is always an implicit request for a native-script equivalent. After weeding out a couple of incorrect language codes and fixing cases where the |tr= was an improvised substitute for using an empty first parameter with a display text in the second parameter to display without linking (I haven't figured out yet how to fix some of these in {{desc}}), there are a few cases where the |tr= is some kind of phonetic transcription meant to display alongside the term specified by the usual parameters. There might be a need for a new parameter for this function, but I'm not sure what to call it and it should be discussed first. At the very least I think you should avoid generating a native-script request category for Latin-script languages, since {{auto cat}} will throw a module error if you create it. It may also be desirable to display Latin-script |tr= text for Latin-script terms as if it were a transliteration, even though it technically can't be one and there's no translit module for it. Chuck Entz (talk) 21:01, 8 April 2023 (UTC)
@Chuck Entz I think this is what ts= is for, though using that still produces the request for native script terms. To be honest, I actually think it's worth lifting the ban on having that for the Latin script, as it's still an accurate description for something like {{cog|lus||ts=puŋᴴ}}, which produces Mizo (/puŋᴴ/). Theknightwho (talk) 01:17, 9 April 2023 (UTC)
Template:trans-see wrong markup
Latest comment: 1 year ago · 4 comments · 2 people in discussion
I think Template:trans-see currently generates wrong markup.
See, for example, amenity, then Translations, then "convenience — see convenience".
Latest comment: 1 year ago · 5 comments · 2 people in discussion
We need to be more selective with where we put this. Using it on something like Module:zh/data/st is like throwing a device in a bucket of water so we can be 100% certain we know what the problem is...
It was working fine for over a month, and I spent a long time working on that check. Something new has thrown it off, but it’s important that we keep it. Theknightwho (talk) 22:08, 9 April 2023 (UTC)
@Chuck Entz If you check Module:zh/data/ts or Module:zh/data/st now, you’ll see there are tons of results, but it runs quickly with a relatively low overhead, all things considered. Seems the serialisation checks are causing problems, which is not surprising given how massive Module:Hani-sortkey/data/serialized is; but using it means we can sort 1.5k Chinese terms with a 1.5MB overhead instead of a 15-20MB one.
The {{auto cat}} issue I’m aware of. It’s because I just consolidated Module:etymology languages into Module:languages, as the modular separation between them was becoming untenable. On pages with very few calls (e.g. 1), require is actually way more memory efficient than mw.loadData - using around 50% less in some cases. Currently, we’re in a brief transitional stage where we’re doing both, so once that’s over these issues should go away. Should hopefully be tomorrow. On a related point, I also rewrote Module:family tree/nested data to be more comprehensive and flexible, but it’s now an even bigger memory hog than it was before. It’s probably why we’re hitting the limit for the first time, as the PIE tree uses 37MB alone. That should drop down to about 25MB with the changes, bringing us well within the buffer. Theknightwho (talk) 22:33, 9 April 2023 (UTC)
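(For context, the two loading styles being compared, with a real module name; the 50% figure is the one quoted above, not something this sketch measures.)
 -- mw.loadData parses a data module once per page and hands every #invoke
 -- a read-only proxy over the same copy: cheap when a page makes hundreds
 -- of calls, but the proxy carries fixed overhead of its own.
 local shared = mw.loadData("Module:languages/data/2")
 -- require gives the current #invoke an ordinary table instead. With only
 -- one or two calls on the page, that can work out substantially lighter.
 local plain = require("Module:languages/data/2")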
@Benwing2 I keep meaning to get around to this. It's just me being slow, that's all. The solution is probably to just delete the codes, but I'll put a bar against the category modules creating categories like these if the object is etymology-only. Theknightwho (talk) 18:09, 10 April 2023 (UTC)
There are also a lot of 'Requests for native script for ETYMLANG terms' as well as 'Requests for Unspecified script for LANG terms' in Special:WantedCategories. The latter should definitely not be generated at all; for the former, either we need to avoid generating them or modify the Requests category code to allow for etym langs (which would be easy but it's not clear to me what the right thing to do is). Benwing2 (talk) 23:06, 10 April 2023 (UTC)
This could also be how we handle request categories, but since they're a maintenance issue, we might want to limit it to the non-etym language only. Theknightwho (talk) 23:37, 10 April 2023 (UTC)
@Benwing2 It's because Vahag expressed dissatisfaction at the terms being divided up between the categories, and I'm inclined to agree; it's much more useful to have the terms in one place, in the same way that "derived from" is a catch-all category.
The reason Category:Old Armenian terms derived from Middle Iranian languages is being put in the Indo-Iranian category is that Middle Iranian has Iranian as its parent, which means that its family is Indo-Iranian (not Iranian). Currently, the boiler is only looking for the family, but we need to make sure it handles the parent as well - which is how etym-only languages are handled. Presumably this is because etym-only families haven't really been dealt with properly until now.
I also suggest we formally disentangle substrates from families, and just make their parent und. If they need to have a specific family, we can just set it in the data. This is now easier, because the etym language data has some of its keys automatically moved around before being returned, to make sure that it's fully compatible with the normal language data.
One last thing that occurs to me is that there's actually no reason why we couldn't use class inheritance during the creation of full language objects, too. For example, it might make sense to use it for Bokmål and Nynorsk, since we also have the code no for Norwegian. Theknightwho (talk) 00:32, 11 April 2023 (UTC)
Hmm. If we are to add all terms derived from an etym language to the category corresponding to the etym language's parent, I think we should bring this up in the BP first as some people could conceivably object. As for substrates, I have no strong opinions here; go ahead if you want to make changes. As for the catboiler code, the relevant code is around line 1080 of Module:category tree/poscatboiler/data/terms by etymology; I don't think it will be hard to fix it. For regular-language inheritance, sure, I guess if we are stuck with all three of 'no', 'nb' and 'nn' as regular languages, we can make the latter two inherit from the former (although in practical terms, what would this get us?). Benwing2 (talk) 01:46, 11 April 2023 (UTC)
This could come in handy with Norwegian, too: Category:Terms derived from Norwegian Bokmål doesn’t go in Category:Terms derived from Norwegian at the moment, and it probably should. Inheritance would solve that. Plus, it would simplify the data, as it’s a simple way of coordinating languages. Perhaps this is a way to solve the problems we have with langs like Chinese or Tibetan, where multiple L2s sometimes get placed under the same header. I appreciate that this starts to blur the boundary between language and family, but in practical terms we already do that. Theknightwho (talk) 18:51, 13 April 2023 (UTC)
@Chuck Entz Thanks. I introduced a change so that Module:zh-translit will check the term linked by {{zh-see}} if it can't find any instances of {{zh-pron}} on the page, which means it now works for simplified or variant terms as well. Occasionally, there'll be a chain (e.g. simplified forms of variants), so it iterates until it finds {{zh-pron}} (or finds neither template). Theoretically, that means it can get into an infinite loop if you point a group of pages at each other with {{zh-see}} without ever using {{zh-pron}}, but of course that should never happen...
It's relevant to note that you will get different results with different lects of Chinese, as it depends on whether the lect is referred to in {{zh-pron}} or not. If it isn't, it moves on to {{zh-see}}. I suspect there isn't a real mistake on the entries here, but rather an instance of two pages missing a Xiang reading, which causes an infinite loop due to missing data. It's not unusual for very common characters to be variants for certain readings, so a pair of these would cause this. Theknightwho (talk) 06:08, 20 April 2023 (UTC)
I was right: 水 can be a variant of 媠 in Min Nan, which can be a variant of 嫷, which in turn can be a variant of 媠 for a different reading (ad infinitum). I'll put in something to stop that happening, as it's an edge case I hadn't considered. Theknightwho (talk) 06:23, 20 April 2023 (UTC)
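(Something along these lines would be enough to stop it - a sketch with illustrative names, not the actual Module:zh-translit code.)
 -- Follow {{zh-see}} redirects until a page containing {{zh-pron}} turns
 -- up, refusing to revisit a page so that variant loops like 媠 → 嫷 → 媠
 -- terminate instead of recursing forever.
 local function find_pron_page(pagename)
     local seen = {}
     while pagename and not seen[pagename] do
         seen[pagename] = true
         local content = mw.title.new(pagename):getContent() or ""
         if content:find("{{zh%-pron") then
             return pagename
         end
         -- Jump to the target of {{zh-see|...}}, if any.
         pagename = content:match("{{zh%-see|([^|}]+)")
     end
     return nil -- no pronunciation found, or the chain looped
 end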
Pali/Sanskrit Thai Script PUA Characters
Latest comment: 1 year ago · 3 comments · 2 people in discussion
In Module:languages/data/2, you wrote "FIXME: Not clear what's going on with the PUA characters here" apropos U+F700 and U+F70F. The basic issue is that ญ YO YING and ฐ THO THAN drop their bottom parts when combined with a mark below. As basic rendering of Thai is fairly old, predating OpenType, there arose the convention of using those two PUA characters for the glyphs thus modified. There's also what I think is the barbarous practice (@Octahedron80 for comment) of using those truncated glyphs for YO YING and THO THAN in all positions when writing Pali. (It's certainly not a universal convention for Pali.) Standards-oriented font creators tend to reject this glyph encoding; the glyphs are highly unlikely to be encoded in Unicode as separate characters on the basis of Pali.
At thwikt, these two dated letters will only show in headwords, as Thai fonts can natively show them. Traditionally, ญ & ฐ should be "unfooted" even without marks under them. Why? Because old people didn't write the "feet" - not only in Pali, but also in Sanskrit. For page titles, we just use the normal letters. The dated letters are handled in entry_name, which changes them to the normal letters, in case they are linked from somewhere.
Adding VS01 is the newer method, using variation sequences, but no standard for Thai has been released yet and no font supports it. (The fonts would be updated once there is a standard.) PS: VS01 is used in other scripts. --Octahedron80 (talk) 14:20, 21 April 2023 (UTC)
Sinitic pronunciation modules
Latest comment: 1 year ago · 2 comments · 2 people in discussion
First off, thanks for your recent work on mod:cmn-pron. While it is possible to improve the other modules in a similar vein, I think ultimately mod:zh-pron will need some major revamping, and I already have some basic ideas of what to do – this involves moving most of the repetitive lect-specific stuff into a data module or their respective modules (plus, the mixture of labels and if-else conditionals is difficult to maintain), standardizing the interface, etc. This will be a time-consuming job, which I hopefully can start working on in late May. In the meantime, I don't think there is much point in changing the modules, since they'll probably have to be redone to suit the new interface anyway. – Wpi31 (talk) 17:09, 26 April 2023 (UTC)
@Wpi31 Yes, you're right! My main concern was to develop a more sensible baseline transcription which all the other conversions can work from, which is essentially Pinyin with some of the quirks removed (iu --> iou etc). That's now done, which should make the task of simplifying other parts of the module much simpler.
Latest comment: 1 year ago · 3 comments · 2 people in discussion
Hi again,
I created 韶神星 for the asteroid (6) Hebe, but am having trouble with zh-forms. The derivation is not 韶 + 神 + 星, but an abbreviation of 韶華 + 神 + 星. Is there not an option for that in zh-forms, or am I just missing it? kwami (talk) 00:05, 6 May 2023 (UTC)
@Kwamikagami I don't think there's a dedicated way to do it, but I've changed the gloss for the first character in a way that hopefully makes it clear what's going on. Theknightwho (talk) 00:27, 6 May 2023 (UTC)
Latest comment: 1 year ago · 1 comment · 1 person in discussion
Hi,
Thanks for working on the Russian transliteration scraper. I am keener on this possibility - multiword Thai, Khmer, etc. transliterations, be it with spaces (or ]][[ between words) and respellings (I am sure it can't be done otherwise - no tool will transliterate Thai with 100% accuracy) - and about using |tr= to respell words with multiple readings. Don't mind the negative comments you received. Using Japanese kana to transliterate everything is also desirable. Feel free to experiment. It's up to you, but the Russian transliteration is working, even if adding manual translits is a bit of a burden. Also @Benwing2. Anatoli T.(обсудить/вклад)03:12, 8 May 2023 (UTC)
Latest comment: 1 year ago · 1 comment · 1 person in discussion
Hello, hope you are well. I wanted to ask you if you would take a look at Template:zh-wikipedia. You can see it in use at 中國/中国 (Zhōngguó). I believe that the order of Wikipedias on the Zhongguo entry (that I just linked) is not academically or ethically justified. I believe that Wiktionary should follow the order of Wikipedias as at , and that Template:zh-wikipedia should "auto-sort" (automatically sort) the words in the order of the Wikipedias that Wikipedia uses, no matter what order the Wikipedias appear in the editor's view in any given Wiktionary entry's Template:zh-wikipedia. (1) Even if you don't agree, would it be technically feasible to code in an auto-sort or something like that on Template:zh-wikipedia? Maybe for a different order? (2) Is the Wikipedia order justified within Wikipedia but yet not justified for sorting the Wikipedias when linked from Wiktionary? (3) Is there perhaps another order that you can think of? Even ones you don't think would be good? I've always felt there was something wrong with what I was doing, and I'd like to see if that resonates at all with you (or anybody reading this). Thanks for any feedback. --Geographyinitiative (talk) 12:46, 12 May 2023 (UTC)(Modified)
On the removal of Proto-Northeast Caucasian and Proto-North Caucasian
Latest comment: 1 year ago · 1 comment · 1 person in discussion
Hi.
Proto-North Caucasian. I think we should remove everything related to Proto-North Caucasian (everything it contains, and Proto-North Caucasian itself), simply on the grounds that this is not a family but a superfamily, and an unproven superfamily at that.
Proto-Northeast Caucasian. There are still no good reconstructions of Proto-Northeast Caucasian. Perhaps the only revision of the reconstruction of Starostin and Nikolaev (1994) is the work of Nichols (2003); however, she uses the # sign, introduced by Williams (1989), which stands for pseudo-reconstructions. Here everything (Appendix:Proto-Nakh-Daghestanian reconstructions) is incorrectly indicated with an asterisk, which can be misleading. This family has not been proven.
Proto-Northeast Caucasian splits into two branches, Proto-Nakh and Proto-Daghestanian. Proto-Nakh is quite well reconstructed, which certainly cannot be said of Proto-Daghestanian. For some reasons Proto-Daghestanian could be kept rather than deleted, although it has not been proven either. Recently, for example, Schrijver (2021) compared Proto-Nakh with Proto-Tsezian and Proto-Avaro-Andian; however, Proto-Avaro-Andian is not properly worked out, and besides, the Proto-Tsezian forms are adjusted to fit the Proto-Avaro-Andian data.
In accordance with point 3, modern comparison is made not with Proto-Nakh and Proto-Daghestanian, but directly with the Daghestanian groups (Proto-Tsezian and Proto-Avaro-Andian), which casts doubt on the existence of Proto-Daghestanian.
Latest comment: 1 year ago · 3 comments · 3 people in discussion
When you excluded those languages from diacritic categorization, did you also intend to cause upwards of 150 module errors? When I first saw the module errors, I thought they would only be there for an hour or two while you were working on the changes to switch things over, so I didn't say anything. Chuck Entz (talk) 21:37, 13 May 2023 (UTC)
The problem is that char_category() is local to a portion of the main function when it needs to be pulled out and moved up. Benwing2 (talk) 21:41, 13 May 2023 (UTC)
@Christoffre I don't think so - numbers were excluded from all automatic categorisation (which had been the case for a long time), but I've now changed that, as I think they're rare enough to warrant categorising: people were manually categorising them anyway, so it's better to just do it properly.
Changes to modules take time to filter through to individual entries, as (literally) millions of pages can be affected. If you purge the cache on a page, it'll do an immediate update, though. If you look at the category now, you'll see all three are in it. I'm sure the others will filter through in time. Theknightwho (talk) 17:08, 14 May 2023 (UTC)
Latest comment: 1 year ago · 8 comments · 2 people in discussion
Aside from the module errors, these look wrong. For some reason, they're displaying the vowels with iota subscripts as vowels followed by free-standing iotas. This is deceptive, since there are plenty of examples like Greek Αιγίνης (Aigínis), which aren't in Category:Greek terms spelled with ᾼ.
It must have something to do with fonts, since "Category:Greek terms spelled with ᾼ" displays with the iota subscript in the edit window for me, but not in the preview. What's more, if I use a Greek keyboard to type the Greek part in manually as separate characters, it displays as separate characters but doesn't link to the correct category: Category:Greek terms spelled with Αι even though the spelling looks identical to me in preview. For the record, I'm using a MacBook Pro with an old version of MacOS, and I get the same thing with both Firefox and Safari (not logged in on the latter). Chuck Entz (talk) 21:48, 14 May 2023 (UTC)
@Chuck Entz So I have a feeling I know what's behind this, but I need to dig through the guts of the category tree to work out exactly what's happening.
In terms of the display, it looks fine for me, but I think this is related to the fact that subscript iota (a combining character) capitalises as a non-combining iota. Something, somewhere is making the assumption that diacritics never change on capitalisation, but this is an instance where they do. However, if that's down to the font you've got, there's not a huge amount we can do.
The reason for having Category:Greek terms spelled with ◌ͅ is because I recently upgraded the standard characters function so that it's no longer restricted to atomic Unicode characters, as they're often pretty arbitrary and not linguistically interesting. Before, any terms in categories for diacritics were only there because the diacritic couldn't form an atomic character with the base character (e.g. there is no Unicode character B̊, so the function was seeing the individual character ̊ and dumping them there instead). Since I made the change, I realised that it's probably quite interesting to see the spread of diacritics, too, so basically did the inverse of the first change by making sure all the atomic characters got categorised by diacritic as well (e.g. Å). However, the diacritic categories only appear if the diacritic isn't otherwise used in the language: so French Ÿ gets the category Category:French terms spelled with Ÿ, but there is no category for French terms spelled with ◌̈, because ◌̈ is a standard diacritic of French. Theknightwho (talk) 22:30, 14 May 2023 (UTC)
Judging by iota_subscript#Computer_encoding, the matter of uppercasing iota subscripts is a big mess. Unicode says to change them to adscripts (what I see), but a lot of usage stays with the subscript (what you see). An adscript isn't the same as an independent letter, which is why my typing the Greek ended up with a redlink. I'm guessing the module errors are due to different parts of the system (module code?) disagreeing on what exact Unicode codepoint an uppercased iota subscript is supposed to be: the part that converts the lowercase iota-subscripted letter to uppercase is using a different codepoint than the part that checks for uppercase vs. lowercase. Chuck Entz (talk) 00:01, 15 May 2023 (UTC)
I'm sure you're already done for the day/evening, and I'll be gone most of the day tomorrow, but just to be thorough: the alpha with iota subscript copypasted from the displayed title on the page in σοφίᾳ (sophíāi) is the precomposed character U+1FB3 (see ᾳ for the unicode info that displays even without an entry), and the category page has the precomposed character U+1FBC (see ᾼ). The first one is described as "GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI", where "ypogegrammeni" is the iota subscript, and the second one is "GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI", where "prosgegrammeni" is the iota adscript. The "Composition" section, however, has "Composition: α + ◌ͅ " for the first and "Composition: Α + ◌ͅ " for the second. I have no idea how that translates to the codepoints the module sees, but I figure some extra data won't hurt. I would also note that {{uc:ᾳ}} > ΑΙ, which is the U+1FBC codepoint, and {{lc:ᾼ}} > ᾳ, which is the U+1FB3 codepoint. Chuck Entz (talk) 02:31, 15 May 2023 (UTC)
This is getting stale. I just added the precomposed letters to the standardChars of Grek under "el" at Module:languages/data/2, but apparently with no effect (I noticed there were other precomposed letters there already). That's about all I dare to do myself, and even that may need to be reverted if there are unanticipated side-effects. We need to do something, though. Chuck Entz (talk) 01:38, 20 May 2023 (UTC)
@Chuck Entz This is now solved. The issue was that Module:category tree/poscatboiler/data/characters checks if single-character inputs are upper or lowercase, and throws an error if they're lowercase (with one edge-case exception if the uppercase form is actually standard, as sometimes happens with ı and I). It does this by running mw.ustring.upper on the input and then checking it's the same, because usually if you run it on a capital letter then nothing happens.
This caused a problem with ᾼ, because the output from mw.ustring.upper is actually ΑΙ (i.e. it capitalises the subscript iota and splits it out as a separate full character). This isn't an error, as it matches the Unicode specification, but it isn't what we want in this context (because what we really want is title case, but there's no function for that). I've added special-case behaviour for subscript iota so that it remains unchanged.
I'll investigate if there are any other oddities like this: one other I've accounted for so far is ß, which capitalises as SS by default. Instead we use ẞ, which is the correct form for contexts like this. Theknightwho (talk) 14:07, 21 May 2023 (UTC)
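(In module terms, the naive check and the special case look roughly like this; a sketch only, since the real code handles more edge cases.)
 -- Naive test: a character is "already uppercase" if upper() leaves it
 -- alone. This fails for ᾼ, because mw.ustring.upper("ᾼ") follows the
 -- Unicode spec and returns "ΑΙ", splitting the iota out as a full letter.
 local function is_uppercase_naive(ch)
     return mw.ustring.upper(ch) == ch
 end
 -- Special-case the capitals with prosgegrammeni so they count as
 -- uppercase without being round-tripped through upper().
 local function is_uppercase(ch)
     return mw.ustring.find(ch, "^[ᾼῌῼ]") ~= nil or mw.ustring.upper(ch) == ch
 end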
Latest comment: 1 year ago · 8 comments · 2 people in discussion
Please take a look. There are a ton of categories here related to 'LANG terms spelled with X' that shouldn't or arguably shouldn't be there. Most blatantly, Cyrillic is entirely normal for Mongolian. I also don't think digits in terms is remarkable for English. Please review what's there and make appropriate fixes to remove these categories from being created, thanks! Benwing2 (talk) 09:04, 16 May 2023 (UTC)
@Benwing2 Macedonian's been dealt with (which was an error on my part). Regarding numerical entries, they're not that remarkable in English, but if we take the most frequent number (1) and assume everything in Category:English terms spelled with 1 is a lemma (it's not), that only comprises 0.03% of English lemmas. I'd say that's rare enough to be notable. That being said, we may want to consolidate the categories for numerals.
Other than those, I'd say the categories are all fine: Tagalog, Danish and Norwegian don't use "C" in native words, for example. I think there needs to be a special exclusion for entries with "..." from being categorised as having ".", so I will deal with that. Theknightwho (talk) 09:58, 16 May 2023 (UTC)
I think at most we want one 'terms spelled with numbers' category, although I still believe it's better not to have these; nothing terribly remarkable about a term with a number in it (probably English terms with Q and Z are fairly rare too but not remarkable). Do you mean both Mongolian and Macedonian? Benwing2 (talk) 10:27, 16 May 2023 (UTC)
@Benwing2 The only Mongolian categories listed relate to the traditional script, and they're filtering out. We do have a few categories for Mongolian Cyrillic letters, but only for the rare ones used in borrowings: see Category:Mongolian terms by their individual characters.
My initial inclination was to exclude numbers and characters like "." as well, but people were adding them manually anyway. I judged it was better to just handle it automatically, as if we're going to have them we may as well do it properly, and I don't want to waste time periodically deleting them. Ultimately, they're pretty harmless. Theknightwho (talk) 10:42, 16 May 2023 (UTC)
The problem is that every one of these extra categories adds spam to the category list at the bottom of the page. On a page with a period or number in it, there may be several such categories if we allow them. So they're not totally harmless. I have already done runs in the past removing the manually added number categories and can do it again. Benwing2 (talk) 18:20, 16 May 2023 (UTC)
Chinese links and transliterations
Latest comment: 1 year ago · 2 comments · 2 people in discussion
Latest comment: 1 year ago · 4 comments · 2 people in discussion
Hi, could you add documentation on the changes you've made to Module:languages? They're extensive and not well-documented and I'm finding it hard to use. E.g. there used to be a function to return the type of a given language object, which seems to have been replaced with hasType(), but the usage isn't well documented and the interface itself may have issues (e.g. is there a simple way of telling whether a language object describes a regular language)? BTW something needs to be done about languages like Rudbari, which are both a regular language and an etymology language; this causes all sorts of problems. Benwing2 (talk) 05:30, 29 May 2023 (UTC)
@Benwing2 Yes, you're right that I need to do this. You can check if a language is regular with lang:hasType("regular"). In this case, I think the documentation points to Module:languages/data/2, which lists the three types, but it would be better to explain it more fully, as the regular type is only assigned if the language isn't one of the other types. Theknightwho (talk) 16:20, 29 May 2023 (UTC)
You're right about the languages which are both etym-only and regular - I don't really know what to do with them, as they're all languages I don't have any experience editing. Theknightwho (talk) 16:21, 29 May 2023 (UTC)
By "regular" does this exclude reconstructed and appendix-reconstructed languages? If so, this may be a mistaken choice. Traditionally "regular" languages are all languages that aren't etym-only. I would keep "regular" in this meaning and use "mainspace" to indicate languages that go in the mainspace (excluding those that go in Reconstruction: or the appendix). Benwing2 (talk) 22:43, 31 May 2023 (UTC)
Latest comment: 1 year ago · 4 comments · 3 people in discussion
I think these terms should never have the prefix reduced - they should be respelled with either a secondary stress or an unreduced marker - so the respelling "трехстру́нный" at трёхстру́нный (trjoxstrúnnyj) is quite rare and even unnatural, IMO. Anatoli T.(обсудить/вклад)01:27, 30 May 2023 (UTC)
@Atitarev What is this in reference to? The "fast/casual" pronunciation at трёхстру́нный (trjoxstrúnnyj) has been there a long time. Did User:Theknightwho recently push their translation-scraping code to production? If so and if the scraping is looking at pronunciations, maybe there needs to be a way of indicating a pronunciation should not be scraped (e.g. a boolean flag to {{ru-IPA}}). Benwing2 (talk) 22:47, 31 May 2023 (UTC)
@Benwing2: The fast/casual pronunciation was added in diff. I let it slide, although I don't fully agree; I thought it was OK for completeness' sake. I don't see why all possible incorrect rare pronunciations should be added, though, especially unsourced ones, so I might remove that.
I messaged Theknightwho, since he added a few words prefixed with трёх- (trjox-), all of them with a reduced pronunciation; he probably used that edit by KoreanQuoter as a model. In my opinion, the most accurate and common pronunciation of the prefix is or , respelled "трё̂х-" or "трё̀х", and the (reduced) one should go. Anatoli T.(обсудить/вклад)02:17, 2 June 2023 (UTC)
@Benwing2 @Atitarev Transliteration scraping isn’t live yet. Atitarev is right about what happened re the pronunciations, and of course him being a native speaker means I’ll defer to his expertise. Theknightwho (talk) 02:32, 2 June 2023 (UTC)
Question about "deleted edits"
Latest comment: 1 year ago · 4 comments · 2 people in discussion
Hi, Theknightwho. I have a question and I've discovered I don't know where I can find out any information.
I've looked at my "User contributions" and discovered that there is an "edit count". When I looked at my edit count I discovered that there were 50 "Deleted edits". Wiktionary allows me to see what contributions I've made but I do not know how to access information on which edits have been deleted. This concerns me, because I don't know what it is referring to. Is it edits I've deleted myself? Edits that someone has been going through and deleting? How can I locate these "deleted edits"? Can you tell me anything about this? Any help would be greatly appreciated.
@Bathrobe They're edits on pages which have since been deleted. That may be because you created them, but it may also be that they were on someone's userpage which got deleted, or something similar. Without knowing any of the details, it's not possible for me to say, but it's almost certainly nothing to worry about. Theknightwho (talk) 20:31, 30 May 2023 (UTC)
Ok, thanks! As you may know, I'm not on very good terms with one of the editors, one of the grand old men of Wiktionary, an admin, and an extremely prolific contributor from way back, who added some poor translations in languages he knew very little about. I was concerned that such an editor might have been simply deleting my contributions. I'm relieved that it's nothing like this. Bathrobe (talk) 21:54, 30 May 2023 (UTC)
weird displaytitle stuff in userspace
Latest comment: 1 year ago · 6 comments · 2 people in discussion
Hi. Go to User:Benwing2/test-ca-noun and click "Show preview" and you'll see a bunch of warnings for headwords that attempted to set displayTitle. This only applies to terms with accented characters in them for some reason. Any idea what's going on? Benwing2 (talk) 22:40, 31 May 2023 (UTC)
@Benwing2 Which bit do you mean? The module's been handling the display title since before I could edit the module, from the look of it.
I was considering moving it into Module:headword/data so that it only gets calculated once per page, but that ran into issues with determining the best script without a language, so that project is stalled for now. Theknightwho (talk) 22:58, 31 May 2023 (UTC)
The garbagey code in lines 732-770 of Module:headword. When I rewrote the module I added a FIXME and isolated the code in one place, but it may need a rewrite. The check for ASCII seems very hacky to me as well as some of the other stuff being done in that code. Benwing2 (talk) 23:04, 31 May 2023 (UTC)
@Benwing2 Yes, I agree. I did recently add a bit for unsupported titles, but it's a bit pointless unless the Javascript gadget can access it in some way. I remember doing the bit to keep the Han script region-neutral, too, but that was a while back. Theknightwho (talk) 23:07, 31 May 2023 (UTC)
Latest comment: 1 year ago · 2 comments · 1 person in discussion
Might I suggest that you check for lower(upper(x))=x before throwing an error? In template syntax that would be: {{lc:{{uc:ſ}}}}=ſ producing s=ſ. If false, don't throw the error. Chuck Entz (talk) 23:57, 2 June 2023 (UTC)
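(In module terms, the suggested guard is a one-liner; sketch below.)
 -- Only treat a lowercase character as an error if uppercasing it
 -- round-trips: lower(upper("ſ")) is "s", not "ſ", so caseless oddities
 -- like long s would no longer throw.
 local function has_real_uppercase(ch)
     return mw.ustring.lower(mw.ustring.upper(ch)) == ch
 end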
Can we do something to fix these? They've been in CAT:E literally for weeks, since you changed the category modules. They shouldn't be throwing errors at all, since the characters only exist in lowercase. Chuck Entz (talk) 13:27, 6 June 2023 (UTC)
'digraph' as POS
Latest comment: 1 year ago · 1 comment · 1 person in discussion
Could 'digraph' or 'multigraph' be added as an option for Module:headword/data? ch, for example, might count as a 'letter' in Spanish orthography (at least it was considered a letter for a while, maybe not anymore), but it isn't one in English, which means that technically we can have a Spanish entry for orthographic 'ch' but not an English one. Or is 'letter' already understood to include multigraphs?
This came up because of the (IMO) odd obsolete English digraph ée, which might deserve an entry, but I see that we don't have entries for ee or ea. kwami (talk) 06:59, 5 June 2023 (UTC)
Why do people hate you?
Latest comment: 1 year ago · 2 comments · 2 people in discussion
It feels like every month somebody has to publicly and explicitly denounce you as a tyrant and the scourge of Wiktionary. I don’t get it. —(((Romanophile))) ♞ (contributions) 16:10, 8 June 2023 (UTC)
@Romanophile I think it's a manipulation tactic: accusing someone of abusing their power makes it much more difficult to hold you accountable, because it "proves" you right. Once one person does it, it doesn't take much for other people who don't like being held accountable to do the same. Theknightwho (talk) 16:16, 8 June 2023 (UTC)
regular languages
Latest comment: 1 year ago · 6 comments · 2 people in discussion
Hi. Before things go on too much farther can we please use the type 'regular' to mean any non-etymology language and 'mainspace' to refer to what you have redefined 'regular' as? Formerly regular languages were any non-etymology languages but you redefined it to refer only to mainspace languages. There's not even any simple way any more of checking for regular languages. I can make this change myself but I'm not sure what code depends on the current definition of 'regular'. Thanks. Benwing2 (talk) 21:55, 10 June 2023 (UTC)
I'm in two minds as to whether we should change it to your suggestion: on the one hand, it makes explicit the distinction between full languages and etymology-only ones (i.e. variants), but on the other hand the current definition is a useful catch-all for any language with no special characteristics, and you can simply check for non-etym languages by doing if not lang:hasType("etymology-only") then .... Theknightwho (talk) 22:17, 10 June 2023 (UTC)
So you can't actually just check the way you mention; you need to also check for !family when calling getNonEtymological, which is why I make this suggestion. It is annoying and non-obvious to have to write two checks of the form "not A and not B" to get regular (full) languages, and if we introduce a new type of language-like entity in the future, it will break all such checks. Benwing2 (talk) 23:46, 10 June 2023 (UTC)
I see, yes I didn't realize the 'regular' language type in this defn goes back a way. If you'd rather we can call non-etymological languages "full" languages but IMO we do need a type for this. Benwing2 (talk) 23:49, 10 June 2023 (UTC)
@Benwing2 That's a good point, and I agree "full" is the best name here. Because types specified in the data are inheritable (i.e. etym-only children get all the types of their parents), the full and etymology-only types have had to be hard-coded into Module:languages instead, and get generated the first time :hasType gets called based on whether self:getNonEtymologicalCode() == self:getCode().
For the sake of completeness I've also implemented this for families, so if you only want languages you can do if lang:hasType("language", "full") then ....
I guess we could also modify the method so that you can check not-X as part of a multi-type check (e.g. if lang:hasType("language", "!etymology-only") then ...), but I'm not sure how useful this would be in practice. Theknightwho (talk) 14:11, 11 June 2023 (UTC)
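(If it did prove useful, it could be bolted on fairly cheaply. A hypothetical sketch - the "!" syntax doesn't exist as of this discussion:)
 -- Every argument must match; a leading "!" inverts the test for that
 -- argument. self._type is the cached set of this object's types.
 function Language:hasType(...)
     for _, t in ipairs({ ... }) do
         local negated = t:sub(1, 1) == "!"
         if negated then
             t = t:sub(2)
         end
         if (self._type[t] == true) == negated then
             return false
         end
     end
     return true
 end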
Latest comment: 1 year ago · 6 comments · 2 people in discussion
Hi. Many Spanish terms have origins in Lunfardo, which is a Buenos Aires argot derived by mixing Italian dialect words into Spanish grammar. It is given an etymological code es-lun, but that prevents deriving Spanish words from Lunfardo; as a result many existing etymologies wrongly use {{cog|Lunfardo|...}} in place of {{bor|es|es-lun|...}} or whatever. What is the correct solution here? Benwing2 (talk) 06:49, 14 June 2023 (UTC)
I don't think it's a creole, as creoles have their own grammar. Here the grammar is still Spanish. Wikipedia defines it as an "argot", which is its own class of thing. Benwing2 (talk) 15:34, 14 June 2023 (UTC)
Maybe it's similar to if Classical Latin borrowed terms from Vulgar Latin, which is entirely possible; it suggests we need a type of etymology language that runs concurrent with the non-etymology parent, so borrowing can go in both directions. Benwing2 (talk) 15:49, 14 June 2023 (UTC)
@Benwing2 Isn’t it only inheritance where there’s a restriction? Borrowing can go in both directions anyway, can’t it? Maybe I’m misunderstanding the issue. Theknightwho (talk) 17:25, 14 June 2023 (UTC)
Latest comment: 1 year ago · 3 comments · 2 people in discussion
I can see that you've been very busy with fixing sortkeys, but you really need to deal with the 500 (and counting) entries in CAT:E as a result of your adding error() to the scrape_page function in Module:Jpan-sortkey. The number has increased by almost a hundred in the time it took me to compose this message. Chuck Entz (talk) 19:24, 17 June 2023 (UTC)
I'm trying to clear the "...terms spelled with...read as..." cats. If I add "kun" in the first parameter of {{auto cat}} as the error message suggests, it displays "terms spelled with " ++" with its kun reading of", where should be the kanji in the name of the category, but is instead "Template:-l". I'm guessing the module is assuming that all of these language codes have counterparts to Japanese and its {{ja-l}}, since the display is what one would expect from putting a parameter in a non-existent template. Chuck Entz (talk) 21:08, 17 June 2023 (UTC)
@Chuck Entz Thanks. That bit of the function is little-used, and only there as a last resort. Thankfully that meant it was ~500 entries, not 100k.
The link template issue was very straightforward to solve by simply slotting in {{l}} instead, which means it'll work regardless of the language. They're just links to kanji with no transliterations, so the output is exactly the same as it would be. Theknightwho (talk) 21:20, 17 June 2023 (UTC)
memory errors
Latest comment: 1 year ago · 3 comments · 2 people in discussion
@Benwing2 I've not done anything major for a while, but Category:Proto-Niger-Congo language and Category:Proto-Atlantic-Congo language occasionally seem to throw errors for no apparent reason. Scribunto memory usage varies each time the page is refreshed, so a page near the limit will occasionally start throwing memory errors seemingly at random. I suspect it's exacerbated by the fact that it's almost all down to the family tree, as opposed to most pages close to the limit, where the usage comes from hundreds of separate template calls. Atlantic-Congo is the largest language family we have, so it's not massively surprising. Theknightwho (talk) 13:31, 19 June 2023 (UTC)
Latest comment: 1 year ago · 27 comments · 3 people in discussion
I think you asked me awhile ago to obsolete {{zh-der}}? I can do that but I forget exactly what needs to be done. Do you know where the discussion was? Benwing2 (talk) 03:00, 19 June 2023 (UTC)
Replacing {{zh-der}} with {{col3|zh}} suffices for most pages
If a derived term contains foo/bar (which is used to manually specify simplified forms), it should be replaced by foo//bar
If a derived term contains foo:bar, this should be converted to either foo<tr:bar> or foo<t:bar>. The distinction between tr and t is determined via the regexes at Module:zh/link#L-63.
If a derived term contains foo;bar, this should be converted to foo<q:bar>; when bar is the name of a lect, it needs to be linked with the langcode prefix, e.g. nan:foo or foo<q:Min Nan> for foo;Min Nan. (I'm not sure what the practice should be for this one; TKW has been doing the former and I the latter.)
Note that the ; and : syntaxes can (and usually do) occur simultaneously. The two are not that common (I estimate probably around 1400 in total, out of the 38k invocations of {{zh-der}}), so I think they could be done manually, since they require extra care in most cases, and also because there sometimes seems to be confusion between the two syntaxes.
If {{zh-der}} has |hide_pron=1, or if {{zh-der/fast}} is used, these need to be dealt with manually.
|fold= can be ignored in any case; |title= can be copied wholesale into {{col3}}
@Wpi Thank you. I think the langcode prefix form should be used if possible in place of using a qualifier, because it supplies the info in a machine-readable format, allowing for lang-specific linking and formatting as needed. Benwing2 (talk) 20:38, 19 June 2023 (UTC)
The code in Module:zh/link is a bit confusing but it appears to do special things with terms that have a circumflex (^) and/or an asterisk (*) in them. The circumflex seems to indicate that the translit should be capitalized, but there do not seem to be any occurrences of it in {{zh-der}} calls. The asterisk is less clear to me. There are 50 {{zh-der}} calls with an asterisk somewhere in them; any idea what to do with them?
For reference there are 87 {{zh-der}} calls that use hide_pron (out of 26,873 total; this does not include any aliases of {{zh-der}}, namely {{zh-list}}, {{zh-syn-list}} and {{zh-ant-list}}; I still need to investigate them), and 390 that include a semicolon anywhere in the {{zh-der}} call.
For lect names after semicolons, if you prefix the term with the corresponding langcode, it doesn't display the lect name. I'm thinking of adding some syntax to {{col}} for use in conjunction with langcode prefixes to display the corresponding language name as a qualifier; what do you think of this? If you think it's a good idea, any suggestions for the syntax?
The current code uses {{col3}} unless the pagetitle has more than one character, in which case it uses {{col2}}. Should we duplicate this functionality or always use {{col3}}?
I am planning on checking to see in case of TRAD/SIMP whether the SIMP is redundant, and removing it if so. Sound good?
When a note following a semicolon is not a lect name, should it use <q:...> (display to the left) or <qq:...> (display to the right)?
There are occurrences of items like 東城:] and 宜君:] in the {{zh-der}} calls. Are these supposed to be (badly specified) translits? They end up as glosses in the converted output. What should be done about them?
The following item 拍損:phah-sńg;Min Nan ends up as nan:拍損<t:phah-sńg> in the converted output. I assume this should be a translit and the regexps need to be tweaked to recognize things like ń?
1. * suppresses simplification and pinyin (Sorry I forgot this one). For the warnings you got below, 乾 is usually simplified to 干 but for that particular etymology the simplified form is 乾 itself. Sometimes people use it simply to disable pinyin, which is a valid but not ideal use case; I think these will need to be done manually as well.
3. Maybe?
4. I think we discussed this before(?), and {{col3}} is still very readable on multiword items.
5. That makes sense, especially since on some pages the simplified form is specified to reduce memory usage, but that doesn't really help in {{col3}} according to TKW.
6. No strong preference, but putting the qualifier after the term aligns the characters.
7. These are place names, which are supposed to be glosses.
8. Correct. The regexes in Module:zh/link were probably written quite some time ago and don't really do the job properly.
@Wpi: Thanks for the detailed replies. As for #1, do you mean all occurrences of asterisks should be handled manually? This seems doable as there are only around 50 template calls using them. For #4, I'll use {{col3}} everywhere. For #6 I'll put the qualifier after the chars. For #8 I'll fix up the regex for ń and maybe for other tone marks on n (can they also occur on m?). Benwing2 (talk) 07:15, 22 June 2023 (UTC)
@Wpi: One more question: what about cases like 潑雨;Hakka, Min Nan where two lects are listed after the semicolon? Currently it's unrecognized as a lect and ends up as a qq qualifier. Benwing2 (talk) 07:58, 22 June 2023 (UTC)
For #1, that should be doable. The correct procedure would be to convert *term to term// (i.e. an empty second form to suppress simplified forms), but I don't think it's worth the hassle to bot the 50 or so instances.
For #8, yes. Other characters to look out for are: , combined with tone marks U+0300, U+0301, U+0302, U+0304, U+0306, U+030C, U+033F which occur in Kienning Romanised for Min Bei; , combined with U+0300, U+0301, U+0302, U+0304, U+0306, U+030D in Min Nan; combined with U+0300, U+0301, U+0302, U+030D in Hakka; and of course the combining forms. It might also be useful to modify the second regex to (+ )*+ to try and find the transliterations unaccounted for.
For the final question, I suppose we could list them twice, once for each lect, but it feels weird given that we don't duplicate the ones with Mandarin in them. For now I reckon let's just leave them as is, i.e. <qq:Hakka, Min Nan>
@Wpi: What I ended up doing for the regexes is just decompose everything and look for any of the combining diacritics 0300 (GRAVE), 0301 (ACUTE), 0302 (CFLEX), 0304 (MACRON), 0306 (BREVE), 0307 (DOTOVER), 0308 (DIAER), 030B (DOUBLEACUTE), 030C (CARON), 030D (VERTLINEABOVE), 030F (DOUBLEGRAVE), 0323 (DOTUNDER), 0324 (DIAERUNDER), 033F (DOUBLEMACRON), 0358 (DOTABOVERIGHT) as well as superscript ⁿ. Benwing2 (talk) 10:27, 22 June 2023 (UTC)
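(The module-side equivalent of that check would look something like this - a sketch using the codepoints listed above:)
 local TONE_MARKS = {
     [0x300]=true, [0x301]=true, [0x302]=true, [0x304]=true, [0x306]=true,
     [0x307]=true, [0x308]=true, [0x30B]=true, [0x30C]=true, [0x30D]=true,
     [0x30F]=true, [0x323]=true, [0x324]=true, [0x33F]=true, [0x358]=true,
 }
 -- Decompose, then scan for any combining tone mark (or superscript ⁿ,
 -- U+207F) that betrays a romanisation.
 local function looks_like_translit(s)
     local decomposed = mw.ustring.toNFD(s)
     for _, cp in ipairs({ mw.ustring.codepoint(decomposed, 1, -1) }) do
         if TONE_MARKS[cp] or cp == 0x207F then
             return true
         end
     end
     return false
 end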
OK, I have written the script. Some sample output:
Page 137 乾: WARNING: Ignoring fold=1: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
Page 137 乾: WARNING: For traditional 乾圖, explicit simplified 乾图 doesn't match auto-conversion 干图 in 3=乾圖/乾图, specifying explicitly: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
Page 137 乾: WARNING: For traditional 乾斷, explicit simplified 乾断 doesn't match auto-conversion 干断 in 6=乾斷/乾断, specifying explicitly: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
Page 137 乾: WARNING: For traditional 乾清宮, explicit simplified 乾清宫 doesn't match auto-conversion 干清宫 in 9=乾清宮/乾清宫, specifying explicitly: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
Page 137 乾: WARNING: For traditional 乾紅, explicit simplified 乾红 doesn't match auto-conversion 干红 in 10=乾紅/乾红, specifying explicitly: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
Page 137 乾: WARNING: For traditional 乾綱, explicit simplified 乾纲 doesn't match auto-conversion 干纲 in 11=乾綱/乾纲, specifying explicitly: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
Page 137 乾: WARNING: For traditional 顛乾倒坤, explicit simplified 颠乾倒坤 doesn't match auto-conversion 颠干倒坤 in 18=顛乾倒坤/颠乾倒坤, specifying explicitly: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
Page 137 乾: WARNING: For traditional 乾華, explicit simplified 乾华 doesn't match auto-conversion 干华 in 19=乾華/乾华, specifying explicitly: {{zh-der|fold=1|*乾乾|*乾元|乾圖/乾图|*乾坤|*乾宅|乾斷/乾断|*乾旦|*乾曜|乾清宮/乾清宫|乾紅/乾红|乾綱/乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤/颠乾倒坤|乾華/乾华}}
I'm not sure what's going on with the non-matching simplified stuff, let me know if it makes sense to you. Note also the asterisks in the output, which are getting carried over literally; I don't think this is correct. Benwing2 (talk) 04:55, 22 June 2023 (UTC)
@Wpi, Theknightwho I am going to run the script on {{zh-der}} and {{zh-list}}. I am planning on implementing syntax in {{colN}} whereby writing e.g. nan-hok::冷吱吱 with two colons causes the language name to be displayed as a right qualifier, equivalent to nan-hok:冷吱吱<qq:Hokkien>. Thoughts? Benwing2 (talk) 19:43, 22 June 2023 (UTC)
@Benwing2 Thanks - that sounds good. In terms of the double colon, I'd prefer if the language name always showed as a qualifier if there's a language override in a column template, as otherwise there's nothing visible to indicate it. Maybe we should exclude mul from that, though. Theknightwho (talk) 19:46, 22 June 2023 (UTC)
Yup, that makes sense. I was thinking originally of doing exactly what you proposed but then realized it would be problematic with mul links; but excluding them from the qualifiers should work fine. Benwing2 (talk) 19:49, 22 June 2023 (UTC)
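A minimal sketch of how such a prefixed item might be parsed under the behaviour agreed above (show the language as a right qualifier for any override except mul); parse_item is a hypothetical name, and the real column code will differ:

    -- Split "nan-hok::冷吱吱" or "nan-hok:冷吱吱" into a language override and a term.
    local function parse_item(item, default_lang)
        local code, term = item:match("^([%w-]+)::?(.+)$")
        local lang = code and require("Module:languages").getByCode(code, nil, true)
        if not lang then
            return default_lang, item, false -- no (valid) override
        end
        -- Display the language name as a right qualifier unless it's mul.
        return lang, term, lang:getCode() ~= "mul"
    end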
@Wpi, Justinrleung These two templates are redirects to {{zh-der}}, but the code that implements {{zh-syn-saurus}} looks specifically for {{zh-syn-list}} (and likewise {{zh-ant-saurus}} looks specifically for {{zh-ant-list}}). I am thinking they can be replaced with {{col3}} where the title can be an abbreviation for Synonyms or Antonyms, expanded appropriately (I've been meaning to implement such abbreviations for a while now, for various reasons), and {{zh-syn-saurus}} can look for a {{col3|zh}} whose title is "Synonyms" or the appropriate abbreviation. More specifically, instead of writing {{col3|zh|title=Synonyms}} you could write something like {{col3|zh|ti=@syn}} or maybe just {{col3|zh|ti=@s}}, where |ti= is a shorter alias for |title= and things beginning with at-signs are abbreviations. Similarly, @der or @d means "Derived terms", @rel or @r means "Related terms", etc. How does this sound? An alternative is to make the first numbered term after the language be the title with abbreviations if preceded by the appropriate char, e.g. {{col3|zh|@syn|*乾乾|*乾元|乾圖//乾图|*乾坤|*乾宅|乾斷//乾断|*乾旦|*乾曜|乾清宮//乾清宫|乾紅//乾红|乾綱//乾纲|*乾象|*乾造|*乾道|*乾陵|*乾隆|*朝乾夕惕|顛乾倒坤//颠乾倒坤|乾華//乾华}} for synonyms. Benwing2 (talk) 07:26, 22 June 2023 (UTC)
@Theknightwho, Wpi What do you think of my proposal here? We should generalize {{zh-syn-saurus}} to something like {{syn-saurus|zh}}; all it does is look up synonyms or antonyms in the specified thesaurus page and display them. BTW I am also thinking of changing the bare {{col}} template so the number of columns is specified using a named param |n= instead of a numbered param; that way, if |n= is omitted, you get some sort of auto behavior, similar to {{col-auto}} but with fewer chars. Benwing2 (talk) 19:53, 22 June 2023 (UTC)
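A sketch of the proposed @-abbreviation lookup for |ti=/|title=, with hypothetical names and an illustrative subset of the abbreviations mentioned above:

    local title_abbrevs = {
        ["@syn"] = "Synonyms", ["@s"] = "Synonyms",
        ["@ant"] = "Antonyms", ["@a"] = "Antonyms",
        ["@der"] = "Derived terms", ["@d"] = "Derived terms",
        ["@rel"] = "Related terms", ["@r"] = "Related terms",
    }
    local function expand_title(title)
        if title and title:sub(1, 1) == "@" then
            return title_abbrevs[title]
                or error("Unrecognized title abbreviation: " .. title)
        end
        return title
    end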
@Wpi Conversion of {{zh-der}} and {{zh-list}} is in progress; conversion of the syn/ant stuff will come next. See User:Benwing2/zh-der-misc-warnings for warnings related to {{zh-der}} conversion other than those related to |hide_pron= (80 in total) and User:Benwing2/zh-der-hide-pron-warnings for automatic warnings triggered whenever |hide_pron= is seen (87 in total). Hopefully this isn't too many to process by hand but let me know if there's a way of automatically dealing with the |hide_pron= stuff. Benwing2 (talk) 03:55, 23 June 2023 (UTC)
@Benwing2: For {{zh-syn-saurus}}, I think that makes sense if we want to extend the format to other languages.
Some of the warnings could be automated: more-than-one-slash can be safely converted from a/b/c to a//b//c, and more-than-one-colon is just the coexistence of both t and tr. The {{{hide_pron}}} stuff is probably automatable if it's just a plain list of terms with no special formatting, but I figure it's not worth the hassle to write the checks for the bot; plus those are usually used on pages with memory errors, so it might trigger them. – Wpi (talk) 05:46, 23 June 2023 (UTC)
@Wpi: BTW the number of cases with more than one slash or colon is small, so it might be easier to handle them manually; let me know if you want me to write the bot code to handle them automatically. Benwing2 (talk) 06:22, 23 June 2023 (UTC)
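If it were botted, the slash fix could be a one-liner along these lines (normalise_slashes is hypothetical, and the sketch is in Lua for consistency with the rest of this page, though the bot itself is Python):

    -- Convert a legacy term like "a/b/c" to "a//b//c", but only when more than
    -- one slash is present and the term isn't already using "//" separators.
    local function normalise_slashes(term)
        local _, count = term:gsub("/", "/")
        if count > 1 and not term:find("//", 1, true) then
            term = term:gsub("/", "//")
        end
        return term
    end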
So here's a progress update: I've gone through all the warnings for the thesaurus pages (as well as a bunch the bot missed, like an asterisk at the end of a term) and cleaned them all up.
During this I found that:
Some of the lect qualifiers were not recognized by the bot (or were sometimes missed by me); these seem to be mostly the etym codes under Hokkien, plus Sichuanese (that etym code is new, so we've decided on calling it Sichuanese, though it used to be split between Sichuanese and Sichuan; <qq:Sichuan> is the same as <qq:Sichuanese> and should be converted to cmn-sic:).
@TKW It would be helpful to have the Hokkien etym codes autogenerate pronunciation for the subdialect.
Speaking of new etym codes, there are a number of qualifiers that would make sense as new etym codes, e.g. Taiwanese Hokkien, Hong Kong Cantonese, Beijing. I'll likely propose them on BP shortly, and after that the bot conversion for the qualifiers will probably need to be rerun.
The bot did not convert qualifiers like <qq:Cantonese, slang> or <qq:Wu, dated> to the ones with prefixes. I've changed them on sight, but there are a lot of them. We might want to use a bot to clean these up. Note that literary refers to the entire Chinese language and not the lect, so <qq:literary, Cantonese> is not yue:<qq:literary>. EDIT: I've cleaned up all the ones I could find.
There are also weird qualifiers like <qq:chiefly Cantonese>, which I think would benefit from some syntax that allows writing custom qualifiers in the style of {{place}}, e.g. <qq:chiefly <<yue>>>. This might also solve the problem with multiple lects, so <qq:Hakka, Min Nan> could become <qq:<<hak>>, <<nan>>> (though I'm not sure how the transliteration would be handled); see the sketch after this list. However, I feel like this might be adding too much bloat to the columns module.
Some stuff that should be in <t:gloss> was put into <qq:gloss>, because the people who added it used the wrong syntax; there's not much we can do about that currently.
When zhx-min: (a family code) is used as a prefix, there's some unexpected behaviour. There are some words that are shared across the entire Min family, so it makes some sense to use it.
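The custom-qualifier idea above could work along these lines (expand_qualifier is hypothetical, and transliteration handling is left open, as noted):

    -- Expand "chiefly <<yue>>" to "chiefly Cantonese" by replacing each
    -- <<code>> with the language's canonical name.
    local function expand_qualifier(qual)
        return (qual:gsub("<<(.-)>>", function (code)
            local lang = require("Module:languages").getByCode(code, nil, true)
            return lang and lang:getCanonicalName() -- leave invalid codes as-is
        end))
    end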
@Wpi Thanks. I'll take a look. I think the issue with the lect qualifiers missing is that the bot was only looking for the canonical name of etym languages. I'll redo the code to look at all names and see if I can fix the issues. Benwing2 (talk) 19:06, 23 June 2023 (UTC)
@Benwing2: (Sorry for the pings) I think the problem with the lect names is simply that the name is not listed in the module data, not even as an alias. There aren't a lot, so I think I'll do them by hand. – Wpi (talk) 06:18, 24 June 2023 (UTC)
@Wpi Those pings you gave me are mostly cases of a lect name + comma + qualifier. If there are a lot of these (too many to do by hand) I can look into writing a script to fix them. Benwing2 (talk) 06:30, 24 June 2023 (UTC)
Latest comment: 1 year ago · 3 comments · 2 people in discussion
You've promised several times to fix the documentation for Module:languages. Documentation is hella important; can you please please please actually do this? In this case I want to use lang:transliterate(), and it now returns three values (which I've told you before is a bad idea, but that's another discussion). I don't know what those three values are, and it's not documented. Benwing2 (talk) 03:04, 19 June 2023 (UTC)
Latest comment: 1 year ago · 3 comments · 2 people in discussion
Some or all auto-cat categories are in this hidden category. Is this intentional? If so and if it's not just a temporary hack, it needs to be moved out of user space. Benwing2 (talk) 06:09, 19 June 2023 (UTC)
@Benwing2 This is intentional, though the work I was doing has stalled so I'll create a proper maintenance category. They're all using tables of contents that don't have a language code specified, which means they can't take advantage of proper sortkeys very easily. Theknightwho (talk) 13:33, 19 June 2023 (UTC)
@Benwing2 It means the parameter table can be stored in Module:parameters/data, so it doesn't get duplicated every time the module is called. If the new preparsing method works, we can probably do away with this, as the saving is pretty small. Theknightwho (talk) 02:45, 23 June 2023 (UTC)
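Schematically, the arrangement being described looks like this (the key layout is an assumption; the real Module:parameters/data may be organised differently):

    -- Module:parameters/data returns a plain table of parameter specs, e.g.
    --   return { l = { [1] = { required = true }, [2] = {}, t = { alias_of = "gloss" } } }
    -- Callers then fetch it read-only via mw.loadData, so the table is built
    -- once per page rather than once per #invoke:
    local specs = mw.loadData("Module:parameters/data")
    local params = specs.l -- hypothetical key for one template's parameters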
@Benwing2 Thanks - I think it’s to do with stuff inside nowiki tags being processed instead of ignored, and possibly something else relating to strip markers in general. Theknightwho (talk) 12:51, 23 June 2023 (UTC)
Your code needs to be more robust in dealing with tags in general: the false positive at nonvanishing is due to expressions inside <math></math> tags, for instance, and I don't think it's a coincidence that there are HTML comments in proximity to other false positives. Then there are the timeouts: there has to be some way to limit the time spent by a single transclusion. Wiktionary:Todo/long usage examples goes to a server timeout, so I can't even look at it without disabling my edit-mode-preview setting, but I viewed that page within the last week and remember it having a column for examples which, if memory serves, are cut off at a certain number of bytes, leaving incomplete expressions at the end. It would be nice not to have whole sections or the whole page broken by individual errors.
Finally, the exclamation mark in the "Invalid syntax detected!" error message is just silly: it already says "Lua error" in screaming red. You might as well add "OHHH NOOO!!!!" or "Dear me, I think I'm going to faint!"... Chuck Entz (talk) 18:11, 23 June 2023 (UTC)
@Chuck Entz So there were some infinite loops going on that were causing problems, because it's (almost) impossible to tell something loaded via frame:preprocess that it shouldn't try to load the cached data, because it's being called while generating the cached data in the first place. I've managed to find an awkward way around it, though. I'll have a look at the others now. Theknightwho (talk) 18:22, 23 June 2023 (UTC)
It might also help to think about why the same thing can work fine when you preview the section, but has a false positive when the whole page is viewed, as with the Synonyms section at volition. By commenting things out there, I found a parameter in a quote template elsewhere in the Adjective section that had mismatched square brackets. I think you need to work on the limits of the parsing: the error message makes it look like the problem is in the first template, while it's actually somewhere else entirely; the error actually makes it harder for the average user to find the problem. Chuck Entz (talk) 19:16, 23 June 2023 (UTC)
@Chuck Entz Got it - thanks. I've put the template parser inside a pcall function, which detects that an error has been thrown without actually throwing it. That way, it can fail a particular template parse without bringing down the whole thing. If the template is actually supposed to throw an error, that'll happen once the "real" template call is done later on. Theknightwho (talk) 19:38, 23 June 2023 (UTC)
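A schematic of that pcall wrapper (parse_template stands in for the real parsing routine and is hypothetical):

    -- Try to parse one template; on failure, return the raw wikitext untouched
    -- instead of bringing down the whole page parse.
    local function try_parse_template(parse_template, text)
        local ok, result = pcall(parse_template, text)
        if ok then
            return result
        end
        return text -- the "real" template call later will raise any genuine error
    end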
@Chuck Entz I’ve been doing some work on a more general-purpose parser, which is based on a Python wikitext parser that’s commonly used by bots (mwparserfromhell). Parsing gets really complicated when you get into nested templates - especially when dealing with {{{template arguments}}}, or when {{{{templates are used}} to | generate template names}}. Fortunately, it’s not particularly resource-heavy, and should be able to prevent the issues we’re seeing here. Theknightwho (talk) 16:31, 24 June 2023 (UTC)
@Benwing2 I think it's stabilised at about the position we were at before - I'm not seeing any parser errors at the moment; the remaining errors are all memory-related.
The full parser is gradually getting there (Module:User:Theknightwho/parser). There's a bug in the tokenizer relating to wikitables that's proving difficult to isolate, but other than that the tokenizer's complete - and it's by far the most complex component. I'm pretty pleased with the performance so far, but coming in under 10 seconds on extremely large pages may prove to be a challenge: tokenising a 100KB page takes about 3 seconds at the moment. Theknightwho (talk) 19:26, 26 June 2023 (UTC)
By errors I mean memory errors. There are 59 entries in CAT:E and most of them are related to memory errors or timeouts. There weren't that many before. Benwing2 (talk) 20:43, 26 June 2023 (UTC)
As a general rule, you shouldn't launch code before it's ready. "It's gradually getting there" isn't a good solution for memory errors IMO. Benwing2 (talk) 20:44, 26 June 2023 (UTC)
BTW maybe there's a way to avoid parsing all parts of a complex page, e.g. skip some of the tables or other complexities and let the built-in PHP parser worry about that. Benwing2 (talk) 20:46, 26 June 2023 (UTC)
@Benwing2 The general-purpose parser is a completely different module, and isn't deployed yet. Aren't most of the new memory errors down to the recent changes to the Chinese templates? Theknightwho (talk) 20:46, 26 June 2023 (UTC)
@Benwing2 Just as an update on this, the general template parser is now in an alpha-ish state: at User:Theknightwho/sandbox10 it's able to render a copy of the page teacher (saved at User:Theknightwho/sandbox9) in a single module call, reading directly from the raw wikitext. I've had to comment-out a couple of templates that nest piped wikilinks in templates (as I forgot to account for them), but arbitrary template/argument nesting, noinclude/includeonly tags, most parser functions and most other things it needs are now working (e.g. T:langname-lite, T:quote-meta and T:q are no bother).
In terms of performance, it's adding about 0.2 seconds to the rendering time (about 2 seconds instead of 1.8), while reducing the memory consumption from 49MB to 31MB. I still need to add a lot of memoization, so I'm sure both of those will come down even further. Theknightwho (talk) 18:55, 6 July 2023 (UTC)
This is great, maybe you could write up a bit on how it works and esp. what the general principles are, for someone who isn't so familiar with how MediaWiki works? We'll need that documentation before this goes live so that other people can fix the parser if something goes wrong. Benwing2 (talk) 19:43, 6 July 2023 (UTC)
@Benwing2 Definitely - I want to get it into a semi-reliable state first, in case things need refactoring. I've just solved the issue with piped links, by the way, so the sandbox now works with those templates included as well.
One of the biggest challenges has been minimising the amount of parsing that needs to be done, which means things like {{#if:}} and {{#switch:}} need to be calculated dynamically as they're being parsed, instead of being passed through frame:callParserFunction() (which would necessitate evaluating every parameter in advance). Theknightwho (talk) 23:49, 6 July 2023 (UTC)
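A minimal sketch of that lazy evaluation for {{#if:}}, where expand is a hypothetical callback that expands a still-unparsed argument on demand:

    -- Only the branch actually selected is ever expanded, so templates in the
    -- other branch are never parsed. MediaWiki's #if treats a whitespace-only
    -- condition as empty, hence the %S test.
    local function parser_if(args, expand)
        local condition = expand(args[1]) or ""
        if condition:match("%S") then
            return expand(args[2]) or "" -- "then" branch
        end
        return expand(args[3]) or "" -- "else" branch
    end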
@Benwing2 Just FYI, adding the line mw.loadData = require in the parser tester reduced memory usage down to 22.4MB, and loading time is now equivalent with the standard page. Theknightwho (talk) 14:33, 8 July 2023 (UTC)
This is great! You should test on some other high-memory pages to see if the gains are similar across-the-board. Benwing2 (talk) 18:45, 8 July 2023 (UTC)
You should also test it on time hogs like Coptic entries with humongous inflections and frequency list appendices. Chuck Entz (talk) 21:08, 8 July 2023 (UTC)
@Chuck Entz Good call. (Another way to bring the Coptic entries within time and memory limits I think is by rewriting in Lua, but that would take some effort.) Benwing2 (talk) 21:12, 8 July 2023 (UTC)
@Benwing2 @Chuck Entz So it still runs into trouble on huge pages - partly because there are some bits I've not implemented yet - but also because it does seem to run into time-out issues.
I'll see if I can redesign it to convert templates into functions (instead of re-building them from the cached tokens every time), as that has the potential to massively speed things up. It's a bit tricky, since the parameter values affect which bits get parsed by the builder (which is why it's redone each time), and I imagine it will be a little less memory-efficient, but I think that's fine. Theknightwho (talk) 21:16, 8 July 2023 (UTC)
Sounds good. As a last resort we could add something to the page text of specific pages to indicate that the parser shouldn't be used, but we should avoid the need for this if possible. Benwing2 (talk) 21:19, 8 July 2023 (UTC)
@Benwing2 Curiously, it's actually faster for ⲱϣ (ōš) (about 8 seconds), which I suspect is because the Coptic templates aren't that complex for the parser, but there are tons of links all being processed by separate invocations.
@Benwing2 @Chuck Entz As an example of a huge page, I've copied about 3/4 of a into User:Theknightwho/sandbox11 and converted all the lite templates into regular ones, which is about the point where memory errors start happening under normal circumstances. The parser manages it in about 8.5-9 seconds using 35MB, versus 5 seconds using 52MB (i.e. right on the cusp of memory errors).
If I'm able to get a little more speed out of it, I'd say we might even be able to remove all the lite templates from a, as memory increases seem to scale logarithmically with page length. Theknightwho (talk) 05:41, 9 July 2023 (UTC)
@Benwing2 Hmm - the memory use on that page goes down to 23MB(!) if you make package.loaded a weak table, but I'm unsure if that might have weird side effects. By comparison, the original test page ("teacher") goes down to 13.8MB (a reduction of 72%). Even a page using multitrans like "wolf" sees a 50% reduction (38.5 to 19.5MB). Theknightwho (talk) 07:04, 9 July 2023 (UTC)
Oh - one other thing I forgot to mention is that the post-expand include size is halved, too, which helps pages like 不 etc. Theknightwho (talk) 07:32, 9 July 2023 (UTC)
That I don't understand. If you make package.loaded a weak table, does this effectively require() each module once per page? I'm not sure what you are referring to by post-expand include size and why it would be halved. Benwing2 (talk) 07:37, 9 July 2023 (UTC)
BTW heading to bed now ... it's 2:39am local time. Please feel free to post further responses and updates, though, and I'll see them in the (late) morning. Benwing2 (talk) 07:39, 9 July 2023 (UTC)
@Benwing2 In theory it means that require has to load things from scratch every time, since no external modules are in use whenever the parser's not actively loading a template. In practice, though, heavily-used modules like MOD:languages will rarely be out of use (meaning that most garbage collect cycles won't affect them), while language-specific modules that are never used again will be getting periodically cleared.
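The change being discussed is essentially this one-liner, shown here out of context:

    -- Make package.loaded weak-valued: a cached module table can then be
    -- garbage-collected once nothing else references it, and require() will
    -- simply reload it from scratch if it is ever needed again.
    setmetatable(package.loaded, { __mode = "v" })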
The post-expand include size is sometimes called the template include size, I think - I mentioned it on the Grease Pit a little while ago, as it was affecting some of the very large Chinese entries. Certain text can have a multiplier applied to it (e.g. if it's used in certain parser functions), and I think I've jumped the gun as it looks like simple transclusion doubles the weight, so the reason it's halved is probably because I've been using a bare {{#invoke:}} for testing.
By the way, we're actually dangerously close to the include size limit on a (2,086,521 / 2,097,152), since the lite templates have tons of multipliers applied to their outputs due to loads of parser functions and several layers of transclusion. Once it's exceeded, templates just stop loading and show bare links to the template page instead. I know there are a couple of other pages in a similar position, but I can't remember which right now.
@Benwing2 Just as a general update - I've done a major redesign of the parser in order to eke out some more speed, which is nearing completion. I thought this was going to be at the expense of memory, but it seems memory use has dropped even further anyway, and, fingers crossed, it should be able to match or even beat the native parser in terms of speed. I'll let you know once it's in a position to go through proper testing. Theknightwho (talk) 22:39, 23 July 2023 (UTC)
Latest comment: 1 year ago · 2 comments · 1 person in discussion
Just FYI, the otherNames column in the list of etymology languages isn't properly populated, e.g. 'Bohairic Coptic' should have 'Bohairic' and some other things listed, but none are. I checked the code in Module:list of languages and it's directly accessing data.otherNames. Either this is the wrong field or something is wrong with that field. Benwing2 (talk) 22:39, 23 June 2023 (UTC)
I fixed this but there is still a lot of breakage in WT:LOL/S, e.g. the parents column of the etymology languages. In general this code needs some rewriting given all the changes you and others have made to Module:languages and the data format. Benwing2 (talk) 02:15, 24 June 2023 (UTC)
Latest comment: 1 year ago · 1 comment · 1 person in discussion
Hi, I am proposing deprecating or deleting a bunch more form-of and lang-specific headword templates that are trivial wrappers; see WT:RFDO#June 2023. You have commented in the past about these; feel free to take a look when you have a chance. Thanks! Benwing2 (talk) 08:05, 27 June 2023 (UTC)
Latest comment: 1 year ago · 6 comments · 2 people in discussion
Hi. You added a bunch of Mongolian-specific tags in Module:form of/data2. I added support for language-specific tags to Module:form of so I'd like to move all the lang-specific tags out of Module:form of/data2 and into the appropriate lang-specific modules. Can you help me do this, either by moving the tags (which is not hard) or by identifying which tags are Mongolian-specific, so I can do it? See Module:form of/lang-data/sw for an example. All you need to do is create Module:form of/lang-data/mn along the same lines and move the tag data. This should help reduce memory pressure. Thanks! Benwing2 (talk) 02:34, 28 June 2023 (UTC)
OK, you also need to add the language code of the module with lang-specific tags to the table at the top of Module:form of. Without this table, pcall() on a non-existent module seems to waste a ton of memory. Benwing2 (talk) 02:49, 28 June 2023 (UTC)
@Benwing2 Have you tried using package.loaders to grab modules instead? It's the primary loader used by require, and returns a function if the module exists or nil if not. The returned function takes no arguments, and will return the contents of the module if called. Doing it that way means we can conditionally load modules without any risk of error, so pcall is never necessary. Not sure how much it'd help with the memory issues, but it might be worth a shot? Alternatively, we could write a new require function (safe_require maybe?), which has this stuff built-in. Theknightwho (talk) 12:13, 9 September 2023 (UTC)
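A minimal sketch of that safe_require idea, assuming (per the description above) that the Scribunto module loader sits at package.loaders[2]:

    -- Ask the module loader for the module directly: it hands back a function
    -- if the module exists, so no pcall is ever needed.
    local function safe_require(name)
        local loader = package.loaders[2](name)
        if type(loader) == "function" then
            return loader()
        end
        return nil -- module doesn't exist
    end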
I haven't yet but this seems like a good idea. I think it might also be possible to just check if the module page exists using the regular page-exists functions. Maybe that's considered an "expensive" call but it doesn't take up memory. Benwing2 (talk) 18:14, 9 September 2023 (UTC)
Latest comment: 1 year ago · 1 comment · 1 person in discussion
Hi. I got sick of the undiscussed piecemeal changes you're making to the core modules, so I reverted some of them, which significantly brought down the contents of CAT:E (from ~50 to 31 now, some of which are not related to memory or timeout issues). You've promised multiple times not to make core changes without prior discussion, and to create design documents outlining what you were going to do before doing them, and you've never actually done these things. From now on I'm going to be more aggressive about reverting such changes; you need to test your changes in your userspace sandbox and then push them as a whole once your parser is finished and you've verified that the result actually does reduce memory and increase speed. I know you keep saying that isn't possible, but I've never had an issue with this approach, even with 10+ modules needing to be copied into user space and their links modified. I think there's also another method that User:Erutuon uses that avoids the need for the copy-and-modify-links approach; you might want to ask them. Benwing2 (talk) 05:39, 28 June 2023 (UTC)
Latest comment: 1 year ago · 3 comments · 2 people in discussion
Hi again,
Actually, "high front unrounded vowel" is not necessarily accurate, and the transcription for that would be , not */i/. But the lexical transcription "ǣ" is specifically the FLEECE vowel, regardless of its phonetic value, so that is the more accurate description. It's not a "Wikipedia approximation", it's the standard lexical set -- it's the IPA that's a WP approximation in this case. Also, an example can be useful, and that example is the one given by the dictionary that was my source. kwami (talk) 19:59, 11 July 2023 (UTC)
@Kwamikagami That makes sense. It would possibly be good to deal with the lexical sets in a more standardised fashion (which includes a link to that Wikipedia page), as it would prevent misunderstandings like mine. Theknightwho (talk) 14:20, 12 July 2023 (UTC)
That's a good idea. I'll start adding links, as soon as I get done with all the fake "translingual" entries that are specifically Burmese. kwami (talk) 23:10, 12 July 2023 (UTC)
Are subscript hiragana common for Miyako?
Latest comment: 1 year ago · 2 comments · 2 people in discussion
I see the 'strawberry' example with ㇲ゙ has an alt form in all hiragana. That makes me suspect that katakana is only used electronically because it's available from Ainu (perhaps by Jlect's source). As do examples like っざら ~ ズざら. If hiragana is preferred in print, or even just common, that would be an argument for proposing them to Unicode, which I'd be happy to do. Are you familiar with the lit? kwami (talk) 20:58, 15 July 2023 (UTC)
@Kwamikagami I'm not sure, as I'm not all that familiar with Ryukyuan languages, but I think you're probably right: compare this entry for むとぅびㇲ゙(mutubz) (with a syllabic む), which is given as a synonym of ㇺとぅびㇲ゙(mtubz). To me, that strongly suggests the small ㇺ and ㇲ゙ are being used as substitutes for the hiragana forms. I suspect that Miyako follows the same rules as Japanese with the use of katakana and hiragana, but it would be useful to get evidence for that.
I have a feeling that only Ainu and Taiwanese kana have been used as sources for the consonantal small letters (both of which exclusively use katakana), so any use in Miyako (or the other Ryukyuan languages) is likely just coincidental. Theknightwho (talk) 21:36, 15 July 2023 (UTC)
Sandboxes in language categories
Latest comment: 1 year ago · 2 comments · 2 people in discussion
@J3133 Thanks - it's because they're tests for the new wikitext parser I've been building, and the most realistic way to test it is to trick Lua into thinking it's actually on the page in question (and not in a sandbox). I've disabled them for now. Theknightwho (talk) 17:34, 16 July 2023 (UTC)
Latest comment: 1 year ago · 4 comments · 3 people in discussion
I don't think you noticed, but with this edit you stepped in something. As you may be aware, {{descendants tree}} is a complex and temperamental kludge that never quite does what you think it does. By changing {{desc|fr|cow-boy|cowboy|bor=1}} to {{desctree|fr|cow-boy|cowboy|bor=1}}, you told the module that there were Descendants sections in both French cow-boy and French cowboy. In the old days, that would have caused a module error, but I got tired of entries ending up in CAT:E every time someone moved the Descendants section from one descendant to the other (especially since the cause is often not in the same place as the error, which means the perpetrator is unaware), so I got consensus to change it. Instead of throwing a module error it displays an error message only visible in preview while adding a maintenance category so someone familiar with the languages in question can fix it later. @Fytcha was kind enough to implement it for me. It's not a perfect solution, because the fact that it pretends to be {{desc}} means that those who do things like you just did have to be informed that it only looks like the template works that way. Personally, I think we need to rethink the way {{descendants tree}} works. It should be possible to optionally have some parameters act like the parameters in {{desc}}- that is, just displayed as descendants on their own without having the module scan the targets for further descendants.
Also, the module only recognizes alternative forms that use {{alter}} or {{alt}} when it displays alternative forms. In a few cases, such as Middle English descendants with a gazillion alternative forms, I've exploited this by using {{l}} for the alternative forms that I didn't want to be displayed by {{desctree}}, and {{alt}} for the rest. Really, though, we should have a better way to control what displays. Another problem is that having multiple alternative forms that each have their own descendants results in a list of alternative forms followed by a list of descendants without a clear way to show which descendant is from which alternative form.
Then there are situations where the main entry that has the alternative forms section isn't the one that has the Descendants section: a clear case of lexicography (which form is or was the most common and correct according to contemporary usage?) conflicting with etymology (which one is the actual ancestor of the descendant in question?). There's no way for {{descendants tree}} to choose between the two, even when the default is wrong for the purposes of the template.
Anyway, I thought I'd make you aware of this so there will be one more person thinking about it. I have Special:WhatLinksHere/Template:tracking/descendants_tree/desctree-no-descendants bookmarked, and I do my best to fix what shows up there. There are occasions, like currently with Old Javanese kamantryan, where I'm not sure what to do- so I just leave it alone and hope that someone will preview the entry or see that it's in a maintenance category and discover there's a problem. So far, I've never had one of these remain in the maintenance category for more than a week or two. Thanks for reading all the way through this, Chuck Entz (talk) 00:28, 17 July 2023 (UTC)
@Chuck Entz I didn't write the {{desctree}} code but I've done some hacking on it. If you can make a concrete proposal as to new params/options to add, I may be able to implement it. Benwing2 (talk) 04:27, 17 July 2023 (UTC)
Thanks @Chuck Entz - that's useful to know. I would like to do a bit of an overhaul of descendants at some point (e.g. doing them in a single template call). It would allow for standardising the layout, and would hopefully prevent situations like these. Theknightwho (talk) 04:35, 17 July 2023 (UTC)
Latest comment: 1 year ago · 4 comments · 2 people in discussion
Hi, I pinged you a few days ago in WT:Beer parlour/2023/April but didn't get a response. I am proposing renaming the qfa-sub-* codes to just sub-*, any objections? Also any suggestions you have about implementing script code aliases are welcome (see the BP entry for more info). Benwing2 (talk) 04:29, 17 July 2023 (UTC)
Hello again, I am starting to rename the anomalous script codes. This is significantly trickier than for the anomalous etymology lang codes, at least for 'polytonic' and 'Latinx', because they are used in so many places. I am following User:-sche's suggestion of five-letter codes with a capital initial letter; hence polytonic -> Polyt, Latinx -> Latnx, musical -> Music, Ruminumerals -> Rumin, IPAchar -> Ipach, while Morse and Semap stay as is. Let me know if you have any objections. I have set up tracking for use of the anomalous codes and added support for the new aliases in Module:scripts, but I haven't yet added the ability to use the new codes in places where script lookup happens (e.g. in per-script entries in the translit, entry_name, etc. fields), so those have to stay as-is until I implement that support. Once everything is converted I will rip out all the alias support to make Module:scripts and related modules leaner. Benwing2 (talk) 22:45, 17 July 2023 (UTC)
@Benwing2 That sounds good, as it'll make handling them more straightforward. I'm not fully convinced that we even have a need for a separate polytonic script anymore, since it's now widely supported, but that's a separate conversation. Theknightwho (talk) 22:52, 17 July 2023 (UTC)
Latest comment: 1 year ago · 6 comments · 2 people in discussion
Why is this necessary? I don't see any functions in that module, and it's loaded by all the language modules, which certainly makes it a memory suck. At least can you document in Module:languages/data why it needs to be required? Benwing2 (talk) 09:20, 17 July 2023 (UTC)
@Benwing2 It's Module:languages/data/patterns, and the reason is that the table has to be mutable, as some things are conditionally added to it. As it's required by multiple functions, it's cheaper to put it in a separate module that can be required when needed, as opposed to just dumping it in the main module (which would mean it got loaded every time no matter what).
The other thing to note is that all the mw.loadData calls are wrapped in a function called conditionalRequire, which uses mw.loadData unless a flag is specifically set to use require instead. This is used by Module:family tree, because mw.loadData is slow and really memory-inefficient for situations involving one massive invoke like category pages. We should probably update Module:category tree to use it, actually, as it might prevent the issue with certain language pages occasionally having memory errors. Theknightwho (talk) 09:26, 17 July 2023 (UTC)
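A schematic of the conditionalRequire wrapper as described (the real helper may differ in detail):

    -- mw.loadData by default, but plain require when the caller sets a flag -
    -- e.g. for one massive #invoke such as a category page, where mw.loadData
    -- is slower and more memory-hungry than require.
    local function conditionalRequire(module_name, use_require)
        if use_require then
            return require(module_name)
        end
        return mw.loadData(module_name)
    end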
OK, given that this is rather subtle, can you *PLEASE* *PLEASE* document this in Module:languages/data? This should take you all of two minutes: just copy what you wrote above and expand on it slightly. I shouldn't have to ping you every time I need to understand something non-obvious about code you've written. Benwing2 (talk) 09:37, 17 July 2023 (UTC)
@Benwing2 It's used in Module:languages. I'm wondering whether this is actually necessary, since the table needs to be cloned anyway to stop the table in package.loaded from having things added to it. I vaguely remember there was a massive unexplained slowdown that forced me to use require over mw.loadData (as in, pages would go from 2 to 9 seconds to load), but I have a feeling that was when the function was still using mw.ustring.gmatch, which has a bug like that. However, there's still a weird memory spike if I test it with conditionalRequire (or mw.loadData), which makes no sense to me. In any event, I'll document that it seems to be necessary to keep memory down. Hopefully this will all be immaterial soon, once the parser is ready.
The reason it's saved there is that it's pattern data for Module:languages, and I wanted it in a separate module to minimise overhead. It's got nothing to do with Module:languages/data, but the general convention seems to be that subpages with other names are for functions that have been hived off, while data always goes under a data subpage. I don't mind if you want to move it. Theknightwho (talk) 09:49, 17 July 2023 (UTC)
Thanks. Please do add this documentation. It's very important for maintainability purposes, to reduce the dependency on "tribal knowledge". As an example of what I'm looking for, see the comment at the top of Module:affix about the different types of hyphens and affixes. This stuff is key to understanding Module:affix, and it's not super obvious, so what I did is essentially a "brain dump" at the time I was rewriting the module, which is when all the information is freshest in your head. This entire comment took about an hour to write, which is peanuts time-wise compared to the amount of time you've spent working on the code itself. You don't have to write something so detailed, but a paragraph or so would be very nice explaining your learnings about keeping the memory down and what is going on with useRequire, conditionalRequire and such (none of which is currently documented). Benwing2 (talk) 22:51, 17 July 2023 (UTC)
I'm not convinced by that discussion at all. The statement that the Nazis "never recognized the concept of the white race" is simply incorrect, because they were acutely aware of the parallels (and differences) between themselves and the American South at the time. Although they were focused on "Aryans", we are under no obligation to use their terminology. Theknightwho (talk) 00:04, 19 July 2023 (UTC)
It's still more precise and accurate to describe Nazism as an ideology of Aryan supremacy. The Nazis also regarded other groups of white people as inferior, such as Slavs, hence the OP's reference to it being like calling Hutu power a black supremacist movement. 81.102.123.10409:17, 19 July 2023 (UTC)
Right, but many Americans didn’t perceive Irish or Italian people as white during the latter part of the 19th century, because racial categories are woolly, subject to cultural biases and prejudices etc. To use your example: it’s pretty common for neo-Nazis in Western Europe and North America to say Slavic people aren’t white, and it’s a pretty universal belief of white supremacists that Jews aren’t white, when obviously the real answer is “it depends on the person”. It’s hard to ignore that both of those beliefs come straight from the Nazis.
There’s also the fact that the meaning of the word “Caucasian” has shifted over time: today it’s usually used to refer to white people, but the Nazis used it in a wider sense that covered many people from Central and Southern Asia. They used “Aryan” to refer to a subgroup of Caucasians, and there had been a lot of debate as to whether eastern or southern Europeans should be counted as Aryan in the decades prior (e.g. look at this early map); it was political factors that ultimately decided the answers. To me, it seems like a specific form of white supremacy.
There’s also the fact that white supremacy is a movement, as well as an ideology, and it’s important to recognise the role the Nazis played in that movement. We can’t really say the same for “Aryan supremacy”, and there’s no reason we have to use their terminology either. Theknightwho (talk) 09:53, 19 July 2023 (UTC)
Latest comment: 1 year ago · 9 comments · 2 people in discussion
I think it's becoming clear that your discussion with Koavf is not coming to a resolution. If I may be so bold, may I ask that you refrain from replying to him for a week, to give both of you some time to cool off? I will make the same request on his talk page. I think it will be better for everyone if we take a step back from all the drama in the Beer Parlour and elsewhere for the time being and maybe come back to it in a different way later, if necessary. Thank you, and all the best. Andrew Sheedy (talk) 14:44, 20 July 2023 (UTC)
Yeah, I agree. Would you be so kind as to step in to ask him to stop? I cannot see how his actions are anything other than an egotistical power-trip at this point. Theknightwho (talk) 14:50, 20 July 2023 (UTC)
I'm happy to mediate between you, but I think as far as WF goes, you two are the only ones still discussing the issue actively. Although I disagree with him, Koavf is perfectly justified in his approach to dealing with WF, so I think it's best to let the vote determine the outcome. Given what Koavf has expressed, I think he will abide by the results of the vote. I hope that the vote can remain civil, without any direct arguing, because I think everything that needs to be said has been said already. Andrew Sheedy (talk) 14:57, 20 July 2023 (UTC)
@Andrew Sheedy Thanks. I think it would be a reasonable return to the status quo (and would heavily reduce the potential for conflict) if other admins were to agree to treat WF as they had been before (i.e. blocking on occasion), but for Koavf to refrain from doing so. After all, that is precisely what caused all of this fuss in the first place.
It is very clear that Koavf does not have consensus (either among the community or among the admin team) to keep acting in this way, so I think it's a reasonable compromise. Theknightwho (talk) 15:01, 20 July 2023 (UTC)
Perhaps it would be better, but I suspect it would be a bit hard to legislate. I think the reality is that while there may have been consensus before, there is no longer consensus. Koavf has a number of supporters and the Beer Parlour discussion seemed to indicate a wide range of opinions all over the map. So while there isn't a consensus supporting Koavf, I'm not convinced there is a consensus against him either. Anyway, let's wait to see the outcome of the vote before trying to find alternative solutions. Andrew Sheedy (talk) 15:58, 20 July 2023 (UTC)
@Andrew Sheedy I see only one person who supports his current approach, though, and quite a number of people have expressed serious concerns about his attitude (including you). Theknightwho (talk) 16:06, 20 July 2023 (UTC)
To be clear - I'm not trying to be defensive about this, but I'm just concerned about the community being run roughshod over. Theknightwho (talk) 16:11, 20 July 2023 (UTC)
Even if there is a community consensus, if the community consensus runs counter to the rules, there isn't really a good way to justify preventing someone from following the rules if they want to. I'm not a fan of Koavf's approach, but he's right in that if there is in fact a consensus against the rules, the best way of avoiding conflict is to just change the rules. I don't want to get caught up in yet another discussion about the issue at hand, so this will be my last reply. See you in the vote. Andrew Sheedy (talk) 16:18, 20 July 2023 (UTC)
𛄠
Latest comment: 1 year ago · 2 comments · 2 people in discussion
Hi Theknightwho,
I am the author of virtually all of the content on the pages 𛄠 and 𛀆. You recently reverted my edit on 𛄠 redirecting most of the usage notes to 𛀆. There appears to be some confusion on kana equivalency, especially when dealing with historical kana.
First of all, your claim that "They aren’t even fully applicable." is not correct. Everything in the usage notes of 𛀆 is applicable to 𛄠. The only difference between them is that one is hiragana and one is katakana, which I made sure to account for by leaving {{ja-Kana-usage notes}} in. Conversely, the information in {{ja-Kana-usage notes}} itself is not very applicable to 𛄠, as it explains the modern usage of katakana, whereas 𛄠 has been obsolete for a very long time. During the period when 𛄠 was used, katakana and hiragana were used interchangeably between documents. Full-length documents where not a single hiragana was present - only katakana - were a common thing, an example being the script of the Japanese WWII surrender broadcast on 15 August 1945. Katakana was actually considered more official than hiragana not a very long time ago, a situation that has reversed today. In fact, the sources I used (1891 仮名遣 and 1897 日本大文典. 第1編) when creating the page 𛄠 use it in ways that demonstrate the perfect applicability of the usage notes on 𛀆 to 𛄠. All the information I wrote pertaining to the evolution and scope of 𛀆 is entirely valid for 𛄠.
Furthermore, the information currently on 𛄠 is wrong. Quoting User:Eirikr,
There is no historical evidence that /ji/ existed in Japanese. There is circumstantial evidence for its existence prior to recorded history, but in the entirety of the Japanese written corpus, as best I understand it, there is no /ji/.
Consequently, the statement that "In modern Japanese, old /ji/ evolved into /i/..." is just plain wrong: there was no "old /ji/".
His complaint on my talk page is what led me to reform the page 𛀆 into what it is now. 𛄠 has not been changed since then, and it needs to be changed to remove the incorrect information. It is more unhelpful to leave the page as it is than to replace it with a link to the usage notes of 𛀆, just like what was done on 𛄡 for 𛀁. LittleWhole (talk) 21:59, 22 July 2023 (UTC)
Latest comment: 1 year ago · 17 comments · 2 people in discussion
First of all, I took a look at the "Disallowed nesting of Malayic translations" abuse filter. The regex involved is awfully complicated and could really do with a good explanation of how it works. It uses some syntax that I'm not even familiar with like ?(1). I assume this is a PHP regex; if PHP supports Perl-style (?x...) regexes, you can use it to insert comments into the regex itself to explain how it works.
I wrote a script to unindent languages under Malay (especially) and Javanese. In the process it sorts all the translations; indented and unrecognized lines are grouped with the nearest language above them and kept in order if there are multiple such lines together. Some issues that maybe you could help with:
1. Should the script sort all translation sections or only those where an indented line needs to be unindented? There are an awful lot of mis-sorted translation lines. (The script does not currently sort lines under {{checktrans-top}} because of junk that occasionally appears there.)
2. The script doesn't unindent recognized scripts listed in place of languages; these are Carakan, Roman, Jawi, Rumi, Arabic and Latin. Not sure if this is correct.
3. The script unindents the following languages, which include (almost) all the languages that appear under Malay and Indonesian: Acehnese, Ambonese Malay, Baba Malay, Balinese, Banda, Banjarese, Batavian, Buginese, Brunei, Brunei Malay, Ende, Indonesian, Jambi Malay, Javanese, Kelantan-Pattani Malay, Madurese, Makasar, Minangkabau, Nias, Sarawak Malay, Sarawakian, Sikule, Simeulue, Singkil, Sundanese and Terengganu Malay. Some of these aren't recognized Wiktionary languages, so I don't know if they should be unindented and sorted among other languages. Some should probably be renamed in the process (e.g. Brunei -> Brunei Malay). Note that neither Sarawak Malay nor Sarawakian (presumably the same thing) is found as a Wiktionary language, nor is Batavian.
4. The script doesn't unindent any other language variants under Javanese, which include Central Javanese, Western Javanese, Kaili, Krama, Ngoko and Old Javanese. Not sure if this is correct.
5. When it comes to sorting, the script currently converts to NFD format before sorting by codepoint, so that e.g. Yámana comes before Yiddish, Mòcheno comes between Montagnais and Muong, Tày comes between Tatar and Telugu, Réunion Creole French comes before Romani, Māori comes between Maltese and Meru, Meänkieli comes before Megleno-Romanian, etc. Not sure if this is correct; the current sorting is inconsistent but often sorts these using NFC codepoint sorting.
6. By the same token, Oki-No-Erabu ends up before Okinawan, when it currently seems to be sorted after. Maybe I should ignore hyphens when sorting (and apostrophes? Cf. S'gaw Karen, O'odham, 'Are'are).
7. Potentially the script could normalize lines beginning with ** (especially common with Chinese variants) to instead begin with *:. Potentially it could also correct issues like * Assyrian Neo-Aramaic {{t|aii|ܩܵܛܵܪ|tr=qaṭar}} (missing colon). Neither is currently done.
OK, I made it fix both issues under (7), as well as replace Unicode U+00A0 with a regular space, replace uses of {{ttbc}} and remove blank lines. Also, the script could potentially fix places that put Ancient Greek or Modern Greek under a Greek header, if there's consensus to do that. Benwing2 (talk) 04:17, 23 July 2023 (UTC)
@Benwing2 Hi Ben - I'll get back to you properly once I've had some rest, but just to explain ?(1): it's a conditional based on whether the first capture group matched - e.g. in PCRE, (?:(foo)|bar)(?(1)X|Y) matches X only if group 1 (foo) participated in the match, and Y otherwise. In this regex, capture groups 1 and 2 are inside different sections of an OR non-capture group towards the beginning. What it's basically doing is saying "if the language is the one in capture group 1, then match…". It's a way of carving out exceptions for allowed indentations in certain languages, while keeping it specific to those languages, which it does by using negative lookaheads if the condition is met. Theknightwho (talk) 04:25, 23 July 2023 (UTC)
OK, thanks. I wonder if things need to be this complex; my checks for indented langs are pretty simple. I notice you only check for a few langs that could be indented when there are many more, as mentioned above (or maybe you check for all but those few named langs? I haven't yet figured out the regex). BTW my script is ready to go, pending resolution of the issues above. It would change 3,716 pages, of which 148 have indented Malayic translations (often several per page), and it issues 594 warnings, mostly for unparsable lines in translation sections (most of those are genuine, caused typically by a single language spilling its translations onto multiple lines, sometimes (as at would) by a sentence being translated into multiple languages, and occasionally by HTML comments inserted into the table giving hints as to what types of entries are acceptable). Benwing2 (talk) 06:39, 23 July 2023 (UTC)
@Benwing2 It’s the other way around - it checks to make sure nothing is nested under 4 languages, except the handful of script exceptions.
I’ll do some comments for it, which should make it clearer what’s going on, as the overall explanation of what it does is actually not that complicated. Theknightwho (talk) 11:14, 23 July 2023 (UTC)
Thanks. Any comments on the above issues? If not I'll go ahead and run the script. BTW User:Benwing2/unhandled-indented-malayic contains warnings (only 17) related to the indented lines that couldn't be unindented. I'm not sure how to handle them, you might want to take a look. Also you should probably make script exceptions for Arabic and Latin, which show up occasionally (unless you think they ought to have different names). Benwing2 (talk) 19:03, 23 July 2023 (UTC)
In theory it should sort all the translation sections, but I was mainly concerned with making sure there weren't any incorrect nestings.
"Arabic" corresponds to Jawi and "Latin" & "Roman" correspond to Rumi. For the sake of consistency, I went with the ones which were most common (which are Jawi and Rumi), so it's probably best to go with that.
In terms of nesting, I think the best approach would be to separate out any Malayic languages which we treat as full languages (e.g. list Brunei Malay under "B"), but if there are any that are not Wiktionary languages then it'd be best to flag them up for manual review. Another one to look out for is Indonesian, where some editors have just grouped any languages spoken in Indonesia under "Indonesian" as a heading, such as Javanese and Sundanese. I've sorted some of them out manually, but I'm sure there are more. We're also a bit inconsistent with our treatment of "Old X" and "Middle X" languages in general, so it'd be best to see what the general trend is before committing to indenting Old Malay under Malay.
Is it possible to use the UCA for sorting? That would bypass any of the NFD/NFC concerns. If you need a Lua implementation, Module:User:Theknightwho/sort has a function (sort) which takes a table of terms as an argument and outputs a UCA-sorted table. I haven't implemented variable weighting yet (which applies to -), so you may have to manually disregard it. My inclination is to use English sorting rules in that regard, so ignore the apostrophe and dash (or, more accurately, only give them a quaternary weighting so that they're a last-resort tie-breaker, which won't affect anything here).
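Assumed usage, going only by the signature described above (a table of terms in, a UCA-sorted table out):

    local uca_sort = require("Module:User:Theknightwho/sort").sort
    local sorted = uca_sort({ "Oki-No-Erabu", "Okinawan", "Māori", "Maltese", "Meru" })
    -- Hyphens and apostrophes would still need manual handling until variable
    -- weighting is implemented.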
Thanks. The script is simply leaving alone any indented languages it doesn't recognize, and unindenting/sorting the remainder, whether found under Malay, Javanese or Indonesian (there are none indented under Sundanese). I'll have it normalize Arabic -> Jawi and Latin/Roman -> Rumi. As for UCA, I'd need a Python implementation since that's what the script is written in. There seems to be one here: However, I'm a bit leery of doing this because the sorting should ideally match what is done by the translation adder gadget. Potentially the gadget could call into your Lua implementation, but we'd need to make it production-ready and put it in the production space, which I'm wary of doing until we have a clear consensus to do this. Benwing2 (talk) 20:11, 23 July 2023 (UTC)
@Benwing2 Great - thanks. That makes sense. In terms of sorting, the UCA can take what are called "tailorings", which are contextual modifications. These are often for language-specific needs (e.g. Azerbaijani "Q" goes after "K"), or more general ones (e.g. whether to restrict punctuation to quaternary weights because we don't really care about it). If we did implement this for the translation-adder, it would be a good idea to make any tailorings easily accessible to bots using other languages, which we could probably do with JSON. Theknightwho (talk) 20:18, 23 July 2023 (UTC)
OK, I discovered that I have an existing script to sort language sections that uses the following code, so I am sorting language translations the same way:
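The code itself did not survive archiving here. Going by the description above (convert to NFD, then sort by codepoint), the logic presumably resembles the following Lua rendering (the actual script is Python):

    -- UTF-8 byte order matches codepoint order, so a plain string comparison
    -- on the NFD forms gives NFD codepoint sorting.
    local function sort_language_names(names)
        table.sort(names, function (a, b)
            return mw.ustring.toNFD(a) < mw.ustring.toNFD(b)
        end)
        return names
    end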
@Benwing2 Check the notes in the abuse filter to see an explanation of how the regex works. I don't want to explain it publicly, as there's a reasonable chance some users may try to subvert it if they know how it works (e.g. we already had one problem IP from Malaysia trip the filter, who then decided to nest Indonesian under a random language instead, so it wasn't just an innocent misunderstanding). Note where they placed Malagasy, too. I blocked them for a week (as they've been doing this kind of thing for a while), but they'll no doubt be back. Theknightwho (talk) 22:11, 23 July 2023 (UTC)
OK, thank you! I took a look and it looks good. I am running my script now; it is still tripping the abuse filter on some terms because it doesn't fix everything, for reasons I've already mentioned. In some cases the page actually looks OK, e.g. many has two scripts (Carakan and Rumi) under Javanese; maybe the filter needs updating. Benwing2 (talk) 22:17, 23 July 2023 (UTC)
Hmm, I'm not sure. Currently the Roman -> Rumi change is not sensitive to Javanese vs. Malay, but I can fix this if you prefer. Benwing2 (talk) 22:30, 23 July 2023 (UTC)
So if you look in Wikipedia, Javanese is said to use three scripts: "Javanese" (Carakan), "Latin" and "Pegon" (Arabic). Malay similarly has "Latin", "Arabic"/"Jawi", "Arabic"/"Pegon", "Thai", "Malay Braille" and some others used historically. This makes me wonder if we shouldn't use Latin in place of Rumi. Benwing2 (talk) 22:34, 23 July 2023 (UTC)
OK, I think Rumi is a misnomer for Javanese and Indonesian, which should use "Latin", while either Latin or Rumi is possible for Malay. Benwing2 (talk) 22:38, 23 July 2023 (UTC)
Latest comment: 1 year ago · 6 comments · 2 people in discussion
Hello, you mentioned that the sortkey was unnecessary here. Without the sortkey, the two words pronounced mendori, 雌鳥 and 雌鶏, are sorted in the category under と and め respectively. I have two questions: how is 雌鶏 sorted under め in the other category, and how can the two entries be sorted as they should be? Mcph2 (talk) 11:59, 24 July 2023 (UTC)
@Mcph2 Hi - are you sure that's right? If I test the automatic sortkeys, {{sortkey|ja|雌鳥}} gives めんとり' and {{sortkey|ja|雌鶏}} also gives めんとり'.
That being said, some Japanese modules use other methods to sort terms (e.g. the katakana categories use katakana, and the kanji ones use radical sort). There might be something like that going on here, as they should both be sorting the same. Theknightwho (talk) 12:07, 24 July 2023 (UTC)
@Theknightwho: What I’m referring to is that the category shows the following:
@Mcph2 This seems to be something unique to {{pre}}, where it sorts "X terms prefixed with Y" categories by ignoring the prefix, presumably because the assumption is that it's unnecessary to sort with the prefix. I actually think this method of sorting isn't a great idea, because it ignores situations like Japanese where sorting isn't necessarily based on the orthography, as well as situations where there might be two or more variants of a prefix (e.g. English anthrop- and anthropo-) where we want to use the same category for both, and ignoring the prefix just leads to confusion. Theknightwho (talk) 15:01, 24 July 2023 (UTC)
@Theknightwho: Thanks for investigating. I’ll just add the sort= parameter.
Latest comment: 1 year ago · 2 comments · 2 people in discussion
Hi, I know you've been dealing with sortkeys lately. There are some entries in which some editors added |sort= when categorising with raw category links rather than templates, which of course sorts the page at "s" rather than under the intended key (everything after the pipe in a raw category link is taken literally as the sortkey, so it becomes a sortkey starting with "sort="). See this search for example, and it appears that Japanese is the main culprit. I've been removing them when I was dealing with the out-of-memory pages. There might still be a few of them not shown in this search, and it might also be a good idea to abuse-filter them? – Wpi (talk) 16:07, 24 July 2023 (UTC)
Latest comment: 1 year ago · 4 comments · 2 people in discussion
Memory is increasing. Is it possibly related to your change to the Han sort keys? Several pages are hitting or getting close to hitting the memory limit that didn't use to be anywhere close. Benwing2 (talk) 03:26, 26 July 2023 (UTC)
@Benwing2 I think it's unlikely to be related to that: the changes were minor, and strings are stored centrally in Lua memory (meaning they don't get duplicated for each invoke), which is why serialisation is an effective memory-saver in the first place. Theknightwho (talk) 03:33, 26 July 2023 (UTC)
Hmm. Can you explain what you changed in the latest day or so related to core functionality? What were these minor changes?
Also, I am trying to move {{quote-meta}} into Lua but it seems to be drastically increasing memory. Do you mind taking a look at the quote_t function in Module:quote, which is an approximate port of {{quote-meta}}, and comparing it to {{quote-meta}}, and letting me know if you have any idea why memory is increasing so much when I use quote_t? You can see in Template:quote-book the switch back and forth between using {{quote-meta}} and quote_t; using quote_t causes at least 7 or 8 pages, e.g. ban and min, to increase their memory by 2MB or more. In ban, the single call to {{RQ:Shakespeare Hamlet}} uses around 4MB of memory when quote_t is in place, and at most 2 MB otherwise. What confuses me is that the old Template:quote-book code also calls into Module:quote (the source_t function) to do most of the work, and all that quote_t and {{quote-meta}} do is paste together the source line constructed using source_t with the usex lines constructed using Module:usex. Both methods do essentially the same thing and I'm not adding any new parser frames. Benwing2 (talk) 05:56, 26 July 2023 (UTC)
@Benwing2 I’ve only really been working on the parser, so other than Module:Hani-sortkey 2 days ago I’ve not changed anything likely to affect a lot of pages.
I’ll have a look at Module:quote, as that does seem weird. I suspect it’s the random effects of Scribunto memory issues again, and we just don’t notice the pages where memory’s fallen by 2MB. I’m starting to think there’s something seriously wrong with Scribunto’s design which isn’t merely down to the Lua 5.1 garbage-collection issue, because I never see any wild fluctuations like this when everything’s contained in a single invoke. Theknightwho (talk) 06:59, 26 July 2023 (UTC)
@Benwing2 Hiya - I’ve been using it with no issues today, but I think there’s a problem with the reply gadget itself as I’ve had that problem before at random intervals.
To be honest, the savings aren’t that significant, as the only way to get the edit links to work was to preprocess the entire pair of pages together, which has knocked out most of the savings we were seeing before. It’s the exact same issue with edit links that I raised a few months ago when discussing the idea of page parsing, and unfortunately there’s no native workaround.
This is also an issue that we’re probably going to have to deal with once the Wikitext parser is ready, to allow us to parse pages in one invoke. I did read something somewhere about generating edit links manually with JavaScript, which sounds promising. Theknightwho (talk) 21:46, 30 July 2023 (UTC)
Latest comment: 1 year ago · 3 comments · 2 people in discussion
Hi. I was looking at embedded_language_links in Module:links and I noticed that I don't see any of what I call "common-case optimizations". For example, whenever I implement inline modifiers, I make sure to check for < before loading Module:parse utilities and doing the full inline modifier processing. Similarly in my new code for Module:quote, which allows for all sorts of parameters to be in foreign scripts and potentially have inline modifiers (including page numbers, since they may use Eastern Arabic, Thai, etc. numerals), I have various checks to avoid most of the work in the common cases, e.g. no script checking if the text is all-ASCII, no inline modifier checking unless < is present, no comma-splitting/multi-arg processing (when such a thing is possible) unless a comma is present, etc. I think adding some of these could reduce memory significantly. Although it's true that there is a lot of statistical variance in Scribunto's memory implementation, the fact remains that we have 46 entries in CAT:E even after a ton of {{multitrans}}, {{*-lite}} templates, etc., when formerly there were no lite templates and only a few over-memory items, and this is likely to be due more to all the changes you made to the core modules than to additions to the pages themselves. Benwing2 (talk) 19:18, 31 July 2023 (UTC)
@Benwing2 Yes, I agree with you (and incidentally I’ve been doing a lot of work on that today, in an effort to get the parser timings down).
One thing that I’ve found is that sometimes these optimisations can be counter-productive, because anything that fails the tests will always be doing more work. For example, there’s not much point in avoiding loading things that use mw.loadData, because it only takes one failure on the entire page for the benefit to be cancelled out - and that’s extremely likely on a very large page. On the other hand, anything that avoids string functions is more likely to be beneficial. Theknightwho (talk) 04:21, 1 August 2023 (UTC)
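The kind of fast path under discussion might look like this (a minimal sketch; the module and function names are illustrative, not the exact live code):

local function process_inline_modifiers(text)
    -- Common case: no "<" means no inline modifiers, so skip loading
    -- the parsing module altogether.
    if not string.find(text, "<", 1, true) then
        return text
    end
    local put = require("Module:parse utilities")
    return put.parse_inline_modifiers(text)
end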
Yes, I think there's no need to do this with loadData() but avoiding require()ing a large module can definitely be helpful, likewise avoiding string manipulation as you mention. Benwing2 (talk) 18:25, 1 August 2023 (UTC)
Latest comment: 1 year ago · 13 comments · 3 people in discussion
Can you take a look? When embedded_language_links() is called on certain pieces of text, an error results deep in some of the code you added. The text appears to be message-ID <AUTISM%[email protected]> (NOTE, it displays with less-than/greater-than signs but the actual wikitext contains HTML entities; see the source code of this message). Benwing2 (talk) 04:22, 1 August 2023 (UTC)
Will do. Those functions need a total rewrite anyway. I’ve been considering doing it by tokenising the strings, which allows us to do arbitrary lookaheads (among other regex-like things that we can’t do with ordinary patterns). Judicious use of repeat loops can make this extremely fast (e.g. the tokenisation of the page a now takes less than 0.2 seconds, down from 3 seconds a month ago). Theknightwho (talk) 05:03, 1 August 2023 (UTC)
@Benwing2 I just did a test with ZeroBrane (a Lua IDE), comparing string.find's character-set matching with an iterator that checks a table lookup. The table lookup is over 100 times faster: it completes 1 billion iterations in under 2.5 seconds, versus 10 million in 2.9 seconds. I have a feeling we could take advantage of things like this if we developed our own library similar to Python's regex library, where regex objects get compiled in advance. Theknightwho (talk) 20:40, 3 August 2023 (UTC)
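A rough reconstruction of that comparison, for anyone wanting to reproduce it (stock Lua; the character set is illustrative and exact timings will vary by machine):

local find = string.find

-- Precomputed lookup table standing in for the character set.
local is_target = { ["a"] = true, ["e"] = true, ["i"] = true, ["o"] = true, ["u"] = true }

local function bench_find(ch, n)
    for _ = 1, n do
        local matched = find(ch, "[aeiou]") ~= nil
    end
end

local function bench_lookup(ch, n)
    for _ = 1, n do
        local matched = is_target[ch] == true
    end
end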
Sounds like a project for a rainy day :) ... BTW were you able to figure out the bug on xyr etc.? There are still several pages in CAT:E due to this. Benwing2 (talk) 00:17, 4 August 2023 (UTC)
@Benwing2 I’ve got a hunch that it’s going to be a pain to solve, so I’ve been procrastinating on it, but I’ll make some time tomorrow to take a look. If it’s due to a design flaw it might be worth taking the opportunity to do a rewrite tbh. Theknightwho (talk) 02:18, 4 August 2023 (UTC)
Sorry to bug you but can you take a look at this? The errors are still present in CAT:E, and it looks like the fix shouldn't be hard as we know what is triggering the error. Benwing2 (talk) 04:55, 7 August 2023 (UTC)
@Benwing2 @Chuck Entz It's an issue with mw.uri.decode, which treats %93 as a percent-encoding (for byte 147, in this case, which made the text invalid UTF-8). We should only be applying mw.uri.decode to text in links, which will require a proper overhaul of the code. As a temporary solution, I've set Module:links to escape any percent signs in non-linked text, which cancels out the issue.
This will be much easier once the parser is finished, as we can simply use that to analyse anything put into the link templates, which should greatly simplify a lot of the code in Module:links. Theknightwho (talk) 21:34, 7 August 2023 (UTC)
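As a sketch, the temporary workaround amounts to something like this (simplified; the real code in Module:links deals with more cases):

local function decode_text(text, is_link)
    if is_link then
        -- Percent-encodings should only be honoured inside links.
        return mw.uri.decode(text, "PATH")
    end
    -- Elsewhere, escape "%" first so that sequences like "%93" pass
    -- through the decode unchanged.
    return mw.uri.decode((text:gsub("%%", "%%25")), "PATH")
end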
Latest comment: 1 year ago · 13 comments · 3 people in discussion
Hey Theknightwho.
I disagree with the obsolete kana being romanized as if they were in the a-row. They have been inconsistently romanized one way or another, but I believe the w-row romanization is more accurate. The three extinct/reconstructed kana are not exceptional in their theoretical Hepburn romanization.
I see obsolete kana on lines 27 (the や行), 28 (an additional one-off for an alternative ye kana), 32 (the わ行), and 33 (the small-ゎ行). None of these are on the あ行...?
There are additional mystery glyphs on lines 75 and 82, but I don't have the right font installed to see what these are, and I can't tell from context what they're supposed to be. I also cannot select these for further inspection, as apparently they're some kind of combining glyph -- attempting to select them always includes the preceding apostrophe.
@Eirikr They're referring to ゐ(i), ゑ(e) and を(o), which they want to change to "wi", "we" and "wo" respectively. I've not got a strong opinion, but I was under the impression that the current approach is what has consensus. Theknightwho (talk) 16:47, 10 August 2023 (UTC)
Aha, thank you!
Hmm, that's a tricky one. It really depends on context, in a way where I think the only reasonable approach is what we currently have -- automatically output the straight vowel values, and editors can override as needed to specify the /w-/ glides.
... Thinking it through some more, I can say that automatically outputting the glide values /wi, we, wo/ would definitely be a mistake.
In pretty much all cases I can think of, the ゐ and ゑ kana are used as stylistic spellings that don't affect the pronunciation. The exception is the brand name for Yebisu Beer, which is spelled with the ヱ katakana. This is technically the we kana, but due to historical sound shifts, the pronunciation was actually ye during the Edo period, collapsing to just e within the past few centuries (at least, in mainstream Japanese; I'm unsure of possible dialectal use). And even though it's spelled ヱビス and romanized as Yebisu, nowadays it's pronounced as Ebisu.
And を is nearly always pronounced as straight /o/, with basically the only exception being hyper-careful speech where the speaker is deliberately chewing the scenery to over-pronounce the particle.
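In module terms, what's being described is just a default mapping plus an opt-in override, along these lines (table contents illustrative, not the live Japanese transliteration data):

local kana_default    = { ["ゐ"] = "i",  ["ゑ"] = "e",  ["を"] = "o" }
local kana_historical = { ["ゐ"] = "wi", ["ゑ"] = "we", ["を"] = "wo" }

local function romanize_kana(ch, historical)
    -- Output the plain vowel by default; editors opt into the /w-/
    -- glide for terms where it was actually pronounced.
    return historical and kana_historical[ch] or kana_default[ch]
end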
@Eirikr I actually encountered this yesterday, when adding ウヰスキー(uisukī) and ヰスキー(isukī) (dated forms of ウイスキー(uisukī); the latter especially so). I've left it as the default, as I was unsure what the best approach to take was. Theknightwho (talk) 17:06, 10 August 2023 (UTC)
Hmm, hmm, ya. When we're recording a historical term where a /w/ glide was actually pronounced, as in these two entries, we probably want to override the default behavior to add in that ⟨w⟩. ‑‑ Eiríkr Útlendi │Tala við mig 17:21, 10 August 2023 (UTC)
(Sorry for the repeated posts)
Also, kana usage is changing, as is the "default" Japanese sound system. We might discover that these kana are seeing increased use for modern terms, in which case we may want to revisit our handling. ‑‑ Eiríkr Útlendi │Tala við mig 17:22, 10 August 2023 (UTC)
@Eirikr I think it might be worth changing ゐ(i) to "wi", since it doesn't have the particle issue like を(o), and I find it extremely unlikely that it developed analogously to ゑ(e), given that "yi" is nonstandard. Although it's usually pronounced "i", the kana is comparatively rare, and "wi" would be more etymologically accurate. Theknightwho (talk) 17:36, 10 August 2023 (UTC)
I'm okay with that. Dunno about others, and I'm open to the idea that I'm missing something or otherwise off-base -- might be good to canvass others' opinions, perhaps at the WT:BP? ‑‑ Eiríkr Útlendi │Tala við mig 18:44, 10 August 2023 (UTC)
My opinion is to stick with "wi" and "we", which are the ones I've changed.
The "o" pronunciation for "wo" is just a specific usage of the kana warranting the need of another pronunciation (for historical reasons), the others are "he" pronounced as "e", and "ha" pronounced as "wa".
Basically, just go with "wi" and "we". It is ambiguous to change them to "i" and "e", even though the kana were made obsolete because they are pronounced like the a-row kana in modern Japanese.
Additionally, I saw a few artifacts regarding syllables similar to ヰャ. It should be "wya", not "ya". Should we state that "ゐ" should be "wi", and "ゑ" as "we"? It is honestly getting confusing. --MULLIGANACEOUS-- (talk) 04:00, 16 August 2023 (UTC)
Thai language templates
Latest comment: 1 year ago · 2 comments · 2 people in discussion
Why do you replace the existing Thai language templates with nonspecific templates? Just want to ask about the reasons.
Wouldn't it be easier to directly modify the Thai language templates instead of replacing them with nonspecific templates? New Thai entries are created every day, and the Thai language templates will surely be used, and you will have to replace them again and again without end (not to mention that your replacements will likely be reverted).
@Miwako Sato Two reasons: they don’t do anything special except add labels that duplicate the heading, and they present a maintenance headache for those of us that maintain the core modules. The ultimate aim should be to delete the Thai link and list templates (but not the pronunciation one or headwords etc, as they clearly do need to exist). Also, please don’t revert these changes. It would be totally pointless.
The reason they exist is because they pre-date a time when the generic modules could handle Thai properly, but that’s no longer the case. They simply don’t have any reason to exist anymore. Plus, they use a slightly different syntax that’s now unnecessary, and there’s no good reason to have inconsistencies like that.
They were also added by Wyang, who had lots of good ideas, but wasn’t very good at coding and was absolutely terrible at integrating their modules into the core modules, which means the ones they wrote tend to end up stagnating in an unmaintained state, where they don’t get to take advantage of any updates to the main modules. That’s a problem for users who aren’t expecting that, and it creates a lot of extra work for people who do module maintenance. I’ve had a lot of problems with Wyang’s Chinese modules, but the Thai ones are thankfully not as bad as that.
Latest comment: 1 year ago · 4 comments · 2 people in discussion
Hey Theknightwho,
It has been almost two weeks and this issue of "ゐ" being "i" hasn't been resolved. I noticed that "ヰャ" has been romanized as "ya" though it should really be romanized as "wya".
It is not consistent that its small counterparts are romanized as usual.
Did you have a chance to take a look? Nothing has changed recently in the affix template code and the pages in question haven't changed in a long time, either. Benwing2 (talk) 02:25, 22 August 2023 (UTC)
@Benwing2 Not yet, but it’s a priority. I recently standardised the names of most pages with unsupported titles as a way to simplify the code that deals with them, which shortened the list of hard-coded ones to 6, down from 123. However, I don’t think that’s the cause, as there’s a separate bug where the affix template fails to remove the fragment, and it happens regardless of the target. For example: {{af|en|fun|-y#English}} wrongly gives fun + -y, whereas {{l|en|-y#English}} correctly gives -y. Theknightwho (talk) 14:13, 22 August 2023 (UTC)
@Benwing2 This is not a particularly simple fix, unfortunately. The issue is down to Module:affix treating the fragment as part of the term when generating the display_term. At Module:affix#L-743, part.alt is set equal to display_term if no alt text has been specified, and this is then put into the link template. The link template will only remove the fragment if no alt text has been specified, which is why it's leaving it in place in the display. The categorisation is handled wholly within Module:affix, but it essentially boils down to the same issue: the fragment in display_term is treated as part of the term.
While we could patch this by handling the fragment specially within Module:affix, that's not ideal because (a) it's duplication of work, and (b) it makes the handling of fragments more fragile, because any changes will need to be made in both places.
Thank you for looking into this. It sounds to me like we need a way of calling into Module:links to remove the fragment before processing categories and such. I think the cleanest way of doing this is to have an additional entry point in Module:links that parses off the fragment and returns it, and then allow the fragment to be specified as an additional field in the data structure sent to full_link(). This is a bit similar to how you can specify alt text either in the link itself using a pipe or through a separate field. If you can add the entry point in Module:links, I can fix up Module:affix to use it. Benwing2 (talk) 05:40, 24 August 2023 (UTC)
@Benwing2 I've separated it out as get_fragment, which can be called externally. You can specify the fragment in the input data by using the fragment key. It's not ideal, but it'll do for now. Theknightwho (talk) 08:02, 24 August 2023 (UTC)
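In outline, the new entry point does something like this (simplified; the real get_fragment in Module:links handles more edge cases):

-- Split "term#fragment" into the bare term and its fragment.
local function get_fragment(text)
    local term, fragment = text:match("^(.-)#(.*)$")
    if term then
        return term, fragment
    end
    return text, nil
end

-- Callers can then pass the fragment separately, e.g.:
--   local term, fragment = get_fragment("-y#English")  --> "-y", "English"
--   full_link{ lang = lang, term = term, fragment = fragment }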
memory errors once again
Latest comment: 1 year ago · 4 comments · 2 people in discussion
I reverted your addition of a bunch of combining character stuff to Module:utilities/data because it was causing ~ 25 more memory errors. This isn't the first or second (or third) time this has happened; can you *please* try to avoid this happening in the future? I don't understand what the purpose of the combining char table is, and why it needs to be in Module:utilities/data; you haven't documented it anywhere and it seems unnecessary. In general you seem very cavalier about memory errors, but they cause a lot of problems for users. Benwing2 (talk) 03:31, 23 August 2023 (UTC)
@Benwing2 Thanks, yeah, I'll change that. It'd be good to get these converted en masse, but the categories should make keeping an eye on them straightforward. Theknightwho (talk) 19:43, 25 August 2023 (UTC)
bug in explicit fragment handling?
Latest comment: 1 year ago · 8 comments · 2 people in discussion
full_link() does not appear to be respecting a fragment passed in. I think the problem is around line 812, which isn't setting the fragment in the split parts of a link. Benwing2 (talk) 02:50, 26 August 2023 (UTC)
Did you have a chance to look at this? It is affecting all terms that use fragments, in addition to the other bug mentioned in the Grease Pit related to parsing fragments. Benwing2 (talk) 17:48, 27 August 2023 (UTC)
@Benwing2 Hiya - this is done. I tried to look into what was causing the other bug with compounds (e.g. Percolozoa + -an). It seems to occur when there's a fragment nested inside a raw link, which is why it always fails with nested templates (i.e. {{af|en|{{l|mul|Percolozoa}}|-an}} is equivalent to {{af|en|[[Percolozoa#Translingual|Percolozoa]]|-an}}). This seems to have been introduced with your recent changes to Module:affix, so it's possibly better for you to have a look at it. Theknightwho (talk) 16:49, 29 August 2023 (UTC)
Hi. Thanks for fixing the first bug. The second bug is definitely a bug in get_fragment() in Module:links. Module:affix simply calls get_fragment() on each term and it's clearly looking for # in all circumstances instead of ignoring fragments inside of links, which it should do. Benwing2 (talk) 18:16, 29 August 2023 (UTC)
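One way to make it skip fragments inside raw links would be to mask the link spans before searching (a sketch of the idea, not the eventual fix):

local function get_fragment_outside_links(text)
    -- Mask "#" inside [[...]]; "\1" is the same byte length as "#", so
    -- positions in the masked copy line up with the original string.
    local masked = text:gsub("%[%[.-%]%]", function(link)
        return (link:gsub("#", "\1"))
    end)
    local pos = masked:find("#", 1, true)
    if not pos then
        return text, nil
    end
    return text:sub(1, pos - 1), text:sub(pos + 1)
end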
Latest comment: 1 year ago · 4 comments · 2 people in discussion
Can you help me understand why {{ja-usex}} exists? I would like to add |origtext= as a parameter similarly to what I did with {{usex}}/{{quote}}/{{quote-*}} but I don't really understand why there's a separate Japanese version at all, much less one that doesn't even use Module:usex as its underpinnings. Benwing2 (talk) 20:00, 27 August 2023 (UTC)
@Benwing2 I think it's plugged into Japanese transliteration (and possibly other Japanese-specific things), but I don't really know. We'll probably need to add Japanese transliteration to the main modules before we can make it redundant. Theknightwho (talk) 20:39, 27 August 2023 (UTC)
@Benwing2 There's also Module:ryu-usex, which is a shitty fork from a few years ago. I've been slowly deprecating/deleting the Okinawan forks as the process of converting the Japanese modules into Japonic modules advances, but haven't tackled that one yet. Theknightwho (talk) 00:20, 28 August 2023 (UTC)
Yes, so much shitty CJK code. There's also Module:Quotations, which is somewhat questionable and doesn't use Module:usex. The least we can do is use Module:usex for formatting quotations/usage examples everywhere; it should have all the necessary functionality now, and if not we can add what's needed. Benwing2 (talk) 00:39, 28 August 2023 (UTC)
Just to save you some time:
Latest comment: 1 year ago · 5 comments · 2 people in discussion
@Koavf I warned you that I would block you if you continued to be disruptive, and you decided to be disruptive anyway, which went over the line when you decided to derail the discussion by accusing @Vininn126 of personal attacks. He was absolutely correct that you'd changed the subject away from your misbehaviour as an admin, which you conveniently ignored, and it came off as textbook evasion. Given your relentless habit of sealioning in every discussion, I don't think there's much point in me trying to convince you it was justified. I will not be responding further. Theknightwho (talk) 23:21, 28 August 2023 (UTC)
@Koavf Alright, given you clearly do want a response: the next time you tell someone they've ignored your question in response to them answering it directly, I will block you again for a month. This is exactly the kind of disruptive rubbish that amounts to sealioning, because you consistently refuse to engage in a reasonable way if people don't agree with you. Theknightwho (talk) 23:58, 28 August 2023 (UTC)
lol, I didn't do that. I asked which diffs justify me being banned and you responded with a diff of you writing that you'll ban me. The question was: which diffs of edits I made justify me being banned. You're a funny guy. Looking forward to your answers to the questions I asked on the Beer Parlour. —Justin (koavf)❤T☮C☺M☯ 23:59, 28 August 2023 (UTC)
Japanese memory errors
Latest comment: 1 year ago · 2 comments · 2 people in discussion
There were around 39 yesterday in CAT:E and now there are 49. The additions all appear to be single CJK chars. Could it be due to this? Benwing2 (talk) 23:50, 29 August 2023 (UTC)
@Benwing2 It’s more complex than that and unlikely to be down to just that change, because the Japanese modules were using a hodge-podge of functions with overlapping purposes, and I’ve been unifying them. Rolling it back in isolation will break some templates that rely on {{xlit}}, so I’ll see what I can do to reduce the load. Theknightwho (talk) 00:34, 30 August 2023 (UTC)
Need help on module upgrade on mswikt
Latest comment: 1 year ago · 3 comments · 2 people in discussion
Hi there! I'm Peace, and I'm currently a moderator of mswikt, right now upgrading our really out-of-date modules to match the ones here in enwikt. I have some issues with Module:category tree (especially the poscatboiler part) that I can't seem to solve. One thing that bugs me is this error here; I think the topic cat part of the category tree is fine. Can you help me identify what the issue is? I assume it's either an issue with poscatboiler/data/languages or with Module:families (and/or its sister modules). Thanks!
Take your time! Anyway, I updated some of the modules (plus added some missing ones), so the errors are a bit different now. Here are some issues I found, among others:
Regarding template:inherited, it seems that the template is unable to pick up the proto-language codes inserted (things are okay with sister templates like template:bor, i.e. with non-reconstructed language codes), but I checked the language data modules and they seem fine.
I found out that poscatboiler seems unable to pick up {{{langcat}}}.
May I know where in the modules the code is that sets up the "lang + label" format of poscatboiler categories? I'd like to adapt it to our format at mswikt, but I can't seem to find it.
Latest comment: 1 year ago · 33 comments · 3 people in discussion
Hi. There's a bug in Module:links; embedded_language_links() has a data.sc param that's a script object, but it calls process_embedded_links(), which tries to access data.sc as if it were an array of script objects. In general I'm trying to make sense of Module:links (which you've completely rewritten), and it needs some cleanup; I can't figure out what some of the functions do, they're not well documented, and it appears there isn't good separation of concerns. This suggests it needs some major refactoring. BTW what about the still-extant bug mentioned above in "explicit fragment handling"? It's been almost 3 days since that bug was reported and it doesn't seem it should be hard to fix. Benwing2 (talk) 13:32, 30 August 2023 (UTC)
@Benwing2 I was actually writing a comment to you about these very issues, but it essentially boils down to the same thing. Once I've cleared up the obvious outstanding bugs I'm going to do a major rewrite, because the current model's got a lot of problems. My experience with the wikitext parser means I'm in a much better position than I was when I originally wrote these.
I've been thinking that we probably want to use string mixins for the text input, which would allow us to avoid the limitations of the string library. This would need to be done carefully, to avoid making language-specific stuff too complicated for the average editor to specify, but it would massively simplify the workload. Lua is extremely capable for this sort of string manipulation, so it would be good to take proper advantage of that. Theknightwho (talk) 13:41, 30 August 2023 (UTC)
@Benwing2 Character arrays that represent strings; there's probably a better term for them. You then iterate over the array and manipulate it with a lot more freedom than you can with the string libraries. If you really get under the hood, this is actually what the string libraries are already doing; literally in the case of ustring, while the native string library does it in C. However, they don't have things like lookaheads/lookbehinds, or-operators for strings of arbitrary length etc. Theknightwho (talk) 13:55, 30 August 2023 (UTC)
@Theknightwho Hmm, that is possible but IMO it's likely to be slow and hard to use in comparison with Lua patterns. The ustring libraries convert to PHP regexes, which use C under the hood. Nothing is being done in raw Lua; it would be really slow. Benwing2 (talk) 14:26, 30 August 2023 (UTC)
@Benwing2 It's how the parser works, and I've found it's extremely fast. With the newest optimisations, the parser now manages that stage of the page a in 0.15 seconds. It's the sort of thing Lua does best: it's about 10 times faster than ustring, and comparable to string (slower for raw characters, but faster for character classes). Theknightwho (talk) 14:28, 30 August 2023 (UTC)
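The "character array" approach amounts to paying the UTF-8 decoding cost once up front, e.g. (a sketch):

local function to_codepoints(text)
    local cps, n = {}, 0
    -- One pass over the UTF-8 string; everything afterwards is plain
    -- table indexing, with no further ustring calls.
    for cp in mw.ustring.gcodepoint(text) do
        n = n + 1
        cps[n] = cp
    end
    return cps, n
end

-- Lookaheads then become simple index arithmetic: cps[i + 1], cps[i + 2], ...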
@Benwing2 Thanks - I got my wires crossed. I've made some decent progress on this, using mwparserfromhell as a base. Performance still needs to be optimised, but it can currently parse internal + external links plus HTML tags, which are the trickiest parts. Doing it this way makes a ton of the old code redundant, and also means wikitext parsing can be dealt with in a self-contained way. Should be ready in a few days, but it'll probably take a couple of weeks to rewrite Module:links and Module:languages to take advantage of it. Theknightwho (talk) 20:08, 31 August 2023 (UTC)
Done:
Internal and external links, even when placed inside double square brackets.
HTML tags, including arbitrary classes and styles. This makes nested templates trivial to handle.
Invalid or disallowed tags are correctly treated as raw text.
Italics and bold.
HTML entities, which are decoded at the end of the parse to allow onward processing.
HTML comments.
Strip markers.
Lists (*, #, :, ;).
Horizontal rules (----).
Wikitables (of arbitrary size and complexity).
To do:
Headings.
Percent encoding inside links.
Decoding <nowiki> strip markers at the end of the parse.
Compressing multiple consecutive whitespace to a single space (i.e. what is actually displayed). This may need special consideration, as it might not always be desirable.
Pipe trick and trailing link suffixes.
Automatic detection of link prefixes.
Files, which can take multiple pipes + parameters.
Category and interwiki links, which need special handling.
Colon trick.
Not in scope:
Templates and parser functions.
Arguments.
Parser tags (e.g. <nowiki>). This means <pre> tags don’t have their nowiki property, which is the correct behaviour if they’re added via a module and not in raw wikitext; no other tags can be both.
Special handling of HTML comments, where new lines that only contain comments are fully deleted.
Anything listed as out-of-scope is handled by the other parser I’ve been writing, which I used as the model for this one. In short, this handles everything that’s done after template expansion.
Although it can cope with very complex inputs (many of which will be rare), they don’t add unnecessary load because it’s all done in a single iteration over the string, with the relevant functions being accessed only when necessary. The current designs of Module:links and Module:languages iterate over the string (at least) once for every type of processing that they do, which is one of the reasons they’re so expensive.
In some cases, we may actually want to change the behaviour away from the default parser (e.g. treating interwiki links as normal ones, or not treating [[sms:a]] as an interwiki link, but simply as a link to the page sms:a instead). The current design matches the default parser exactly, since that made sense as a starting point, but some of those assumptions may not make sense for the type of text that’s going to be input to templates. Theknightwho (talk) 15:53, 2 September 2023 (UTC)
@Benwing2 It’s on hold until I can work out how best to handle the template tree, which is built from the tokenised string. At the moment, it’s about 15-25% slower than the native parser on very large pages, but I think it should be possible to reduce that.
The overall design is that it does a minimal parse when constructing the template tree of each new template that it encounters, which means that it only parses as much as is absolutely necessary (e.g. on the first pass, neither outcome from {{#if:}} is expanded). That tree is then stored, and each time the template is called, it expands the parts of the tree that are relevant to the given arguments (so if the condition is true, it’ll expand only the first branch). Anything that’s permanent from that expansion (e.g. {{PAGENAME}} will never change) is supposed to be expanded in the cached tree as well, which means that any other templates that use that branch can take advantage of it. This is useful with {{t}}, for example, which checks the namespace every time it’s called to see if it needs to be categorised. Ideally, that check would only be done once, and later calls would simply get the appropriate bit to parse, bypassing that check. That’s the part that’s proving tricky to do, because it requires decoupling the specific pass from the cached tree in a way that doesn’t trash the cache or force the pass to do more work than it needs to.
@Benwing2 Yes, thunks sound like exactly what I need.
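A thunk in Lua is just a zero-argument closure that caches its result; a minimal sketch:

local function thunk(f)
    local done, value
    return function()
        if not done then
            value, done = f(), true
        end
        return value
    end
end

-- e.g. the namespace check in {{t}} would then run at most once:
--   local namespace = thunk(function() return mw.title.getCurrentTitle().namespace end)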
In terms of the postprocessor, percent encoding, headings and nowiki tags are now working, but there are a ton of weird and wonderful edge cases that need testing across the board.
I've also done some experimenting with the overall design, because I was getting concerned about the number of conditions that each iteration had to check against, to account for all the possible context states. Instead, it now uses a simple loop that constantly iterates over self.handlers, with a bunch of different handler tables being swapped into that key as contextually appropriate (see the sketch after this list). This means:
It's trivial to bypass checks which aren't relevant.
It keeps the code compartmentalised. This makes introducing new features much more straightforward, since there won't be a ton of obscure dependencies to be concerned about. This will be useful when it comes time to add support for things like ^ capitalisation, rubytext support, your angle-bracket syntax etc.
It's easy to avoid duplication, because the same handler function can be used in multiple handler tables. It's also possible to use object inheritance between handler tables, which is useful if you need to bolt on/swap out a handler temporarily, or to tell it something like "use X handlers as normal, but check Y once that's done".
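A minimal sketch of that handler-swapping loop (structure only; the real handlers are far more involved):

local Parser = {}
Parser.__index = Parser

function Parser.new(text, handlers)
    return setmetatable({ text = text, pos = 1, tokens = {}, handlers = handlers }, Parser)
end

function Parser:parse()
    while true do
        local ch = self.text:sub(self.pos, self.pos)
        if ch == "" then
            break
        end
        -- self.handlers is swapped for a different table whenever the
        -- context changes, so each state only checks what's relevant.
        local handler = self.handlers[ch] or self.handlers.default
        handler(self, ch)
        self.pos = self.pos + 1
    end
    return self.tokens
end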
One other thing to note is that Scribunto has a C stack size limit of 200, for some reason. Since the parser uses a ton of recursive xpcalls, it is actually possible to run up against it if you make the wikitext complex enough. This hasn't affected the postprocessing parser (i.e. this one) yet, but I did manage to find a real-world example which hit the limit for the preprocessor, which had 100+ nested uses of {{#if:}}. This is potentially a risk when it comes to HTML tags and/or tables, since they can be nested to arbitrary levels. Theknightwho (talk) 22:32, 9 September 2023 (UTC)
Hmm, 200 is not a big stack size. If this becomes an issue you might have to rewrite the code to use your own stack (which is always possible but sometimes a bit painful, depending on how your code is designed and how much branching there is). Benwing2 (talk) 22:59, 9 September 2023 (UTC)
@Benwing2 Thanks. I got around the issue with the preprocessor by explicitly returning the BadRoute object instead of throwing/catching it, but it makes the code more complex since you need to account for every return path, and they aren’t always obvious. We’ve got a lot more moving parts here, and it was too much of a headache to keep track of. Theknightwho (talk) 23:18, 9 September 2023 (UTC)
@Benwing2 I’m making decent progress on this, though it’s a little slow as I’m trying to be thorough while also minimising overhead, which has entailed rewriting things a few times. Some design questions:
How closely do we want to imitate the native parser? I keep running into situations where I’m having to write special cases to account for bugs in the native parser, which is a bit of a pain. It’s always going to be necessary to some degree, but there are some really shitty hacks that feel silly to replicate, because all the functions have clearly been written by different people at different times. For example, ISBN 978-0000000000 will work (ISBN 978-0000000000), but ISBN 978-0000000000 will not (ISBN 978-0000000000), despite both entities being for the same character. There are tons more examples like this, as every function seems to roll its own regexes and iterators for everything, so things like whitespace, multibyte-characters and HTML entities get handled very inconsistently. Obviously most of them won’t matter, but there’s always a risk that something actually depends on a bug like this.
What would the most useful structure for the output be, and what methods should we have for processing it? At the moment, it tokenises a Wikitext string into an array, with special objects to represent specific formatting (for example, the beginning of a wikilink or the separator between an external link and its display text). However, it might be useful to have more of a tree structure, where a link, HTML tag or whatever is packaged up in its own object in the string, and can be accessed or skipped as necessary. The best structure will be dictated by the methods we want, so it’d be good to have an idea of what those should be.
We’ll need some way of converting the array into an output string, but do we want to be able to recreate the exact input? The native parser and mwparserfromhell both place a heavy emphasis on exact round-trip conversion, but I think we’d be better off just normalising the output, since I don’t think it matters for our purposes if <br> becomes <br/> or whatever.
Thanks for all the work and sorry for the late reply. We should ask ourselves what the purpose of the parser is; AFAIK it's simply to reduce memory in page rendering and potentially speed it up, right? And do it less painfully than other possible solutions. That suggests that (a) we should not try to be bug-compatible with the native parser, and (b) it doesn't matter if we can recreate the input with the parsed structure since we're never going to do that (I can't imagine why they care in the native parser either; mwparserfromhell is a different story since it's specifically intended for people manipulating the wikitext and putting it back). I would say in general, if some edge cases are handled slightly differently, that's fine; people don't create pages by reading the (nonexistent) parser spec and coding to that, instead they try to see what happens when they code things a particular way, and it's probably unlikely there are any pages depending on these really edgy buggy edge cases. (And if there are, well, there are a ton of pages that are currently broken, so it's unlikely to matter if one extra page gets slightly broken. Furthermore, we should operate by enabling the parser at first only on an allow-list (whitelist) of memory-intensive pages, and then maybe only on pages with short names, and maybe never on all pages, so the "blast radius" will be less. The more memory-intensive pages get more eyes on them, so any behavioral differences are more likely to get caught.)
As for the output structure, I'm not sure; we should definitely enumerate all the potential use cases of the parser up front, to avoid having to do major redesign work if it turns out there's a significant use case we missed. Benwing2 (talk) 18:53, 18 September 2023 (UTC)
@Benwing2 It's partly about memory reduction, but it's also about removing the need to account for wikitext while manipulating input strings, and having a consistent, reliable way to know what a given string is for. That greatly simplifies the amount of work that needs to be done in Module:links, Module:headword and Module:languages, and also allows us to do more sophisticated processing in substitution modules (translits etc).
A major side-effect of that is memory reduction, which is the reason I started doing this in the first place, but it's definitely about more than that at this point: there's a lot of potential here, if we can get it right. Theknightwho (talk) 19:09, 18 September 2023 (UTC)
Edit: just to point out that this isn't the same as the preprocessor parser, which is solely about memory reduction. The savings with this one will be relatively modest by comparison. It's a totally different approach, as it's about building a modular structure for parsing strings which we can build features into. It's about code cleanup and reducing duplicated work, as it strips away a lot of the spaghetti code we have at the moment. Theknightwho (talk) 19:14, 18 September 2023 (UTC)
@Benwing2 I'm not sure it'd be possible to do a partial implementation, as it would act as a replacement for what's there at the moment. I'll let you know once it's in an alpha state so you can have a play around with it, but at that point we can get an idea of how the core modules would need to be modified. Currently, the output is in this format:
Text
(special token)
Text
(special token)
Text
Where text represents any number of characters as strings. Obviously it can get a lot more complex than this, and the special tokens are tables, so can store any amount of data (e.g. for representing complex HTML tags). Theknightwho (talk) 19:34, 18 September 2023 (UTC)
You definitely need to implement this in a way that we can transition bit-by-bit, even if that requires extra effort. There's no way something this major can be cut over to in one fell swoop. Benwing2 (talk) 19:43, 18 September 2023 (UTC)
@Benwing2 I’ve had some real breakthroughs with the template parser in the last few days, and it’s finally starting to outperform the native parser in terms of speed; the memory savings are still hovering around a ~75% reduction, which I suspect is a hard floor.
It’s in its 5th or 6th total rebuild at this point and not all of it has been converted to the latest version yet, so once I’ve updated all the parser functions (etc) into the current spec I’ll give you a buzz so that we can start doing proper testing. It’d also be good to get a couple of pilot pages set up soon, too, since we’ll need to decide how best to deploy it on pages. Theknightwho (talk) 03:59, 24 October 2023 (UTC)
@Benwing2 Good news! The memory limit's just been increased to 100MB, and the garbage collection is now much more aggressive, so the memory issues have now all but disappeared. That being said:
The template parser will still be useful if we ever need to circumvent the problem of passing information between invocations, so it's good to have it up our sleeve. Plus, no doubt memory issues will return in the future.
I'm going to keep ploughing ahead with the wikitext parser (i.e. the one that handles wikilinks, html etc), because the efficiency gains were only ever a secondary reason for it, and it has the potential to greatly reduce the amount of spaghetti code we have.
By the way, this change now makes it possible to introduce major features, and we can probably get rid of some of the serialisation, too. We still need to think about speed, though, since some of the very large pages are far too slow. There's a huge amount of scope for that, so I'm not too worried.
@Theknightwho Yup, I just checked and most memory errors are gone. I haven't checked the usage of some of the pages that used to be on the list but I imagine they have gone down significantly with the more aggressive GC'ing. Do keep working on your wikitext parser. Let's not introduce too many features right away and let's have them pass through a review stage prior, to make sure we don't end up using up the extra memory. Benwing2 (talk) 18:38, 24 October 2023 (UTC)
I think the GC changes haven't gone live yet, because they are still in "patch for review" stage and a new MediaWiki version will have to go live after the patch takes effect. It sounds promising so I look forward to it. I'm stunned that a developer has finally decided to help Wiktionary out, because I had given up hope. — Eru·tuon 18:57, 24 October 2023 (UTC)
Latest comment: 1 year ago · 14 comments · 4 people in discussion
You presumably changed something in the handling of standardChars; we have a large number of wanted categories of the form described above, which seems wrong. Benwing2 (talk) 07:45, 31 August 2023 (UTC)
@Benwing2 So when I changed the handling of standardChars a while back to support non-precomposed characters (e.g. Russian ѣ̈(jǒ)), I added two main rules for determining the categories:
If a character isn't composite, check whether it's included in standardChars, and add a category for it if it isn't.
If it's composite, check whether the combining character(s) are included anywhere in standardChars, and add a category for any that aren't (e.g. Category:English terms spelled with ◌́).
I kept a small number of languages on the old system because they weren't straightforward to convert: the most challenging being Hindi and Lao, as they're abugidas. I switched Hindi over last week. The new system means that terms with ं, ◌ँ and ः (among others) also get put in categories for the syllable. I don't think this is a major problem, but we could handle abugidas (etc) differently if we want to. Theknightwho (talk) 08:10, 31 August 2023 (UTC)
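As a sketch, the composite-character rule boils down to something like this (simplified; the real code covers more combining ranges and the category formatting):

local function nonstandard_combining_chars(term, standard_chars)
    local results = {}
    -- Decompose first, so diacritics can be inspected on their own.
    for cp in mw.ustring.gcodepoint(mw.ustring.toNFD(term)) do
        -- U+0300–U+036F is the basic combining diacritics block.
        if cp >= 0x300 and cp <= 0x36F then
            local ch = mw.ustring.char(cp)
            if not mw.ustring.find(standard_chars, ch, 1, true) then
                results[#results + 1] = ch
            end
        end
    end
    return results
end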
Chars with the anusvara (ं) should definitely NOT have categories because this is the standard way of writing nasal consonants. Likewise, chandrabindu ◌ँ indicates nasal vowels and should probably not have categories. These are standard parts of the spelling of Hindi. I would suggest that you change things so that no syllables with anusvara or chandrabindu generate categories. Benwing2 (talk) 08:46, 31 August 2023 (UTC)
That appears to be a design error then. You should be able to specify the chandrabindu and anusvara by themselves and have all syllables involving them be part of standardChars. Alternatively, use ranges. You should not have to enumerate every combined char. Benwing2 (talk) 08:52, 31 August 2023 (UTC)
OK. Maybe in the meantime you should switch back to whatever you had before the latest switch, since it wasn't leading to all these spurious categories? Benwing2 (talk) 08:56, 31 August 2023 (UTC)
@Benwing2 I can switch Hindi back to the old method, yeah, which is how things still are for Lao. We want to get rid of the special-case carve outs at some point soon, though, but it's lower priority than some of the other stuff I need to do first. Theknightwho (talk) 08:59, 31 August 2023 (UTC)
The char ः (visarga) is AFAIK not common in Hindi, only found in a certain number of Sanskrit borrowings, so maybe it should generate a single category (for the visarga itself, not for compositions of visarga plus anything else). @AryamanA, RichardW57, JainismWikipedian, Kutchkutch, Svartava, Vivaksha can maybe comment more (although many of them aren't currently active). Benwing2 (talk) 08:50, 31 August 2023 (UTC)
@Benwing2 Hi, on the visarga point, that's right. But there are a few affixes like निः- (alt for निस्, निश्) or -तः which have a handful of formations, though I wouldn't mind their inclusion in the category given the limited numbers. Svartava (talk) 09:24, 31 August 2023 (UTC)
@Benwing2: In my opinion, visarga is best treated as a stand-alone letter, except where it's part of a compound vowel symbol (most Thai, Lao, Tai Tham and New Tai Lue script writing systems) or possibly where it's a tone mark (most Myanmar script writing systems). For the New Tai Lue script, it's doubtful that the etymological visarga U+19B0 NEW TAI LUE VOWEL SIGN VOWEL SHORTENER should be considered a visarga. Like the Thai and Lao visargas, it has gc=Lo nowadays. --RichardW57 (talk) 11:09, 3 September 2023 (UTC)
On the other hand, the 'Pali virama' U+0E3A THAI CHARACTER PHINTHU can behave as a consonant modifier, like a hacek; i.e. it can be a nukta. Sometimes, though, it's part of a compound vowel character. --RichardW57 (talk) 11:09, 3 September 2023 (UTC)
@DCDuring Thanks - I see what you mean now. I’m currently doing a major rewrite of the link module that will mean you can use {{col3|en|] ('']'' spp.)|}} to achieve what you want.
I can type l|mul| almost as easily as mul:. More importantly, I still prefer the edit-window content of column templates to be in alphabetical order, to allow me (or any editor so inclined) to place on the same line hyphenated, open-spelled, and solid-spelled terms with the same morphemes/words. The same applies to other terms like yellow-throat wren and yellow-throated wren (a made-up, but illustrative name). DCDuring (talk) 14:40, 6 September 2023 (UTC)
@DCDuring Could you please let me know what formatting you’d prefer for it to work inside a column template? There’s nothing stopping you sorting the Wikitext into alphabetical order, but the existence of automatic sorting means it isn’t mandatory. Sorting will ignore any Wikitext formatting like italics or links, so you can still use those without it causing problems. Theknightwho (talk) 14:56, 6 September 2023 (UTC)
@DCDuring Please just answer the question. I’m trying to work with you here, but if you keep refusing to compromise then the changes are just going to happen without your input. Most users do not want to manually sort terms, and when Wonderfool dumps 100 unsorted terms into a list (which he often does), then it is not fair to force other users to spend time clearing that up by manually alphabetising them. He’s not the only one to do that, either. Theknightwho (talk) 16:33, 6 September 2023 (UTC)
The amount of manual work I would have to do with any of the ideas you have suggested is simply not worth it to me.
On a separate note, I’ve been wondering if we should have a special language code for taxonomic links. They should still be part of Translingual, but we could make them an “etymology-only language” (which in reality means it’s treated as a “dialect” or “variant” of Translingual). This would mean we could make the italics automatic, and we could potentially handle the work of {{vern}} and {{taxlink}} (etc.) automatically.
Taxa above the rank of genus do not have italics, except in kingdoms Bacteria and Cyanobacteria and in Viruses. Also, elements such as "var.", "f.", and "subsp." are not italicized. I'm not sure about some other elements, like candidatus.
Obviously the usual terminology like “etymology-only language” sounds silly in this context, but under the hood they simply work like customised versions of the main language (which in this case would be Translingual). Theknightwho (talk) 13:37, 6 September 2023 (UTC)
The language betrays it being a kluge, but if there were a specific foreseeable and worthwhile use for such a thing, we could do it. But even the logic of italicization requires extra manual input for many taxa, eg, rank and membership in the kingdoms that italicize supergeneric ranks. It doesn't seem worthwhile to me yet. DCDuring (talk) 14:40, 6 September 2023 (UTC)
@DCDuring The main issue is the name: “etymology-only language” should probably be renamed “variant”, which more accurately describes what they actually are. The logic of italicisation could be determined via entry scraping, in the same way that Chinese transliteration works. The most obvious solution would be to look for the {{taxon}} template at the target entry, which gives the rank. No doubt there will be edge-cases that require manual input, but that applies to anything like this. Theknightwho (talk) 15:03, 6 September 2023 (UTC)
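As a sketch, the scraping step could be as simple as this (a hypothetical helper; it assumes {{taxon}}'s first parameter holds the rank):

local function get_taxon_rank(pagename)
    local title = mw.title.new(pagename, 0)
    local content = title and title:getContent()
    if not content then
        return nil
    end
    -- {{taxon|rank|parent rank|parent|...}}: grab the first parameter.
    return content:match("{{%s*taxon%s*|%s*([^|}%s]+)")
end

-- e.g. italicise the link if the returned rank is "genus", "species", etc.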
@DCDuring Yes, so long as the information is given on the entries in a reasonably consistent way. Using a dedicated taxonomic langcode would mean only taxonomic links would be affected. Theknightwho (talk) 16:45, 6 September 2023 (UTC)
I'm not looking forward to, nor planning on, going through all the taxonomic names that appear in definitions, etymologies, or derived or related terms to add some indication that they are taxonomic names. Taxonomic names were not entirely welcome when I began inserting {{taxlink}}, so I removed the template entirely once the taxon entry was created. They are now bare links in most places they occur. Hence, my lack of enthusiasm. DCDuring (talk) 17:47, 6 September 2023 (UTC)
@DCDuring This is precisely why language codes matter, because it means we don’t end up in situations where we have to manually update everything with no way of knowing how many things there are to update. Instead, we can update everything at once. Theknightwho (talk) 15:26, 7 September 2023 (UTC)
Give me an example of some productive updating of taxonomic-name entries that can be accomplished by virtue of using a form of "mul" in a lang-code-needing column template. At best, they might help someone find the taxonomic names, at least if there were a distinct langcode, preferably of three letters, for taxonomic names. DCDuring (talk) 21:12, 7 September 2023 (UTC)
@DCDuring If you hadn’t removed them all, it would have been trivial to implement this.
I suggest you bring the issue of language codes in column templates up at the Beer Parlour, because it has long been widely-accepted practice that we don’t give bare links outside of definitions. If you want an exception to that, then it’ll need wider discussion. Theknightwho (talk) 00:45, 8 September 2023 (UTC)
If you would like to remove all the improperly templatized content, feel free. If you would like to insert the proper templating, feel free. It is a wiki, with voluntary contributors. DCDuring (talk) 09:23, 8 September 2023 (UTC)
I have to agree with User:Theknightwho here about bare links in column templates. I have written scripts to try and properly templatize bare links in lists in Synonyms, Derived terms and similar sections. I can try to do the same for bare links in column templates but there's only so much that can be done automatically. I also agree that we should eventually rename "etymology-only language" to something like "variant"; this was also suggested by User:-sche. In addition, I don't see anything kludgy about a 'vern:' prefix; for example, I've added support to quotation templates for Wikipedia and Wikisource prefixes in various guises. Benwing2 (talk) 18:21, 9 September 2023 (UTC)
more memory errors
Latest comment: 1 year ago · 6 comments · 2 people in discussion
Sorry to keep harping on this, but once again you've done some hacking on core modules and the number of memory errors has increased by 5-10. This seems to happen every time you do hacking on core modules and it simply cannot continue in this vein. We're now at 60 memory issues. This is not simply a "throw up your hands and blame Scribunto" issue as you tend to do, because I don't seem to run into this issue. I would strongly recommend you revert all your changes in the last 24 hours and try to think of a different way to do whatever you were trying to do. In particular, I told you before that adding new modules is not the way to go; you can add code to existing modules without materially affecting memory errors in most cases but new modules have a big overhead. Benwing2 (talk) 18:56, 11 September 2023 (UTC)
@Benwing2 The only change in the last 24 hours has been moving data from Module:utilities/data into two separate data modules, so that they don't get loaded unnecessarily. I also don't usually see big increases with new modules, unless it's done at a large scale. Theknightwho (talk) 19:46, 11 September 2023 (UTC)
We now have 67 memory errors when we had 54 yesterday. I strongly suspect your changes did it, even if you disagree. Please revert so we can see if this is what did it. Benwing2 (talk) 19:49, 11 September 2023 (UTC)
@Benwing2 Fine, but please stop acting like this stuff is really obvious, because we both know it isn't - intuition says this would have made no difference at worst. I only made the change 2 hours ago. Theknightwho (talk) 19:54, 11 September 2023 (UTC)
Yes, I agree it's not obvious, apologies if I made that implication. What I'm trying to emphasize is that you have to be cognizant that anytime you add code, convert something to Lua or break up a module into parts, it might increase memory, so you need to monitor CAT:E and think about alternative approaches. I'd like to see you more proactive here, so it's not me constantly nagging you to do something about memory increases (which I don't like to be in the position to have to do ...). Benwing2 (talk) 19:59, 11 September 2023 (UTC)
@Benwing2 You're right, and I do try to keep an eye on it. In this case, I didn't see how there could be any risk of a dramatic increase, since at worst the same data would be being loaded. Theknightwho (talk) 20:15, 11 September 2023 (UTC)
ancestors= vs parent in etymology languages
Latest comment: 1 year ago · 4 comments · 2 people in discussion
Can you clarify the difference between the two? In particular, all the various etymology-only Latin varieties have "la" as the parent but the appropriate parent in "ancestors". Why not put for example la-ren as the parent of la-new, and la-med as the parent of la-ren, and la-eme as the parent of la-med, etc.? I just created etymology-only varieties of Alemannic German (gsw), and I have for example Walser German (wae) with parent gsw-hst (Highest Alemannic German), which in turn has parent gsw. Is this wrong? Similarly, Early Scots (sco-osc) lists Middle English (enm) as its parent, but Northern Middle English (enm-nor) as an ancestor, while enm-nor lists enm as its parent. And similarly for Geordie English, which should logically have en-GB as its parent but instead has en as its parent and enm-nor as its ancestor (?), and doesn't list en-GB at all. The documentation says that ancestors= is only for the old ancestral-to-parent relationship, but obviously this is not the case. Benwing2 (talk) 04:21, 13 September 2023 (UTC)
Hmm, I suppose "ancestors" represents an ancestor (in time) relationship while parent= represents a containment (geographic) relationship. If so, then Geordie English should probably have en-GB as its parent and the others may be correct, but we really need to clarify the documentation as to how these relationships are used. Benwing2 (talk) 04:23, 13 September 2023 (UTC)
@Benwing2 The parent is the supertype (e.g. Early Middle Japanese ∈ Middle Japanese ∈ Japanese), while the ancestor is the predecessor. la-new is not a type of la-ren, so la-ren can't be its parent. la-ren is its ancestor, though.
The reason it gets confusing is because sometimes a child can be an ancestor or descendant of its parent. In the case of Latin, Old Latin is the ancestor because it's the progenitor of all other varieties of Latin, but we still want to group it under the Latin code. There's also the fact that a lot of the data is a mess, whether that's due to disagreements over how things should be arranged, bad source data, or misunderstandings over the technical terminology (i.e. the difference between ancestor and parent). This becomes particularly difficult when you have languages like Chinese or Persian, where the line between language, family and variant (etym-only language) becomes really arbitrary. Theknightwho (talk) 12:32, 13 September 2023 (UTC)
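Concretely, the Latin varieties would be encoded along these lines (an abridged, illustrative entry; field names and values are from memory, not the exact live data):

m["la-new"] = {
    canonicalName = "New Latin",
    parent = "la",            -- supertype: a variety grouped under Latin
    ancestors = { "la-ren" }, -- predecessor in time: Renaissance Latin
}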
Thanks. Shouldn't then Geordie English have en-GB as its parent? If you look at Category:Geordie English, you can actually see the whole trail of breadcrumbs indicating British -> English -> Northern England -> Northumbrian -> Geordie, where for whatever reason only the first and last have etym-only language codes. (The breadcrumbs aren't completely consistent in their usage of nouns vs. adjectives because they follow the naming of the categories.) We should probably have etym-only codes for all the intermediate varieties, although if we do this then at a certain point we need to split up Module:etymology languages/data (keeping in mind that splits don't always decrease memory, as in the above discussion; for example, User:Erutuon did an experiment once splitting the full language data into submodules based on the first two letters, which makes sense esp. for Module:languages/data/3/k with 600+ languages in it, but this actually increased memory on several pages). BTW I definitely agree with renaming "etymology-only language" to "variant" (or "variety", which is maybe better because it's more standard terminology, or "lect"); all three terms are shorter, clearer and less awkward than "etymology-only language". Benwing2 (talk) 20:50, 13 September 2023 (UTC)
Latest comment: 1 year ago4 comments2 people in discussion
hi. Module:languages make_language() is broken when an etymology language has an etymology language as a parent. In particular, la-con is being treated as an alias of la-new instead of a child, and same for la-eme being treated as an alias of la-med; although for some reason this doesn't happen with the children of ko-cen. Can you take a look? I'm having a hard time understanding how you've structured the stacks in Module:languages. Benwing2 (talk) 21:59, 14 September 2023 (UTC)
Latest comment: 1 year ago8 comments2 people in discussion
Sorry to be a dick and keep reverting on this one minor entry, but "now historical" is not tautological; it's actually quite important. There are many words that have always been historical in English (describing aspects of the Classical world, for instance Roman clothing or whatever), but marine acid air is not one of them; it is not historical in quotes from the 18th century, only in quotes from the late 19th century onwards. So "historical" would not be accurate (that is, if we had a more representative sample of citations!). "Now historical" draws attention to the fact that earlier citations are using the word in a current sense but later ones are not. Ƿidsiþ 07:24, 16 September 2023 (UTC)
@Widsith This applies to a large number (in fact, most) historical terms, and isn’t really relevant - we don’t label Austria-Hungary “now historical” either, since historicity is dependent on whether the referent actually exists. I also see only one citation, from 2004. Frankly, this suggests that you’re getting muddled between “historical” and “obsolete”, and a single cite using it historically isn’t very convincing, since obsolete terms are sometimes used that way. Theknightwho (talk) 10:24, 16 September 2023 (UTC)
Well, it's the distinction used by the OED and I've always found it sensible and useful. I would agree though that most "historical" terms are better off tagged "now historical", though certainly not all. I don't find the difference between this and a word that's obsolete confusing in the least, in fact I'd always assumed it was fairly self-evident. Ƿidsiþ 11:57, 16 September 2023 (UTC)
@Widsith That only makes sense if we place any value on whether the term has always been used to refer to something in the past, which is something that makes sense for an etymology section or defdate and not a label. Adding "now historical" doesn't seem to add anything that couldn't be better explained in a different way. There's also the fact that the term clearly is obsolete when used other than in a historical sense, so the distinction is very much not self-evident. Theknightwho (talk) 12:15, 16 September 2023 (UTC)
"That only makes sense if we place any value on whether the term has always been used to refer to something in the past" – well, quite. If you value Wiktionary's ability to be a historical dictionary (which I do) and are particularly interested in citations (which I am), then that is very much to the point. Ƿidsiþ12:53, 18 September 2023 (UTC)
Are you deliberately going for "patronising"? You might do me the courtesy of assuming I read what you wrote and considered the comment relevant anyway. But to spell it out, then, I think the question of whether a word has changed from current to historical over time is absolutely relevant information for a usage label on the definition line. It is with such labels in mind that we interpret the citation evidence. You may disagree, which is fine, but it's hardly a crazy idea, it's not something I've invented, and many dictionaries do it (or something similar). Ƿidsiþ 13:46, 18 September 2023 (UTC)
@Widsith I would have given you that courtesy if what you said wasn't directly addressed by the second half of the sentence. It may well be relevant information, but it's not information that's conveyed in a useful manner by putting "now historical", which is opaque at best. Theknightwho (talk) 13:50, 18 September 2023 (UTC)
thank you
Latest comment: 1 year ago2 comments2 people in discussion
Hey, just wanted to say thanks for the work you've done on etymology languages. There was a decision a couple of years ago to merge all the Prakrit varieties into a single Prakrit language and make the varieties etym-only languages. The lemmas were almost all moved under the Prakrit header, but the old full languages were not deleted. I'm fixing this up (a big job ...), and switching completely to etym varieties like this puts a lot of stress on the etym-language infrastructure. I'm sure before your changes this would have broken entirely. As it is I've had to change a bunch of modules/templates to accept etym languages when they didn't before, but overall the infrastructure is holding up (e.g. I can make a family have an etym language as its proto-language, and properly express the inheritance structure of the various modern Indo-Aryan languages back to the respective Apabhramsa and Prakrit varieties and then to Ashokan Prakrit and finally Vedic Sanskrit). Benwing2 (talk) 22:21, 16 September 2023 (UTC)
I suspect so, though I don’t think that’s a sensible error to throw - seems more like the kind of thing that should be dealt with via maintenance categories. I’ll have a look. Theknightwho (talk) 00:40, 17 September 2023 (UTC)
8 more memory errors
Latest comment: 1 year ago7 comments3 people in discussion
CAT:E is up to 64 now, from 55-56 yesterday. I think the new pages are kana, Malta, 구, 이, 기, 사, ϩⲱⲗ, ⲟⲩⲱϣ, ⲱϣ. Could your Japanese sort key changes have triggered this? Can you try reverting them and seeing if the memory errors on those pages go away? I'm not sure if the Coptic pages will be affected but I suspect the others will. Benwing2 (talk) 18:37, 19 September 2023 (UTC)
@Benwing2 I think it's unlikely - it was 61 when I made the change, as I noticed it had increased today. I'll investigate. The Coptic pages are time-outs, by the way, not memory errors. Theknightwho (talk) 18:39, 19 September 2023 (UTC)
Oh, but it is helping. It's an effective, but sloppy workaround to a long-term problem. You know, "more coding needed" isn't necessarily the solution for coding-based messes. P. Sovjunk (talk) 20:21, 19 September 2023 (UTC)
@P. Sovjunk If you had bothered to look at the output you'd see a bunch of errors in each split page; a lot of modules depend on the pagename being correct, and splitting it like this results in the pagename being wrong. Benwing2 (talk) 20:55, 19 September 2023 (UTC)
Latest comment: 1 year ago4 comments2 people in discussion
How is it "wrong" when there are 0 differences in pronunciation (putting secondary stress after the primary one is just a waste of space)? SummerKrut (talk) 21:15, 20 September 2023 (UTC)
Latest comment: 1 year ago4 comments2 people in discussion
Hey, I got blocked from Discord by CloudFlare for whatever reason. My Mac app was showing that there were messages that had failed to load, and I did Ctrl+R to refresh. I've already sent a report to Discord; is there anyone else I should notify in order to get unblocked?
Latest comment: 1 year ago5 comments4 people in discussion
Did you add this? If so, what does it mean? It's not at all obvious to me what the meaning is, so maybe there's a clearer name? Benwing2 (talk) 07:20, 8 October 2023 (UTC)
@Chuck Entz Thanks. User:MedK1 I think this should be called 'Requests for attestation of LANG terms' or similar. This follows the existing 'Requests for ...' categories and it states what is actually being requested rather than simply indicating that there's a warning. Benwing2 (talk) 19:40, 8 October 2023 (UTC)
I too thought it could've gotten a clearer name, but that was the best I could come up with... my bad. I agree that 'Requests for attestation of LANG terms' is much better! MedK1 (talk) 20:26, 8 October 2023 (UTC)
straight vs. curly apostrophes
Latest comment: 1 year ago4 comments2 people in discussion
I see you have been changing straight apostrophes to curly apostrophes in various places, e.g. in Module:ru-adjective. As a general practice, I don't agree with such things and the most recent vote concerning this failed with no consensus. One of the reasons for disagreeing with this is that e.g. if I type CAT:Russian adjectives with accent pattern a' into the search bar, it doesn't correctly get me to the category in question, but says the page doesn't exist. In this case, it's true that the "correct" name shows up in the search results below but I don't agree that this is the right thing to do, as I (and most others) have no idea how to type the numerous Unicode apostrophe variants on my keyboard or even know which variant is being used where. It feels like needless and counterproductive pedantry to insist on such changes. Since there is no consensus to make these changes, please don't do any more in the future. Thanks. Benwing2 (talk) 02:31, 18 October 2023 (UTC)
@Benwing2 That isn’t a straight/curly apostrophe change - it’s the modifier letter prime, which is used in our transliteration system for Russian and in Zaliznyak's work, from which we draw our system for classifying adjectives (the latter being the reason why I felt it appropriate to change it). It’s in the edit summaries, and I would oppose changing it back. Theknightwho (talk) 03:34, 18 October 2023 (UTC)
@Theknightwho IMO this is similar to forcibly changing the names of languages to contain curly apostrophes or any variant of them (including "modifier letter prime"). I oppose that change for the same reason, and I am not the only one. If you had asked me before changing it I would have definitely opposed making the change, meaning you shouldn't have made the change. I'm not going to revert your change at this point but definitely don't unilaterally make any similar changes in the future. Benwing2 (talk) 03:45, 18 October 2023 (UTC)
@Benwing2 I won’t change something like this unilaterally again, but this is definitely not just a stylistic difference (unlike straight/curly apostrophes, which I don’t care about). Theknightwho (talk) 03:50, 18 October 2023 (UTC)
Category links display
Latest comment: 1 year ago5 comments2 people in discussion
Sorry, but it doesn't work for me, no matter whether it is in skin-specific (Vector legacy) or all-skin JS and no matter the order with respect to other JS. I won't use it unless I really need it. DCDuring (talk) 12:36, 20 October 2023 (UTC)
@DCDuring Fair enough - I did try to see what the issue could be, but it's difficult to diagnose the bug without being able to see what you're seeing. I'll keep this in mind as a reason why JavaScript should only be used sparingly. Theknightwho (talk) 12:39, 20 October 2023 (UTC)
What I see is the two links to the left of the category item. When it first loads for a fraction of a second I see the links in their proper place to the right. You can, I believe, look at the other stuff I have in the various JS and CSS stashes and my gadgets. I have no Beta items running. I did purge the cached page today after every change I made. Maybe I should have done a hard purge. DCDuring (talk) 12:50, 20 October 2023 (UTC)
@DCDuring I should've been clearer - what I meant was there could be any number of weird factors going on in your local environment that are causing this, and it could also be caused by us using different skins, a gadget, another script (global or Wiktionary-specific), or some other difference in our wiki preferences that we haven't thought of and/or may have forgotten we enabled/disabled. While I'd like to solve the issue, it could potentially take a good deal of time for me to do so, so what I meant was that it'd probably be a lot quicker if I were there in person with you, and not on the other side of the Atlantic. I'll have another look later, and see what I can do. Theknightwho (talk) 12:57, 20 October 2023 (UTC)
I don't expect or need you to solve this if it is as complicated as it seems. Thanks for trying. I'll experiment with the JS scripts I have running and may no longer need. DCDuring (talk) 13:15, 20 October 2023 (UTC)
Definition of ба́нка as “guns”
Latest comment: 9 months ago5 comments3 people in discussion
For an English speaker unfamiliar with bodybuilding jargon who does not speak Russian, the definition of ба́нка (bánka) as “(in the plural, figuratively, slang) guns (muscles)” is confusing, as it is unclear (was to me, anyway) whether it means “guns” in the sense of “firearms”, for which perhaps some people use “muscles”. I tried to alleviate this by writing ‘muscles (“guns”)’, but you evidently found that worse. I have tried a different tack: “guns (in the sense of muscles)”. If you do not like that, can you make a better suggestion? PJTraill (talk) 17:10, 25 October 2023 (UTC)
@PJTraill The gloss goes in brackets, so it should be "muscles", not "in the sense of muscles". It's just how it's done. The reason it's that way around is because "guns" is the most accurate translation (i.e. it has the same connotation), and the gloss is there for clarity. Theknightwho (talk) 17:16, 25 October 2023 (UTC)
It is slang, so it can mean biceps, but also certain other muscles. There was even some discussion about it on the Internet. I have no clue what is right, but the meaning "biceps" seems to be common on the Internet. Tollef Salemann (talk) 16:59, 6 February 2024 (UTC)
Latest comment: 1 year ago4 comments2 people in discussion
I went to add DEFAULTSORT to a Japanese page and I got a message that it's now handled automatically. How does this work? In particular, I wanted to add it so that categories added by {{place}} automatically get the correct sort key, which otherwise need to have |sort= added to each invocation of {{place}}. The message said something about the headword handling this, but how exactly? The sort key is based on the pronunciation, and that is not easy to determine automatically. I am asking because I'm planning on renaming a topic category to a poscat category across the board, and I need to know if I need to worry about figuring out and adding the right Japanese sort key to {{cln}}. Benwing2 (talk) 05:03, 6 November 2023 (UTC)
@Benwing2 We shouldn’t ever use DEFAULTSORT, because sorting can’t really be language-neutral - Module:headword/data creates a sortkey for non-language categories which is based on a normalised form of the page name (accounting for things like unsupported titles). Have a look at Module:Jpan-sortkey if you want to see how it works specifically for Japanese-script languages; the essence is that it scrapes the reading from the headword. Theknightwho (talk) 05:15, 6 November 2023 (UTC)
@Theknightwho The use of DEFAULTSORT was restricted to pages with only a Japanese entry on them. I got your ping about secondary readings of terms with kanji in them; can you make a list of the conditions under which Japanese sort keys are needed and add it to the abuse filter message? Otherwise it's pretty confusing for someone like me who doesn't know the Japanese templates in detail. Benwing2 (talk) 05:33, 6 November 2023 (UTC)
@Benwing2 Having cleared a few thousand of them, I can say that restriction definitely wasn’t followed in practice, and any system which relies on no other languages being present breaks as soon as someone needs to add another language.
The only condition is when the relevant sortkey isn’t the first reading on the page, since that’s the one the scraper will use. It’s rarely needed, but I can add it, yeah. Theknightwho (talk) 05:37, 6 November 2023 (UTC)
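As a rough illustration of the scraping approach (a toy sketch only; the real Module:Jpan-sortkey works from the headword template and handles far more cases):

<syntaxhighlight lang="lua">
-- Toy sketch: take the first kana run found in the page's wikitext as
-- the reading to sort by. Because only the *first* reading is found,
-- a manual sort key is needed whenever the relevant reading isn't the
-- first one on the page.
local function scrape_reading(wikitext)
    -- hiragana U+3041–U+309F, katakana U+30A1–U+30FF
    return mw.ustring.match(wikitext, "[ぁ-ゟァ-ヿ]+")
end
</syntaxhighlight>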
When you've got time
Latest comment: 9 months ago4 comments2 people in discussion
Latest comment: 11 months ago4 comments2 people in discussion
We have a lot of stuff in CAT:E now, all related to Module:category tree/poscatboiler/data/language varieties and mostly to sign languages. This is code I wrote, but I haven't changed it since Oct 11 and the last change is a minor change you made on Oct 13. Since you just pushed your change to Module:templateparser and the language-variety code relies on this module, I suspect the breakage is somewhere in your code, which isn't compatible in some way with the former code. Can you take a look? There are > 200 errors now due to this. Benwing2 (talk) 08:36, 24 November 2023 (UTC)
@Benwing2 Thanks - I've worked out what the bug is: when the input string is an exact match for a template, it means that a template object is returned from the parse. When :iterate() is then called on that object, it (recursively) iterates over everything inside it, but doesn't return itself. This is a problem for categories which have {{auto cat}} etc. Theknightwho (talk) 09:09, 24 November 2023 (UTC)
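A heavily simplified sketch of the shape of that fix (the real parser's node objects are more involved than this):

<syntaxhighlight lang="lua">
-- Simplified: an iterator over a parsed node tree. The bug was that
-- the top-level node yielded only its children, never itself, so an
-- input that parsed to exactly one template (e.g. a category page
-- containing only {{auto cat}}) produced no results at all.
local function iterate(node)
    return coroutine.wrap(function()
        coroutine.yield(node) -- the fix: yield the node itself first
        for _, child in ipairs(node.children or {}) do
            for n in iterate(child) do
                coroutine.yield(n)
            end
        end
    end)
end
</syntaxhighlight>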
question about substitutions prior to calling translit/sort/etc.
Latest comment: 11 months ago11 comments2 people in discussion
I'm trying to implement scraping for Thai translit but I don't completely understand all the preprocessing you've implemented in Module:languages. Assuming I add Thai to contiguous_substitution, what exactly gets passed to the translit function? Essentially, I want to transliterate each space-separated sequence of Thai characters independently, except that linked text needs to be transliterated as a whole even if it contains spaces in it. There's also special processing for sequences of characters surrounded by single braces. So I need to know how I check for and pass through the PUA characters that you use in place of various formatting stuff, and how to check for the beginning and ending of internal links. You do something weird with PUA characters and apparently also with \1, \2 and \0, but I don't know what the state of things is when the text gets passed to the translit (or sort/display) function. Benwing2 (talk) 03:29, 30 November 2023 (UTC)
@Benwing2 I have a feeling that what you're asking for isn't fully possible, because the processing to PUA makes it impossible for the translit module to know what those PUA characters are standing in for, as they're essentially meant to be ignored, so they could be link brackets, but they could also be external links (unlikely) or style apostrophes (quite likely), among other things.
Would it help if we created a full bypass route for Thai, which goes straight to the translit module? I'm not keen on having that in the long-term, but we'll need to use one anyway once the wikitext parser is ready to be rolled out, since we'll need to do the roll-out gradually (as it'll require modifications to various lang-specific modules/data etc). Theknightwho (talk) 03:44, 30 November 2023 (UTC)
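To illustrate why the PUA substitutions are opaque to translit modules (the placeholder codepoints and substitutions here are invented for the example):

<syntaxhighlight lang="lua">
-- Before text reaches per-language code, formatting is swapped for
-- private-use placeholders; distinct kinds of markup all become
-- equally opaque, so downstream code cannot tell them apart.
local PUA_A = mw.ustring.char(0xF0000) -- hypothetical placeholder
local PUA_B = mw.ustring.char(0xF0001)

local function mask(text)
    text = mw.ustring.gsub(text, "%[%[", PUA_A) -- link brackets
    text = mw.ustring.gsub(text, "''", PUA_B)   -- style apostrophes
    return text
end
</syntaxhighlight>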
@Theknightwho Yes, that would be helpful. BTW the current way that Module:zh-translit works seems non-optimal; it is tightly coupled with Module:languages and works only based on the fact that separate transliteration sections are delimited with square brackets; any syntactic change would be difficult. Benwing2 (talk) 03:59, 30 November 2023 (UTC)
@Benwing2 Yes, exactly - the parser should solve that, as well as the tight coupling between Module:links and Module:languages. It's in a pretty good state at this point: I'd still like to add a bunch of methods to the output (since I want it to be as user-friendly as possible), but hashing out what makes most sense will require some trial-and-error work in converting existing modules to cope with the new format. There are still a few advanced things I haven't done yet, like the file format, but they aren't necessary at this stage. Theknightwho (talk) 04:11, 30 November 2023 (UTC)
@Benwing2 It can recognise when it's started parsing a file, but I've not done the special logic yet since file syntax is actually pretty complex (wikilinks in captions, newlines are allowed etc). For the moment I can set it to fail the route if it encounters a file, which means that part would get parsed as plain text. Not ideal, but it's no different to how things are right now. Theknightwho (talk) 04:33, 30 November 2023 (UTC)
@Theknightwho That sounds fine to me; as long as the link gets passed through as plain text and not mangled, it should be completely fine as most applications won't need to process such links. Benwing2 (talk) 05:17, 30 November 2023 (UTC)
In terms of going forward, we can certainly start rolling out the parser relatively soon, but we’d be working with a relatively “inert” output, in the sense that any further processing would need to be done by manually iterating over it, since the only method that it currently has is a special recursive __pairs metamethod. This can easily be used for stuff like link target/display text replacement and so on, but each section of text would need to be treated as pretty self-contained unless you do a lot of extra work to check other sections within the object.
The aim is to eventually have a bunch of methods similar to the string library, with a flag parameter for things like “display text only” etc. Ideally, this would mean you could have something like a'''b''', but the methods would treat it as ab, since the apostrophes would have been parsed as formatting. This would mean modules could be written to assume no formatting, but the info’s all still there if it needs to be used (e.g. Thai translit). Theknightwho (talk) 08:05, 30 November 2023 (UTC)
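A tiny sketch of that idea, with an entirely hypothetical node format: once formatting is parsed into nodes rather than left as characters, a "display text only" view falls out naturally.

<syntaxhighlight lang="lua">
-- a'''b''' parsed into a node list: strings are text, tables are
-- formatting. A text-only method simply skips the non-string parts.
local parsed = {"a", {type = "bold"}, "b", {type = "bold_close"}}

local function plain_text(nodes)
    local out = {}
    for _, part in ipairs(nodes) do
        if type(part) == "string" then
            table.insert(out, part)
        end
    end
    return table.concat(out)
end

assert(plain_text(parsed) == "ab")
</syntaxhighlight>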
Latest comment: 11 months ago5 comments4 people in discussion
I saw your edits about the transliteration of the Bashkir language, with an interesting edit summary - something along the lines that I'm doing transliteration that doesn't correspond to the language our people speak, etc. OK, I can understand you, because as far as I can see you are not a member of our people, and accordingly you do not know the language. Let me explain which phonemes correspond to each Bashkir vowel
(Cyrillic): А –
Ә –
Ө –
O –
Ү –
У –
И –
Э/Е –
Ы –
As you, and any thinking person, will see, there is a discrepancy between the Cyrillic Bashkir vowel letters and the corresponding IPA symbols. If you don't believe me, here's a video from the podium confirming it: https://youtube.com/shorts/UNqc_mcmJr4?si=jJTpo7qNwnPJ3Vb
(Sites: www.ipachart.com , www.glosbe.com , www.en.m.wiktionary.org )
But there is a project for a Bashkir Latin alphabet (transliteration), where the vowels correspond exactly to how the Bashkir language sounds. Don't think that no one uses it: wiki articles are transliterated with this alphabet, various kinds of literature are transliterated, and people simply communicate with it in general Turkic chats (by the way, it's very convenient; almost any Turkic speaker understands Bashkir written this way, and reads it very close to the Bashkir pronunciation. For example, the same Uzbeks, who with Cyrillic did not even understand how some letters are pronounced, to say nothing of correct pronunciation; there are terrible errors). I also personally communicate with well-known Bashkir linguists who work on the Common Turkic language (Öztürk tili) and of course on the Bashkir language itself, online dictionaries, etc. (for example, my friend Iskander Shakirov).
Let me show you this correspondence between the IPA and our Latin alphabet.
A –
Ə –
Ü –
U –
Ö –
O –
E –
İ –
I –
Example text
Cyrillic alphabet:
Рәсүл булһаң, ишан, әфйун урынына ғилем, әхлаҡ өләш. Ш. Бабич. Ленин, Сталин тәғлимәттәре әфйун кеүек ойотто әҙәм балаһының зиһенен. Т. Ғарипова.
Latin alphabet:
Rəsul bolhañ, işan, əfyon orınına ğilim, əxlaq üləş. Ş. Babiç. Lenin, Stalin təğliməti əfyon kiwik uyuttu, əðəm balahınıñ zehinin. T. Ğaripov
For comparison, the General Turkic:
Räsul bolsañ, ışan, äfyon orınına ilim, äxlaq üleş. Ş. Babiç. Lenin, Stalin tälimeti äfyon kibik uyuttı, adem balasınıñ zihinin. T. Ğaripov
+ This topic brings the Bashkir language closer to the other Turkic languages, because the purpose of Latin and transliteration is precisely so that speakers of other languages can somehow read text in any language. Of course, there are errors in automatic transliteration everywhere; in Bashkir and Kazakh (Kazakh grammar) there are problems with Arabic and Persian borrowings, as well as with European ones. I will give an example with the words Жәдиди (Йәдиди), Рух, Тарих. In the Kazakh Latin alphabet, these words come out as: Jädiydiy, Rıwh, Tarıyh. In Bashkir Latin as: Yədede, Rox, Tarex. But people who know how to program can fix this problem so that they are written in Kazakh as: Jädidiy, Ruh, Tarih, and in Bashkir as: Yədidiy, Rux, Tarix. This problem is solvable in principle, although some words in Kazakh and Bashkir will have to be written out by hand. In Bashkir there are also other problems, for example the words Ғаилә, Тәьҫир, Социаль, Донъя. In theory they are written in Latin as: Ğailə, Tə'cir, Sosial, Dunya. But for transliteration, you will have to write code so that "ъ" and "ь" are handled correctly in Arabic and European borrowings, and also so that two consecutive vowels, as in the word Ғаилә, are spelled correctly, and not as Ğaelə. Why did I mention the topic of auto-transliteration? Because I know that you have questions about the crooked transliteration of some words in Wiktionary. I also have questions about the Kazakh transliteration, and both Bashkirs and Kazakhs need proper programmers to write the transliteration code for Wiktionary.
But, fortunately, there are not very many such words, both in Bashkir and Kazakh.
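A minimal sketch of the exceptions-table approach described above (the module structure is hypothetical; the mappings are the ones given in the comment):

<syntaxhighlight lang="lua">
-- Check borrowings against an exceptions table before applying the
-- regular letter-by-letter rules.
local exceptions = {
    ["рух"]   = "rux",   -- default rules would give "rox"
    ["тарих"] = "tarix", -- default rules would give "tarex"
}

local function tr(text)
    local fixed = exceptions[mw.ustring.lower(text)]
    if fixed then
        return fixed
    end
    -- otherwise fall through to the normal letter mapping
end
</syntaxhighlight>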
I have a question for you: why did you pay attention to the Bashkir transliteration, but pay no attention to the current Kazakh one in Wiktionary? It's a complete horror. The word for yogurt is written as iogurt. And the Kazakh letter Щ, which stands for the sound of a double Ş (Ащы = Aşşı), is written as ştş (Ащы = Aştşy).
@Başqurd There have been a lot of discussions involving me, User:Atitarev and maybe some others on proper Kazakh transliteration. The problem is that the government-sponsored Kazakh Latin alphabet is a constantly moving target, and we've been reluctant to change things until it settles down. Benwing2 (talk) 08:25, 30 November 2023 (UTC)
@Başqurd: I know Kazakh translit is a mess but I don't have the knowledge to fix it, and it may need a lot of exceptions. We had a few discussions but no actions were taken. More than one module is also responsible for it. Anatoli T. (обсудить/вклад) 09:41, 30 November 2023 (UTC)
Is that pronunciation-spelling discrepancy in Bashkir the result of sound changes that took place after that spelling was established? Rodrigo5260 (talk) 22:09, 30 November 2023 (UTC)
No. This pronunciation has always been there. But in the USSR, when the Bashkir and other Turkic literary languages were being created, there was a policy of DISTANCING the languages from each other. When the literary Bashkir language was created, the policy was to set Bashkir apart by introducing a new alphabet that differed from the other Turkic ones (Kazakh, Tatar, etc.), and by standardising on the dialects most distant (geographically) from the other Turkic languages. Başqurd (talk) 05:13, 5 December 2023 (UTC)
References
^ История башкирского народа: в 7 т./ гл. ред. М. М. Кульшарипов; Ин-т истории, языка и литературы УНЦ РАН. — Уфа: Гилем, 2010. — Т. V. — С. 348. — 468 с. — ISBN 978-5-7501-1199-2.
^ Гарипова Ф. Х. Опыт языкового строительства в Республике Башкортостан. Уфа, 2006. — 170 с.
{{zh-der}} and {{col3}}
Latest comment: 11 months ago1 comment1 person in discussion
Latest comment: 11 months ago7 comments3 people in discussion
Hi. I see you added a bunch of intermediate families for Dravidian languages. I have the same concerns about this as when you added a bunch of intermediate families for Romance languages. In general, please get consensus before *ALL* additions of languages and families, even if it seems like an obvious gap to you. I don't know that much about Dravidian languages but I'm concerned that their internal structure may not be settled, so it may be premature to create subfamilies for groupings of languages that may still be contested. It looks to me like the internal structure you've imposed comes only from Glottolog (and from a paper published less than two years ago, at that). Just for comparison, Indo-European languages have seen *FAR* more study than Dravidian languages and their internal structure is still unsettled in many respects. Ping User:-sche for visibility. Benwing2 (talk) 02:14, 6 December 2023 (UTC)
@Benwing2 I actually made sure to avoid using anything which could only be traced back to Glottolog (e.g. I could find no independent uses of the "South-Western Dravidian" family). Instead, I tried to base it on how our Proto-Dravidian descendant sections work, though the final result possibly does make too many distinctions for languages which are quite poorly understood, and I have a suspicion that Glottolog is probably the underlying source for how those descendant sections have been laid out (via Wikipedia). That all being said, I preferred "Malayalamoid languages", "Tamiloid languages" (etc) for family names over "Malayalam languages" and "Tamil languages", because (a) it seemed dismissive to those languages, many of which already have trouble being seen as separate languages (as opposed to dialects), and (b) it creates the possibility of ambiguity and confusion. However, I did do this where I could find no other term outside of Glottolog (e.g. Gondi). I would also be interested to hear what @-sche thinks. Theknightwho (talk) 15:21, 6 December 2023 (UTC)
There were already discussions about adding codes for the upper proto-languages of the Dravidian branches in Category_talk:Proto-Dravidian_language (North, Central, South Central and South); some of them, like SD and SCD, were needed, and {{R:dra:DL}} also has reconstructed terms for them, many of which are endemic to those groups and can't be reconstructed to PD. The disagreements are on what to name Tamil-Tulu: in {{R:dra:DL}}, BK calls it South Dravidian I and Telugu-Kui South Dravidian II, but the common names are South for Tamil-Tulu and South Central for Telugu-Kui (was discussed here). BK also disagrees on a "Toda-Kota" branch, though it's accepted by others. AleksiB 1945 (talk) 10:01, 7 December 2023 (UTC)
proposal for changes to Module:links for Thai and Khmer
Latest comment: 10 months ago20 comments4 people in discussion
On a different note, I'd like to make a proposal for the changes I'm trying to implement for Thai and Khmer scraping translit.
WARNING: This started out short but ended up being very long as I expanded it with info on what I think is going on in Module:links and Module:languages and how to fix it. The TL/DR is we need some changes for Thai and Khmer, among which are being careful not to do the same processing twice (since it's not idempotent for these languages); but we don't need a fundamental rewrite.
To get started, Thai and Khmer need to scrape the respelling of individual words and pass that respelling through Module:th-translit/Module:km-translit to get the appropriate translit. This is conceptually similar to what you implemented for Chinese, except that I'd like to use single spaces to separate words instead of brackets, to minimize typing (<= 1 char per word instead of 4) and to maintain compatibility with {{th-x}} and {{km-x}}. Specifically, the languages support input text with the following special conventions:
Single spaces separate words for the purpose of translit scraping. In the corresponding display text (see below for what this means), there are no spaces between words but each word is linked.
Double spaces are used to represent actual (single) spaces in the display text, which (AFAIK) occur between clauses instead of words.
Respelling substitutions can occur in the input text. These are denoted with single braces, consisting of two parts separated by single or double slashes, i.e. {SOURCE/RESPELLING} or {SOURCE//RESPELLING} (the latter format is supported in order to allow for embedded slashes in the source or respelling). Spaces are allowed before or after a single brace as necessary if the brace directly abuts a double or triple brace of a template call or parameter substitution. The idea is that the SOURCE part shows up in the display text but the RESPELLING part is used during transliteration. This supersedes the |subst= parameter to usexes and quotes, and should be generalized for all languages with auto-translit. (See the toy sketch after this list.)
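A toy sketch of parsing these conventions (not working module code; real handling of braces abutting templates, links, etc. would be more involved):

<syntaxhighlight lang="lua">
-- Double spaces separate clauses (real spaces in the display text);
-- single spaces separate words; {SOURCE/RESPELLING} splits the
-- display form from the form passed to translit.
local function parse_word(word)
    local source, respelling = word:match("^{(.-)//?(.-)}$")
    if source then
        return source, respelling
    end
    return word, word
end

local function parse_input(input)
    local words = {}
    for clause in mw.text.gsplit(input, "  ", true) do
        for word in mw.text.gsplit(clause, " ", true) do
            words[#words + 1] = {parse_word(word)}
        end
        -- a literal space would be emitted between clauses here
    end
    return words
end
</syntaxhighlight>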
Now, let's start with some definitions of the different possible text variants:
The input text is what the user supplies in wikitext, in the parameters to {{m}}, {{l}}, {{ux}}, {{t}}, {{lang}} and the like.
The display text is the text in the form as it will be displayed to the user. This can include accent marks that are stripped to form the entry text (see below), as well as embedded bracketed links that are variously processed further. The display text is generated from the input text by applying language-specific transformations; for most languages, there will be no such transformations. Examples of transformations are bad-character replacements for certain languages (e.g. l or 1 to palochka?); and for Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions (see above).
The entry text is the text in the form used to generate a link to a Wiktionary entry. This is usually generated from the display text by stripping certain sorts of diacritics on a per-language basis, and sometimes doing other transformations. The concept of entry text only really makes sense for text that does not contain embedded links, meaning that display text containing embedded links will need to have the links individually processed to get per-link entry text in order to generate the resolved display text (see below).
The resolved display text is the result of resolving embedded links in the display text (e.g. converting them to two-part links where the first part has entry-text transformations applied, and adding appropriate language-specific fragments) and adding appropriate language and script tagging. This text can be passed directly to MediaWiki for display.
The source translit text is the text as supplied to the language-specific transliterate() method. The form of the source translit text may need to be language-specific, e.g. Thai and Khmer will need the full unprocessed input text, whereas other languages may need to work off the display text. It's still unclear to me how embedded bracketed links are handled in the existing code. In general, such embedded links need to be removed (i.e. converted to their "bare display" form by taking the right part of two-part links and removing double brackets), but when this happens is unclear to me. Some languages have a chop-up-and-paste-together scheme that sends parts of the text through the transliterate mechanism, and others (those listed in contiguous_substitution in Module:languages/data) receive the full input text, but preprocessed in certain ways. (The wisdom of this is still unclear to me.)
The transliterated text (or transliteration) is the result of transliterating the source translit text. Unlike for all the other text variants except the transcribed text, it is always in the Latin script.
The transcribed text (or transcription) is the result of transcribing the source translit text, where "transcription" here means a close approximation to the phonetic form of the spoken language, used in languages (e.g. Akkadian, Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and the spoken form. Unlike for all the other text variants other than the transliterated text, it is always in the Latin script. Currently, the transcribed text is always supplied manually by the user; there is no such thing as a transcribe() method on language objects.
The sort key is the text used in sort keys for determining the placing of pages in categories they belong to. The sort key is generated from the pagename or a specified sort base by lowercasing, doing language-specific transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it needs to be converted to display text, have embedded links removed (i.e. resolving them to their right side if they are two-part links) and have entry text transformations applied.
There are other text variants that occur in usexes (specifically, there are normalized variants of several of the above text variants), but we can skip them for now.
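The chain described above, in miniature (function bodies are placeholders; real behaviour is per-language and much more involved):

<syntaxhighlight lang="lua">
local function make_display_text(input)  -- input text -> display text
    return input -- most languages: no transformation at all
end

local function make_entry_name(display)  -- display text -> entry text
    -- e.g. strip the combining acute (U+0301) for languages that mark
    -- stress in display forms but not in entry names
    return (mw.ustring.gsub(display, mw.ustring.char(0x301), ""))
end

local function make_sort_key(pagename)   -- pagename -> sort key
    local s = mw.ustring.lower(pagename)
    -- language-specific transformations would go here
    return mw.ustring.upper(s)
end
</syntaxhighlight>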
In terms of how the various text variants match up to the custom transformation methods for Language objects, we have the following:
makeDisplayText: This converts input text to display text. It can remain as-is, although the params need to be documented. (In particular: What is keep_prefix and why does sc need to be passed in? What happens if you don't pass in a script? What are the three return values?) Also, we need to clean up process_embedded_links(), which ends up calling makeDisplayText first on the input text as a whole and then again on the display portion of individual embedded links, which won't work for Thai and Khmer.
makeEntryName: This converts input or display text to entry text. This needs some rethinking. In particular, makeEntryName is sometimes called on display text (in some paths inside of Module:links) and sometimes called on input text (in other paths inside of Module:links, and usually from other modules). We need to make sure we don't try to convert input text to display text twice, but at the same time we need to support calling it directly on input text since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input or display text; if the former, we call makeDisplayText ourselves.
transliterate: This appears to convert input text with embedded brackets removed into a transliteration. This needs some rethinking. In particular, it calls processDisplayText on its input, which won't work for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code; a lot of callers remove the links themselves before calling transliterate(), which I assume is wrong.
makeSortKey: This needs some auditing of existing callers and what they pass to it.
Finally, I took a look at Module:links and have some comments about process_embedded_links(). It is called in three different places, each of which also has a check for embedded brackets before calling it. We need to refactor things a bit so the check for embedded brackets happens in the beginning of process_embedded_links after calling makeDisplayText, with the callers calling process_embedded_links unconditionally. This should allow for the Thai and Khmer versions of makeDisplayText to insert links that weren't previously there.
One more thing to add is that we need to nuke the phonetic_extraction stuff, which is only used in Module:links for these same two languages.
Thanks! One more thing, I think all the functions in Module:links should be snake case. The vast majority already are; of the camel case functions, only getLinkPage() is exported. If you don't mind, I'll go ahead and clean this up. Benwing2 (talk) 08:07, 6 December 2023 (UTC)
@Benwing2 Hi Ben - thanks for this. So my thoughts are:
In terms of respellings, I would advise against using single braces (i.e. {SOURCE/RESPELLING}), because this has the potential to cause parsing problems when nested inside templates, since it raises the possibility of a template call ending with }}}. The parser prioritises arguments ({{{...}}}) over templates ({{...}}) irrespective of namespace, so if there's a way for it to interpret }}} as the closure of an argument then it will (e.g. {{{{...}}}} is { + {{{...}}} + }, not {{ + {{...}} + }}). This has the potential to cause confusing bugs - especially since it would only happen in rare, complex situations. Single square brackets are probably the best choice, since (a) it's still possible to distinguish them from wikilinks with a good parser, (b) there's very little chance of confusion with external links, and we can use escapes if necessary, because (c) wikilinks and external links are processed after module calls, not before, so it gives us a level of control that we don't have with braces.
That being said, I really like the idea of having a respelling syntax, since it obviates the need for manual transliterations, which may be problematic in other ways (e.g. all the manual Russian translits which use "y" instead of "j", which tends to happen in translation sections).
keepPrefixes relates to wikilink prefixes, since they're valid targets in links. In some cases, these need to be retained for the display form, since that's how they work in standard links (e.g. [[w:some link]] displays as w:some link).
I agree with your assessment of the four functions in Module:languages. I'm keen to solve a lot of these issues using the wikitext parser, because they allow for major simplification of the existing code, and it provides a level of flexibility that we don't have at the moment. For example, the issue of idempotency can easily be solved by setting flags for a given wikilink object within a string of text, which ensures that it doesn't get processed multiple times (even if the parent string is). The parser is now in a state that I'm relatively comfortable with (albeit without the string methods I'd mentioned), so it may be worth using Thai or Khmer as a testbed for this (I'd suggest Khmer, given it's smaller). Theknightwho (talk) 14:59, 7 December 2023 (UTC)
@Theknightwho @Wpi I think single (square) brackets will work. If we ever need to distinguish them from external links, the external links generally contain :/, which won't be in the respellings. The alternative is two-character sequences like (( and )), e.g. демилитаризация((дэмилитаризация)) to respell демилитаризация as дэмилитаризация. As for the wikitext parser, I'm not sure what you're proposing here; are you proposing rewriting Module:links using your wikitext parser? This seems like (a) a big task, and (b) something that may not get done for a while. When you say your parser is now in a state that you're comfortable with, are you referring to Module:templateparser or something private to you that's still a WIP? I already have written the scraping and parsing code for Thai (using single braces, although it shouldn't be that hard to change it to single brackets). So it might make sense to make the fixes now to Module:links and then deal with rewriting it using the wikitext parser when that's ready. Benwing2 (talk) 20:30, 7 December 2023 (UTC)
Hi. Just a note to @Theknightwho that I would be more comfortable starting with Thai rather than Khmer, as I am more familiar with Thai, have ready texts, sentences and cases to work with, and there are more special cases I know of for Thai, such as repetition characters, etc. Please see User:Benwing2/test-th-translit and the talk page, which is based on my original User:Atitarev/Thai translit test cases (you probably saw those). I won't insist on Thai vs Khmer, though. Anatoli T. (обсудить/вклад) 22:35, 7 December 2023 (UTC)
Knight summoned me here, so here's my 2 gold coins $0.02:
For respellings, I agree that curly braces are not ideal. I think (SOURCE/RESPELLING) or SOURCE(RESPELLING) would be a better option.
(the latter syntax is already used by Korean, in the form of either HANJA(HANGUL), where the Hangul is displayed as ruby over the Hanja, or HANGUL(HANJA), where the HANJA is displayed in parentheses after)
I think adding the previously discussed |r= would be helpful. {{l|LANG|TERM|r=RESPELLING}} would be equivalent to {{l|LANG|TERM(RESPELLING)}} above (in the case of the link consisting of only one term), but it would save some processing power on the regexes.
Apart from bad character replacements and Thai and Khmer, this would be useful for the various East Asian languages as well (e.g. automatic superscript for Cantonese tones, which is something I've been waiting for ages), and it would also mean we could deprecate {{ja-r}}.
TKW and I were discussing improving the code efficiency of Module:columns over on Discord, and it seems that format_list_items is inefficient in that it passes the terms one by one to full_link and does the checks (e.g. determining the script) many times (plus it is only used once in the rest of the module). It would be helpful if, say, the code first detected a mono-script language and then passed the parameter to full_link. This could be done with (a) some wrapper that handles some repeated actions when there are multiple terms; (b) a preformat function that returns a function that formats the link; or (c) a more logical approach (but far more effort) of having a links function that could handle multiple terms at once. In all of these approaches, it would also be useful for {{doublet}}, {{anagrams}}, {{synonyms}}, etc.
@Wpi BTW the uses of parens you're discussing are orthogonal to the respelling convention, which is intended for transliteration. Possibly we could overload the syntax to have a different meaning for Japanese and Korean (as ruby, etc.) but this would be special-purpose code specific to those languages (this is definitely possible by defining the appropriate makeDisplayText functions for these languages). Can you explain what you mean about automatic superscripting for Cantonese tones? Also, adding an |r= param is actually a lot more work than defining inline respelling, because it would have to be added in every module that supports |tr=, |ts= and the like (which is a lot of them). As for the efficiency of Module:columns, this is orthogonal, but I agree it could be made more efficient (although I'm not sure which repeated actions you're referring to; script checking in general is O(n) so doing it all at once for multiple terms might not be such an efficiency gain, and introduces complexities in the case where the different terms have different scripts). I actually think having a multi_full_link() function or the like is the best approach, because the logic would be encapsulated in one place (and it doesn't seem like it would be "far more effort" to me). Benwing2 (talk) 21:37, 7 December 2023 (UTC)
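A sketch of what such a multi_full_link() might look like (the signature is invented; findBestScript and full_link are the existing entry points, and this assumes the whole batch shares one script, which, as noted above, is the tricky case):

<syntaxhighlight lang="lua">
local m_links = require("Module:links")

-- Detect the script once for the batch, then reuse it per term.
local function multi_full_link(lang, terms)
    local sc = lang:findBestScript(terms[1])
    local out = {}
    for i, term in ipairs(terms) do
        out[i] = m_links.full_link{lang = lang, term = term, sc = sc}
    end
    return out
end
</syntaxhighlight>

A real version would presumably fall back to per-term script detection when the batch is mixed-script, which is where the complexity Benwing mentions comes in.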
@Benwing: I see. I was essentially saying that the syntax could be overloaded for ruby. Given the cons you mentioned, I agree that the syntax would be better.
Automatic superscript is basically wrapping the tone numbers with superscript. It's already present in {{t|yue|一|tr=jat1}}: 一 (jat1), but not in links {{m|yue|一|tr=jat1}}: 一 (jat1). In general there's some special functionality in Module:translations that ideally should be present in the other link templates as well. – wpi (talk) 03:43, 8 December 2023 (UTC)
Hi guys, @Theknightwho and @Benwing2. Please let me know if you're still interested/motivated to work on Thai or Khmer transliteration scrapers. Are there any serious problems or just the complexity? I don't understand the current hurdles and changes are not transparent but unfortunately, I probably won't be able to contribute at this stage.
@Benwing2, @Theknightwho: Hello, sorry to bug you again, but I just wanted to check if you're still planning to work on the transliteration scraper for Thai (and Khmer). It looks like you gave up on it. Is it too difficult, or did you lose interest? I thought it wouldn't be too different from the Chinese topolect methods. If making re-spellings work makes it too complex, I'm happy to drop this as a requirement. Anatoli T. (обсудить/вклад) 05:57, 16 January 2024 (UTC)
@Atitarev Sorry, I haven't given up but it's gotten displaced a bit in the priority list. It requires some rewriting of Module:links, and I need to take some time to understand the dependencies between this code and other code, since User:Theknightwho did a lot of changes to the code. Benwing2 (talk) 06:04, 16 January 2024 (UTC)
Must you?
Latest comment: 11 months ago22 comments2 people in discussion
There was nothing wrong with the edits you've reverted. I suspect that had anyone else made those edits, you'd not have paid them a second glance. You seem to be on the warpath. AP295 (talk) 02:57, 11 December 2023 (UTC)
1) I had made it more precise, indicating why the practice is employed (to loan out the money and generate interest)
2) The link was quite relevant, as anyone can clearly see by the fact that nominalization is a common feature of wooden language.
I feel as though you're trying to bait me into an edit war, because the edits were obviously fine and I've done nothing wrong at all for you to raise an issue about, nor will I. AP295 (talk) 03:06, 11 December 2023 (UTC)
@AP295 You conflated deposits with deposit liabilities, and we don't add generic, unlabelled links between vaguely related concepts because they're pointless clutter - in fact, the connection feels like a non sequitur, so I can only conclude you were simply injecting your personal feeling into the entry as though it were objectively important, as you do everywhere else. You just give the impression of being someone that always has to feel right - I've got no time for that, and neither do most people. Theknightwho (talk) 03:12, 11 December 2023 (UTC)
1) A deposit is a liability, as I understand. If there's a meaningful distinction to be made that I'm simply not understanding then make it, but don't rollback all of my edits to the page wholesale.
2) The concepts are quite related but fine, I won't add the link back. Again though, why did you rollback my changes to the definition itself instead of just removing the link? You've reverted my changes on two separate pages.
Because you're giving me a whole lot of grief and seem to bear a grudge. You probably figured you could get a rise out of me by unduly reverting some of the work I've done today and then feign indignance when I become upset about it. Seems obvious enough. AP295 (talk) 03:28, 11 December 2023 (UTC)
@AP295 The much more obvious reason is that I thought the edits were detrimental, and you find it difficult to accept criticism so come up with bullshit reasons to dismiss it as a personal attack. From the get-go, you've been making accusations about grand conspiracies against you, and frankly it's all a bit pathetic. Theknightwho (talk) 03:31, 11 December 2023 (UTC)
I've already explained why they weren't detrimental, and at the very least that there was no reason to roll them back entirely. AP295 (talk) 03:32, 11 December 2023 (UTC)
So yes, I think you're being unfair, and quite clearly trying to provoke me by unduly rolling back a bunch of my edits all at once. AP295 (talk) 03:36, 11 December 2023 (UTC)
@AP295 No, you've explained why you think they weren't detrimental and (as usual) think your own opinion is objectively correct. We might disagree as to whether they were detrimental, but the way you consistently play the victim when you don't get your own way is just childish and manipulative. Enough with this - I have better things to do. Theknightwho (talk) 03:41, 11 December 2023 (UTC)
You say that, but then you seem to appear out of the woodwork to contradict me whenever the opportunity arises. You know I really have nothing against you, I don't understand why you must be so wretched toward me. AP295 (talk) 03:45, 11 December 2023 (UTC)
Hopefully you find this a satisfactory compromise for fractional reserve banking. Banks do this precisely so they can lend out others' money at interest; this must be part of the definition. The Wikipedia links are also quite relevant. AP295 (talk) 20:03, 11 December 2023 (UTC)
About deposit liabilities: You are correct that deposits are not deposit liabilities, so that was my careless mistake. I think it's best just to remove the term "deposit liabilities", since it's quite misleading. Banks can lend out more than just what their customers have deposited. I suppose I knew this, but I wasn't thinking. I've since changed the definition to "a practice wherein banks lend out more money than they hold in reserve", which is altogether more straightforward. I am upset that I missed this, but it shows that the definition was pretty bad before I touched it in the first place. Before I made my edits, the entry did not even mention that the entire object of fractional reserve banking is to loan out extra money, so I felt and still feel it's necessary to specifically say that the point is chiefly to generate debt and interest. I should have just replaced the definition entirely rather than trying to rework the bogus one that was already there. At any rate, I stand by everything else besides my mistake about deposit liabilities. If you already knew all of this, then why didn't you suggest something in the vein of my current definition? The original definition, before I ever touched the page, was unclear and misleading, and that's what I was trying to address. Even if you didn't have time to help out, you could have just as easily added the word "liabilities" and made a note in the edit summary. This would have been easier than rolling back everything (which you knew I'd complain about) and using that singular mistake to justify the whole thing. It was my intent to clarify the entry; just have a look at my essay on the subject. You did not have to drag me into an argument at all. You've given me a lot of grief but act like I have no right to suspect you're interacting in bad faith. AP295 (talk) 09:13, 12 December 2023 (UTC)
You've reverted my edits again. Why? I've addressed and conceded the one semi-legitimate point you made among the prodigious litany of bullshit you've laid upon me. AP295 (talk) 10:41, 12 December 2023 (UTC)
@AP295 Because they were bad edits, seemingly motivated by a desire to feel like you were right all along about removing the term “deposit liabilities”. Theknightwho (talk) 10:46, 12 December 2023 (UTC)
Hardly, I've already conceded the point. I removed the phrase because I think it's misleading, and I explain why on the talk page. It's hardly collegial or civil to say that so many of my edits are motivated by personal inadequacies. Please address the point. AP295 (talk) 10:49, 12 December 2023 (UTC)
@AP295 It isn’t misleading - nothing in the definition says that deposit liabilities make up the entirety of what a bank lends out, and no reasonable person would think it implied that. It was clearly just an excuse for you to remove it, and the end result was a definition that was extremely vague. I’m not interested in your endless post hoc rationalisations for never being wrong about anything - it’s boring. Go away. Theknightwho (talk) 10:57, 12 December 2023 (UTC)
Sorry, but that's not fair at all. "deposit liabilities" is jargon, which the guidelines explicitly say to avoid: 4. Avoid specialized terms. You're being a complete prick about this. AP295 (talk) 11:01, 12 December 2023 (UTC)
@AP295 First you claim it was misleading (it wasn’t), and now you claim it’s jargon (it’s not) - almost like your real problem is that you want to feel justified in having removed it. Please get off my page. Theknightwho (talk) 11:12, 12 December 2023 (UTC)
It is both, and you're stonewalling me. I've started a thread in the tea room and hopefully someone will help to mediate our dispute. You're being obstructive and obviously don't give a damn about the points I'm making. I'll get off your page if you want, but I expect that you'll help resolve this dispute in the tea room or let the edit go through, not simply revert my edits and then tell me to screw off. AP295 (talk) 11:15, 12 December 2023 (UTC)
@AP295 Nothing on Earth could convince you you’re wrong, so I won’t waste time trying. I am now blocking you from this talkpage, as I have repeatedly told you to go away. Theknightwho (talk) 11:18, 12 December 2023 (UTC)
Latest comment: 11 months ago · 9 comments · 4 people in discussion
Module:R:Perseus has suddenly stopped recognizing Ancient Greek ὁ (ho) (U+1F41) as (Polyt) Greek, and this is the only thing in the transclusion list that's been changed in the past 24 hours. Looking through the Ancient Greek single-character transclusions, I've found a few with a similar problem: Ancient Greek ἤ (ḗ) (U+1F24), Ancient Greek ἵ (hí) (U+1F35) and Ancient Greek ὅ (hó) (U+1F45), whereas Ancient Greek ἆ (â) (U+1F06), Ancient Greek ἕ (hé) (U+1F15), Ancient Greek ὥ (hṓ) (U+1F65), Ancient Greek ὦ (ô) (U+1F66) and Ancient Greek ᾗ (hêi) (U+1F97) have no problems. You'll notice that {{m+}} is displaying no transliterations for the ones with the problem, so it's obviously something deeper than just the Perseus module. I'll leave it to you to figure out - I don't have any more time or energy to spend on this tonight, and I'm sure you went to bed hours ago. Thanks! Chuck Entz (talk) 07:36, 11 December 2023 (UTC)
I have to ask, what was the point of all the process_range stuff you added recently? What was wrong with the previous way of doing things that required this change? Have you considered the memory implications of your change? As usual I have no idea why you're doing what you're doing. Benwing2 (talk) 08:37, 11 December 2023 (UTC)
Sorry, maybe that was a bit harsh - it was born of frustration at seeing you constantly make changes without running them by me, and at not understanding why you're doing them. In general it would help a lot if you could lay out your plans for future changes to core modules, so I'm not left trying to guess at what you're doing. Benwing2 (talk) 11:26, 11 December 2023 (UTC)
The old approach makes testing for scripts slow, since it relies on mw.ustring.gsub with extremely complex patterns. This isn’t the end of the world, but it becomes a real problem when dealing with large column templates with hundreds or thousands of terms - this is particularly an issue on big CJK pages, but could happen with any script. The new approach explodes the string and uses the ranges field to check each codepoint in turn, which is considerably faster (sketched after this list). I haven’t seen any noticeable impact on memory, but clearly I must have made a mistake somewhere with the Greek ranges.
The old patterns were completely opaque and a nightmare to maintain.
In some cases, Unicode normalisation (which is automatic on saving the page) meant that characters were getting swapped around, resulting in the patterns being wrong. Getting around it is possible, but (a) awkward and (b) isn’t something most of us know you need to do.
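To make that concrete, here is a minimal sketch of the ranges-based check, assuming the Scribunto environment and a flat array of inclusive codepoint ranges - the field name and data shape are illustrative, not the exact format used in Module:scripts/data:

    -- Count how many codepoints of `text` fall within a script's ranges.
    -- A real implementation would keep the ranges sorted and binary-search
    -- them; the linear scan here is just for clarity.
    local function countInScript(text, ranges)
        local count = 0
        -- Explode the string into codepoints once, instead of running
        -- mw.ustring.gsub with a huge character-class pattern.
        for _, cp in ipairs({mw.ustring.codepoint(text, 1, -1)}) do
            for i = 1, #ranges, 2 do
                if cp >= ranges[i] and cp <= ranges[i + 1] then
                    count = count + 1
                    break
                end
            end
        end
        return count
    end

The point is that each codepoint gets checked with plain arithmetic comparisons rather than a complex pattern match.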
Thanks. However, are you sure that exploding the string into codepoints makes things faster? This is counterintuitive; normally, it's better in interpreted languages to push the vectorized operations down to C, which is compiled and hence a lot faster than an interpreted language like Lua. Benwing2 (talk) 22:56, 11 December 2023 (UTC)
@Benwing2 I think that would probably be true if we didn't have to deal with the callbacks into PHP, since they seem to be the main reason why the ustring library is really sluggish. Theknightwho (talk) 23:05, 11 December 2023 (UTC)
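For illustration, the usual way to sidestep those callbacks is an ASCII fast path - this is a sketch, not quoted from any actual module:

    -- If the text contains no bytes >= 0x80, it is pure ASCII, so the
    -- native byte-based string library gives the same answer without
    -- leaving Lua; otherwise, fall back to the slower ustring library.
    local function fastLower(text)
        if not text:find("[\128-\255]") then
            return text:lower()
        end
        return mw.ustring.lower(text)
    end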
@Benwing2 Related to this, I think we need to revise how we treat Hant and Hans, because we've got the silly situation that every Sinitic language which uses both ends up doing triple the work it needs to be doing during findBestScript (since they all have Hani as well, which is necessary for terms which are the same in traditional and simplified).
To solve this, I propose the following:
Introduce a new Hants script, which is a "dynamic" script code, to be used with languages that switch between traditional and simplified Chinese depending on the location. This is necessary because:
We can't use Hani, since not every language which uses it switches between traditional and simplified (e.g. Middle Mongol).
Having Hani, Hant and Hans specified individually makes it difficult to avoid the triplication of work, since findBestScript runs for each script in order and treats them in a self-contained way. Even with the triplication, we already have to do a bunch of special-case logic due to the unique way Hant and Hans determine character counts, since we can't use a Lua pattern (too complex).
A handful of languages only use Hant (and in theory, a language could only use Hans), so it's not appropriate to add this to either of those.
Add the key findBestScript in Module:scripts/data, which points to a module that can be called by the findBestScript function (similarly to how translit modules work); a rough sketch follows below. This module would then house the special-case logic for traditional and simplified Han, and would be written in such a way as to avoid repeating work. The function would always return two values: the character count and the highest-scoring script object, which would be Hant, Hans or Hani. If a script doesn't have a findBestScript key, then the count is determined using a default fallback method.
This might need some rethinking if we end up with lots of different dynamic "scripts", but in the medium term I think this works, since it avoids having to completely rewrite how languages handle multiple scripts. Also pinging @Wpi, to whom I also floated this idea. Theknightwho (talk) 19:32, 12 December 2023 (UTC)
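As a rough sketch of the proposed shape - every name below is hypothetical, since none of this exists yet. The data entry might look like:

    -- In Module:scripts/data (field and module names invented):
    m["Hants"] = {
        "Han (traditional and simplified)",
        findBestScript = "Hani-best-script",
    }

and the dispatch inside findBestScript would mirror how transliteration modules are resolved:

    -- Accessor name invented for illustration.
    local handlerName = sc:getData().findBestScript
    if handlerName then
        -- One pass over the text returns both the character count and
        -- the winning script object (Hant, Hans or Hani).
        return require("Module:" .. handlerName).findBestScript(text, lang)
    end
    -- Otherwise, fall back to the default counting method.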
@Theknightwho This sounds OK to me on first glance although I haven't thought about it deeply. Definitely you want @Wpi involved, who proposed a different sort of overhaul of the Han script codes. Benwing2 (talk) 00:35, 13 December 2023 (UTC)
I think this makes sense for the most part. There are two ways to merge this with my split-by-country proposal - either:
findBestScript also detects Hantw and Hanhk, which share 99.9% of the character set (in the standard; in practice many characters are used interchangeably), meaning that most of the time one could tell only that it is Hant, not which subset. Also considering that the main difference between them is fonts, I think the benefit of doing this is very marginal.
Hantw and Hanhk (and other location-specific codes) can only be specified in |sc= and would never be returned from findBestScript. i.e. they would be used only when we want to be more specific.
It might also be desirable to run a check on the language when |sc=Hani, so that the more specific script code is used, e.g. {{l|vi|sc=Hani}} automatically becomes {{l|vi|sc=Hanvi}}. I'm not exactly sure how this can be done (or they would need to be bot-converted); a sketch of one possibility follows.
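Something along these lines might work for that last point, though the mapping below is purely illustrative:

    -- Hypothetical per-language overrides: if the generic Hani code was
    -- requested, substitute the language-specific Han code where one exists.
    local narrowHani = {
        vi = "Hanvi",
    }

    local function refineScript(langCode, scriptCode)
        if scriptCode == "Hani" and narrowHani[langCode] then
            return narrowHani[langCode]
        end
        return scriptCode
    end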
Latest comment: 11 months ago · 1 comment · 1 person in discussion
Hi, I am trying to sort out what is going on in Module:links so I can make the changes needed for Thai and Khmer. Definitely more documentation is needed, and the code could stand a rewrite; some of it is approaching spaghetti code (probably due to the way it has evolved over time). I am in particular trying to understand makeDisplayText(): how it works and where it's used. It appears it's currently only used for "bad character" substitution, like changing straight apostrophes to Unicode apostrophes or l and 1 into palochka (is that right?). Some questions:
How does iterateSectionSubstitutions() (called by makeDisplayText()) handle embedded links? If they are two-part links, does it send only the display portion of the link through the processing function? If they are one-part links, does it send the whole link through? And what if there is a one-part link with a fragment?
The keepPrefixes parameter: can you clarify again what it does and under what circumstances it's invoked? Why, for example, does make_link() check whether the link target and display form are the same, and set keepPrefixes only if they are different? This seems rather mysterious.
In embedded_language_links, line 498, directly before calling makeDisplayText, it does this: text=text:gsub("%%","%%25"). The comment says "FIXME: Double-escape any percent-signs, because we don't want to treat non-linked text as having percent-encoded characters.", which is opaque to me (a minimal demonstration of the mechanics follows below). Is this related to the bug with message IDs in quotations containing % signs that I reported? If so, why is this escaping done only here, and not every place in Module:links that makeDisplayText is called, i.e. in make_link and process_embedded_links?
Another question: what is the purpose of plain_link and where is it used? It's completely undocumented, but I see it forces the language to Undetermined, so presumably it is used for some sort of link generation when you don't want language-specific processing to happen.
Finally, I am thinking of allowing things like makeDisplayText() and makeEntryName() to take an object in place of a string for the text being processed, which will facilitate keeping track of the different text variants (see my comment above and the long comment I added to the top of Module:languages). This idea isn't fleshed out yet, though. Benwing2 (talk) 09:08, 11 December 2023 (UTC)
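On the percent-escaping point, here is what that gsub does in isolation (a demonstration only - it doesn't answer why the escaping happens in just one place):

    -- In a gsub pattern, "%%" matches a literal percent sign; in the
    -- replacement string, "%%" produces one. Each "%" in the text therefore
    -- becomes "%25", its percent-encoded form, so that a later
    -- percent-decoding pass restores the original text instead of
    -- misreading stray "%xx" sequences.
    local text = "50% off"
    print(text:gsub("%%", "%%25"))  -- prints "50%25 off" and the match count 1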
Need help
Latest comment: 11 months ago · 1 comment · 1 person in discussion
Latest comment: 10 months ago · 9 comments · 3 people in discussion
The page a has managed to jump from 77MB to 84MB, and I suspect one of your "optimizations" is the culprit. Can you please do a bit of investigation and see whether your ranges change, or this change (which you previously reverted due to memory issues, then promptly restored once we got some memory buffer), or changes to Han serialization, or some other recent core module change did this? If necessary, revert your changes temporarily and see whether it goes down again. We *CANNOT* have big memory increases like this as a result of core module changes, and we cannot simply blame them on vagaries of the Lua implementation. (BTW in the past you've tended to blow me off when I've asked for such investigations; please don't do it again.) Benwing2 (talk) 04:41, 15 December 2023 (UTC)
@Benwing2 One thing that we've generally overlooked is that serialisation is much, much slower than table lookups, which is also a major problem on pages like a - we absolutely have to balance the two issues, because otherwise we'll end up saving memory only for the bottom of the page to become invisible again because it takes longer than 10 seconds to load. Theknightwho (talk) 04:45, 15 December 2023 (UTC)
@Benwing2 Just to be clear about what I was doing: the change you linked and the major changes to Module:Hani-sortkey were about eliminating the slowdown caused by serialisation to the best possible extent. In the same vein, my recent rewrites of Module:zh-translit and Module:zh-see were about prioritising speed/accuracy.
The biggest impact on memory is going to have been the creation of Module:data/entities (named HTML entity lookup), Module:data/interwikis (normalised interwiki prefix lookup) and Module:data/namespaces (normalised namespace lookup). However, this was in the context of the memory buffer having doubled, while we were pushing 9.5 seconds+ on some of the largest pages; a was frequently timing out, even after the lite templates were replaced (due to loading time variance). Each of them eliminates wasteful work that was otherwise being done, and they're the main reason why a now loads in 8-9 seconds (looking only at Lua loading time). Theknightwho (talk) 05:25, 15 December 2023 (UTC)
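For what it's worth, the way those lookup tables get used is roughly the following (the module name is from above, but the table's internal shape is my assumption):

    -- mw.loadData parses the data module once per page and hands every
    -- caller the same read-only table, so hundreds of invocations share a
    -- single copy instead of each rebuilding the lookup.
    local entities = mw.loadData("Module:data/entities")

    local function decodeEntity(name)
        -- Assumed shape: entity name -> replacement character,
        -- e.g. entities["amp"] == "&".
        return entities[name]
    end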
OK, but the bump from 77M to 84M is recent (within the last week or so), and a wasn't timing out before then, and the Module:data submodules you're linking to have last-changed dates of October. So it wasn't those modules that caused the huge memory bloat. It's something you did recently. Benwing2 (talk) 10:26, 15 December 2023 (UTC)
Also, I've asked you many times to run by me any major changes you're planning to make to core modules, yet you continue to make them unilaterally. I am letting you know when I plan to make major core module changes, and I'd expect the same courtesy from you; otherwise I can't keep track of what you're changing and why (since your changelog messages are very sparse and you don't bother including explanatory comments at the top of modules the way I do). Benwing2 (talk) 10:29, 15 December 2023 (UTC)
@Benwing2 I have a feeling this has been down to people gradually replacing the lite templates in things like conjugation templates. I just did a mass replacement, due to the speed issues the lite templates cause, and it's caused a to spike to 90MB. I could reverse these changes (I don't mind), but in all honesty this simply exposed an issue that was already there, rather than creating a new one. Theknightwho (talk) 20:54, 18 December 2023 (UTC)
I, on the other hand, appreciate your hard work (even though I haven't the foggiest notion of what exactly you're doing or why you're doing it!). Keep on editing and making WT a better place, and let the mild riling you're causing Benwing be a mere side effect. Denazz (talk) 10:41, 15 December 2023 (UTC)
Mongolian hidden -n
Latest comment: 10 months ago · 2 comments · 1 person in discussion
Since I have found a few examples of Mongolian borrowed words with hidden -n on the Internet, we can now discuss their usage. May I have your advice, please? Hidden -n forms both attributive (before nouns) and oblique (before postpositions) cases, but for loanwords the former seems to be quite rare in actual use. I found forms attested as танкан дээр, автобусан доторх, банкан дахь, but just танк цэрэг, автобус билет, банк хороо. Please help me if you’re familiar with those rules! My edits might be incorrect, but I didn’t want to be a vandal. Thanks. LibCae (talk) 17:02, 22 December 2023 (UTC)
Latest comment: 10 months ago · 2 comments · 2 people in discussion
@Theknightwho Hello again. I saw you used the term ‘privative’ for Mongolian -гүй. What do you think about the contrast between ‘possessive’ -т(ай) and ‘privative’ -гүй offered by Juha Janhunen in his Mongolian, p. 109? @Nominkhana arslang Do you like Janhunen’s naming? At least Mongolists know the differences between the functions, and they don’t confuse them with English with or Russian с. LibCae (talk) 10:34, 26 December 2023 (UTC)
@LibCae It's better if we make our terminology more "international" whenever possible. "Possessive" is widely used in linguistics for a syntactic function similar to the genitive, and I think using it may confuse readers. Nominkhana arslang (talk) 11:07, 26 December 2023 (UTC)
About the language module and its sortkeys
Latest comment: 10 months ago1 comment1 person in discussion
Hi. I'm considering an implementation of Module:languages and Module:list of languages in the Spanish Wiktionary. Basically, I'm interested in creating a database for two things: to get the language name from its code (for the etymology sections of the pages), and to show tables with all the data, like the one in Wiktionary:List_of_languages, in order to present the information in a clearer way than what we have at the moment on the site.
And since I don't know where the proper place to ask these kinds of things would be, and I saw you recently making some changes to the mentioned module, I'd like to ask you about the purpose of the sortkeys. I saw there are a lot of files that have tables with language data for each language, like the name, code, script, etc., but I can't see how the sortkeys come into action. I mean, I guess they must sort something, but I can't figure out what. In other words: what would happen if the sortkeys didn't exist? (A sketch of what they do follows below.)
Additionally, if you have any extra advice, like some specific detail you found that I should be aware of (besides the fixmes), I'd appreciate it if you could tell me. Thanks for your time. Tmagc (talk) 04:09, 28 December 2023 (UTC)
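Roughly speaking, a sortkey transformation maps a word to the form it should be alphabetised under in category listings; without sortkeys, words with diacritics would sort by raw codepoint, placing them after "z" rather than with their base letters. A minimal sketch, assuming a simple strip-the-diacritics rule (the real per-language data is much richer than this):

    -- Decompose, drop the combining marks U+0300-U+036F, and uppercase,
    -- so that e.g. "café" files under "CAFE".
    local U = mw.ustring.char
    local combining = "[" .. U(0x300) .. "-" .. U(0x36F) .. "]"

    local function makeSortKey(word)
        local decomposed = mw.ustring.toNFD(word)
        local stripped = mw.ustring.gsub(decomposed, combining, "")
        return mw.ustring.upper(stripped)
    end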
@Chuck Entz It's when the alt text doesn't start with *, and I've been correcting them on-and-off all day. About half of them are mistakes, and the other half are due to compounds where people (understandably) don't want to repeat the asterisk, but it's very easy to get around by doing something like {{l|ine-pro|*[[…]]-[[…]]}} or whatever. Theknightwho (talk) 23:10, 29 December 2023 (UTC)
Spacing before ";"
Latest comment: 10 months ago · 3 comments · 2 people in discussion
I use the wikitext leading semicolon to provide subheadings for Further reading etc. on many taxonomic entries for genera and higher taxa that have multiple taxa (usually just two). Without two newlines, the subheadings appear too close to the items before them. But recently a filter has been introduced that effectively prohibits two consecutive newlines. Is there a way to provide extra space before this kind of subheading? Is there another solution that fits WT:ELE? DCDuring (talk) 18:19, 31 December 2023 (UTC)
I thought you'd know. I don't remember where it is documented.
0. This is text with no following newline.
This is the result of a leading semicolon with no preceding newline.
1. This is text with one following newline.
This is the result of a leading semicolon following a single newline.
2. This is text with two following newlines.
This is the result of a leading semicolon following two newlines.
As you can see, there is no difference between no following newline and one following newline. The resulting space with two newlines is better, but excessive. Note in particular that there is less apparent space before the first "This ... no following newline" than after it. As the leading semicolon is used to make subordinate headings for the following text, the current result defeats the purpose, especially if two consecutive newlines are forbidden by a filter. DCDuring (talk) 04:07, 2 January 2024 (UTC)