This task involved a somewhat more complex transformation than my previous ones, with more error-prone elements, so I have exercised considerably more caution than before.
The syntax we were previously seeing in the {{ja-readings}} template was something like {{ja-readings|kun=かんばしい (kanbashii)<!-- Confirm -->|on=かん (kan), がん (gan)}}. The bot checks each of the reading parameters (namely "goon", "kanon", "kun", "on", "soon", "toon", "kanyoon" and "nanori") and removes both wikilinks and the parenthesized manual transliteration, given that the template is now able to produce these by itself with no user input. The problem is therefore 1) how to remove wikilinks, handling all cases, and 2) how to remove the manual romanization that users had entered, along with any excess whitespace left behind as a result.
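To make the intended transformation concrete, here is an illustrative before/after on a hypothetical on= value (the readings are my own example, chosen to show both the wikilink removal and the romanization removal):

    Before: {{ja-readings|on=[[かん]] (kan), [[がん]] (gan)}}
    After:  {{ja-readings|on=かん, がん}}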
The first problem, removing wikilinks, I solved with mwparserfromhell and the following code:
import mwparserfromhell


def convert_link_to_plaintext(link: mwparserfromhell.wikicode.Wikilink) -> str:
    # A piped link ([[title|text]]) is reduced to its display text; a link
    # with empty text ([[title|]], the "pipe trick") or no text at all
    # ([[title]]) is reduced to its title.
    if link.text is not None:
        if link.text == "":
            return str(link.title)
        return str(link.text)
    return str(link.title)


def links_to_plaintext(text: str) -> str:
    parsed: mwparserfromhell.wikicode.Wikicode = mwparserfromhell.parse(text)
    links = parsed.filter(forcetype=mwparserfromhell.wikicode.Wikilink)
    for link in links:
        plain = convert_link_to_plaintext(link)
        parsed.replace(link, plain)
    return str(parsed)
The idea is to turn, for example, a bare link like [[kan]] into kan, a piped link like [[kan|tou]] into its display text tou, and a pipe-trick link like [[kan|]] into kan. In fact, it would seem that only the first kind of link was ever used in this particular problem, so these edge cases caused no grief.
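As a quick sanity check, the three cases come out as expected; an illustrative snippet, assuming the functions defined above:

    # Illustrative checks of the three wikilink forms.
    assert links_to_plaintext("[[kan]]") == "kan"
    assert links_to_plaintext("[[kan|tou]]") == "tou"
    assert links_to_plaintext("[[kan|]]") == "kan"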
Secondly, there was the issue of removing the transliteration. I did this with the regular expression \(\w+?\), which matches a parenthesized run of word characters. Python's regex engine recognizes e.g. ō as a word character, so there is again no issue here. The only remaining problem is the need to strip leftover whitespace off the ends of the string, which is trivially done with Python's strip method.
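A minimal sketch of this step, assuming the substitution is applied with re.sub (the helper name is mine; only the regex and the final strip come from the description above):

    import re

    def strip_manual_romanization(reading: str) -> str:
        # Remove a parenthesized romanization such as "(kan)", then strip
        # whatever whitespace is left hanging at the ends.
        return re.sub(r"\(\w+?\)", "", reading).strip()

    strip_manual_romanization("かん (kan)")  # -> "かん"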
The real difficulty lay in a number of edge cases, which fortunately I had probed enough to anticipate, such as comments in the source code, as in the example above. My code splits the readings of each parameter on commas and re-joins them at the end (which also neatly normalizes readings that were not already comma-separated). If a value contained commented-out readings as well, however, this would ruin the replacement process, as the "readings" produced would have pieces of comment syntax stuck in them, so this needed to be handled: I decided to skip such cases entirely to avoid errors, as sketched below.
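Roughly, the per-parameter handling looks like the following (a sketch under my own naming; it reuses the two helpers above and mirrors the decision to leave commented values alone):

    from typing import Optional

    def clean_reading_value(value: str) -> Optional[str]:
        # Leave values containing commented-out readings untouched rather
        # than risk mangling them; the caller skips the page instead.
        if "<!--" in value:
            return None
        readings = [strip_manual_romanization(links_to_plaintext(part))
                    for part in value.split(",")]
        # Re-joining normalizes the comma separation of the readings.
        return ", ".join(r for r in readings if r)

    clean_reading_value("[[かん]] (kan),[[がん]] (gan)")  # -> "かん, がん"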
Finally, we also have entries in which the manual transliteration may have been more precise than the automatic one left behind after my bot has run: for the kanbashii example above, the automatic generation would produce kanbashī, as it has no way of knowing that the long ī should be written out as two separate i's.
As a result of the above two problem points, we are left with the following kanji pages to check: 糇糚馟幺匀隂龏潡凕夛趠弡妍悒嘻誗抖虗耏轇愪伃葰嶬噏塟譄諬垺陾鸕頇虣誩冪汒啒穧恾鐈韙竕媺愵盖蚍輨膩鈞幂惔僔涷謷磪嶃笿牔姁喛庰剦銊漻襾瑣緡晊嵂駫譞冐蘐啉緍僙姝婻欶蓁怳笰巸嚻蔎顒紘惂撇唫靕唿乢瘈奒陒誏黹豓輮塧劽旲椌唀勻頙姴餤葴偞瓴垨宑扚臰惙崒罄竌姇朎猤嘮錷苠汾陗阥詥奐穡姡鮧桍頡軭斊羏顗陋闟竨痟妎鼛顢饟紒擑汪帡葒墊暍掯鐯懆歵颸嘽菇婜厲咂悘窕嚍頯朩畃譆馗姷惄馛娤贍鷮溱磈宊擵踦歮傇虘棓揔飭啴薘藹琑姢坨縿媙鬒萯絿顑覈綂蛩迵逷姱氶熕幖梈磛啋懁甹敻麖抂鍯韷䮻軱嚾甧逺訞誷鏓賖娙皀枸瑮獝顓黈妗鉊屏麃詉纎趍愓葢鉼鬵礆廫傎帲峆桘 (254 in total).
Otherwise, I have so far seen no evidence of malfunction or damage, which is reassuring.