The following discussion has been moved from the page User talk:Jberkel.
This discussion is no longer live and is left here as an archive. Please do not modify this conversation, but feel free to discuss its conclusions.
This page shows conversations on my talkpage from 2024.
Hi, just a note to be careful when adding Catalan pronunciations. For example, you added a pronunciation of ê to esquetx, which is wrong (it should be é) and unlikely in any case, since ê generally only occurs with inheritances and some old borrowings, and esquetx is a recent borrowing from English. I have documented the sources of pronunciation in the documentation to {{ca-IPA}}; in particular, only trust the DCVB for Balearic pronunciations and don't trust cawikt at all. Benwing2 (talk) 02:34, 28 January 2024 (UTC)
Hi Jberkel, would you like to do another update of the statistics? Your last one already dates back to 1 July. Yes, I know it takes a lot of time and computing power, but I think we would all simply like to know again. :) Steinbach (talk) 17:18, 22 February 2024 (UTC)
Hi, I saw your posts complaining about the lack of HTML dumps as I had the same issue. I ended up creating my own HTML dump using the API to rapidly download millions of entries. I used the 20240220 XML dump as a base so that the two dumps would include exactly the same revisions. Note that the same wikitext can produce different HTML code at different points in time, so I can't guarantee that the page looks exactly as it did at the time of the XML dump.
Would you be interested in the code or the dump itself?
Ioaxxere (talk) 20:05, 22 February 2024 (UTC)
The script works by grabbing HTML data using a revision ID. For example: https://en.wiktionary.org/w/api.php?action=parse&oldid=65853771&format=json. I'm not sure what parser is used but it seems to correspond with "view page source" in my browser. Here is the code:
import concurrent.futures
import mmap
import re
from random import random
from time import sleep, time

import requests

BATCH_SIZE = 10000
HEADER = {"User-Agent": "User:Ioaxxere"}  # replace with your username

# tuned parameters
RATE_LIMIT = 80  # requests per second, across all threads
THREAD_COUNT = 100


def fetch_data(revid):
    """Fetch the parse result (raw JSON text) for one revision ID, retrying until it succeeds."""
    print(revid)
    while True:
        starttime = time()
        try:
            result = requests.get(
                f"https://en.wiktionary.org/w/api.php?action=parse&oldid={revid}&format=json",
                headers=HEADER,
            )
            if result.status_code == 200:  # OK
                break
            print("...error:", result.status_code)
        except requests.RequestException:
            print("...error: Connection failed")
        sleep(0.5 * (1 + random()))  # back off for 0.5 to 1 s before retrying
    # Pad each request to THREAD_COUNT / RATE_LIMIT seconds so that all
    # threads together stay at or below RATE_LIMIT requests per second.
    waittime = THREAD_COUNT / RATE_LIMIT - (time() - starttime)
    if waittime > 0:
        sleep(waittime)
    return result.text


def big_file_finditer(filename, pattern, flags=0):
    """Run a regex over a large file through mmap instead of reading it into memory."""
    compiled_pattern = re.compile(pattern.encode(), flags)
    with open(filename, "rb") as f:
        return compiled_pattern.finditer(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))


pages = ...  # list of revision IDs; the original expression is missing from this copy

for i in range(0, len(pages), BATCH_SIZE):
    queries = pages[i:i + BATCH_SIZE]
    with concurrent.futures.ThreadPoolExecutor(max_workers=THREAD_COUNT) as executor:
        output = executor.map(fetch_data, queries)
        output = "\n".join(output) + "\n"
    # replace with your output location
    with open(r"D:\wiktionarydumps\output.ndjson", "a", encoding="utf-8") as out:
        out.write(output)
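The pacing step in fetch_data can be checked with a little arithmetic. The following is only a sketch of that reasoning, and the helper names are made up for illustration: if every one of THREAD_COUNT workers spends at least THREAD_COUNT / RATE_LIMIT seconds per request, the workers together issue at most RATE_LIMIT requests per second.

```python
def min_seconds_per_request(thread_count: int, rate_limit: float) -> float:
    """Time each worker must spend per request so that all workers
    together stay at or below rate_limit requests per second."""
    return thread_count / rate_limit


def remaining_wait(thread_count: int, rate_limit: float, elapsed: float) -> float:
    """How long a worker should still sleep after a request took `elapsed` seconds."""
    return max(0.0, min_seconds_per_request(thread_count, rate_limit) - elapsed)


# With the tuned values from the script (100 threads, 80 requests/s):
print(min_seconds_per_request(100, 80))  # 1.25
```

So with these settings a request that returns in under 1.25 seconds is padded up to 1.25 seconds, and a slower one is not delayed further.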
Then I verified the output with this code:
import re

n = 0
with open(r"D:\wiktionarydumps\output.ndjson", "r", encoding="utf-8") as f:
    for line in f:
        n += 1
        if line.startswith('{"error":'):
            # print the line number followed by the API error code
            print(n, *re.findall(r'"code":"([^"]+)"', line))
Which produced:
43725 nosuchrevid
82006 nosuchrevid
106857 nosuchrevid
248730 nosuchrevid
319048 nosuchrevid
323556 nosuchrevid
330049 nosuchrevid
394498 nosuchrevid
437859 nosuchrevid
448121 nosuchrevid
561668 nosuchrevid
590865 nosuchrevid
603650 nosuchrevid
610405 nosuchrevid
720072 nosuchrevid
749333 nosuchrevid
808355 nosuchrevid
814281 nosuchrevid
822969 nosuchrevid
859557 nosuchrevid
1021390 nosuchrevid
1036457 nosuchrevid
1058296 nosuchrevid
1084837 nosuchrevid
1157698 nosuchrevid
1229978 nosuchrevid
1248685 nosuchrevid
1285246 nosuchrevid
1323983 nosuchrevid
1324915 nosuchrevid
1385186 nosuchrevid
1396962 nosuchrevid
1486775 nosuchrevid
1497989 nosuchrevid
1513303 nosuchrevid
1581275 nosuchrevid
1609470 nosuchrevid
1678410 nosuchrevid
1725167 nosuchrevid
1735366 nosuchrevid
1735744 nosuchrevid
1814983 nosuchrevid
1854120 nosuchrevid
1907407 nosuchrevid
1921876 nosuchrevid
1963831 nosuchrevid
2010212 nosuchrevid
2073363 nosuchrevid
2166069 nosuchrevid
2177988 nosuchrevid
2183914 nosuchrevid
2184460 nosuchrevid
2278457 nosuchrevid
2330349 nosuchrevid
2358375 nosuchrevid
2499758 nosuchrevid
2501157 nosuchrevid
2520901 nosuchrevid
2591419 nosuchrevid
2621251 nosuchrevid
2630284 nosuchrevid
2671770 nosuchrevid
2696918 nosuchrevid
2697777 nosuchrevid
2746586 nosuchrevid
2769872 nosuchrevid
2831640 nosuchrevid
2857869 nosuchrevid
2910282 nosuchrevid
2911183 nosuchrevid
2915318 nosuchrevid
2967304 nosuchrevid
3014563 nosuchrevid
3063851 nosuchrevid
3124420 nosuchrevid
3137890 nosuchrevid
3185708 nosuchrevid
3225411 nosuchrevid
3230226 nosuchrevid
3241060 nosuchrevid
3259739 nosuchrevid
3261952 nosuchrevid
3301323 nosuchrevid
3318285 nosuchrevid
3320219 nosuchrevid
3324414 nosuchrevid
3336037 nosuchrevid
3443783 nosuchrevid
3481014 nosuchrevid
3527574 nosuchrevid
3585227 nosuchrevid
3589765 nosuchrevid
3614305 nosuchrevid
3734605 nosuchrevid
3821927 nosuchrevid
3843626 nosuchrevid
3914931 nosuchrevid
3925139 nosuchrevid
4025930 nosuchrevid
4244319 nosuchrevid
4246017 nosuchrevid
4260112 nosuchrevid
4278061 nosuchrevid
4330469 nosuchrevid
4331657 nosuchrevid
4412350 nosuchrevid
4413758 nosuchrevid
4432652 nosuchrevid
4485019 nosuchrevid
4602733 nosuchrevid
4608289 nosuchrevid
4720573 nosuchrevid
4737790 nosuchrevid
4858538 nosuchrevid
4889458 nosuchrevid
4908594 nosuchrevid
4973122 nosuchrevid
5010716 nosuchrevid
5052814 nosuchrevid
5150511 nosuchrevid
5154623 nosuchrevid
5182578 nosuchrevid
5223840 nosuchrevid
5235533 nosuchrevid
5246229 nosuchrevid
5259002 nosuchrevid
5344233 nosuchrevid
5364980 nosuchrevid
5368363 nosuchrevid
5369738 nosuchrevid
5469778 nosuchrevid
5507943 nosuchrevid
5598277 nosuchrevid
5607802 nosuchrevid
5631256 nosuchrevid
5648406 nosuchrevid
5659237 nosuchrevid
5729700 nosuchrevid
5752778 nosuchrevid
5774071 nosuchrevid
5790022 nosuchrevid
5833505 nosuchrevid
5861520 nosuchrevid
5864017 nosuchrevid
5871030 nosuchrevid
5877754 nosuchrevid
5983008 nosuchrevid
6006358 nosuchrevid
6067067 nosuchrevid
6085428 nosuchrevid
6138076 nosuchrevid
6138136 nosuchrevid
6188278 nosuchrevid
6248831 nosuchrevid
6276367 nosuchrevid
6286098 nosuchrevid
6289698 nosuchrevid
6293458 nosuchrevid
6303351 nosuchrevid
6309621 nosuchrevid
6311475 nosuchrevid
6391744 nosuchrevid
6392577 nosuchrevid
6396159 nosuchrevid
6409595 nosuchrevid
6412793 nosuchrevid
6424036 nosuchrevid
6484785 nosuchrevid
6562806 nosuchrevid
6568126 nosuchrevid
6580802 nosuchrevid
6633849 nosuchrevid
6741033 nosuchrevid
6797937 nosuchrevid
6900647 nosuchrevid
6903671 nosuchrevid
6996408 nosuchrevid
6996487 nosuchrevid
7030860 nosuchrevid
7043778 nosuchrevid
7048043 nosuchrevid
7059900 nosuchrevid
7091062 nosuchrevid
7091425 nosuchrevid
7130255 nosuchrevid
7169063 nosuchrevid
7184906 nosuchrevid
7244549 nosuchrevid
7276644 nosuchrevid
7331248 nosuchrevid
7359021 nosuchrevid
7537357 nosuchrevid
7578135 nosuchrevid
7585843 nosuchrevid
7595812 nosuchrevid
7641806 nosuchrevid
7651915 nosuchrevid
7697219 nosuchrevid
7778037 nosuchrevid
7781476 nosuchrevid
7782612 nosuchrevid
7802193 nosuchrevid
7808302 nosuchrevid
7820909 nosuchrevid
7885180 nosuchrevid
7914802 nosuchrevid
These correspond with pages in the XML dump that have recently been deleted.
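As an alternative to the regex check, each line of the NDJSON file could be parsed as JSON. This is a minimal sketch, assuming that error responses have the shape {"error": {"code": ...}} (which is what the check above matches on); the function name and the sample lines are invented for illustration.

```python
import json


def find_api_errors(lines):
    """Yield (line_number, error_code) for every line that holds an API error object."""
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "error" in record:
            yield n, record["error"].get("code")


# Made-up sample lines standing in for the real output file.
sample = [
    '{"parse":{"title":"example","revid":1}}',
    '{"error":{"code":"nosuchrevid","info":"There is no revision with ID 2."}}',
]
for n, code in find_api_errors(sample):
    print(n, code)  # 2 nosuchrevid
```

For the real file, an open file object can be passed in place of the sample list, since it also iterates line by line.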
I don't have the time/resources to generate these on a regular basis, but you're welcome to adapt this code for your purposes!
Ioaxxere (talk) 19:56, 23 February 2024 (UTC)
Adding &parsoid=true to the API query gives *far* better data. Time to rerun... Ioaxxere (talk) 20:09, 23 February 2024 (UTC)
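For anyone adapting the script, the query string could be assembled with urllib rather than written by hand. This is a sketch only: the helper name and its use_parsoid switch are hypothetical, and the parsoid parameter is taken from the comment above.

```python
from urllib.parse import urlencode

API_URL = "https://en.wiktionary.org/w/api.php"


def parse_url(revid, use_parsoid=True):
    """Build the action=parse URL for one revision ID, optionally requesting Parsoid HTML."""
    params = {"action": "parse", "oldid": revid, "format": "json"}
    if use_parsoid:
        params["parsoid"] = "true"
    return f"{API_URL}?{urlencode(params)}"


print(parse_url(65853771))
# https://en.wiktionary.org/w/api.php?action=parse&oldid=65853771&format=json&parsoid=true
```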
I just discovered there are two unit testing frameworks here, Module:UnitTests used by everyone but you, and Module:ScribuntoUnit used by you. The former is older than the latter, so I'm not sure why you imported the latter from Wikipedia, but I think we should consolidate. Can you think about converting your unit tests to use Module:UnitTests? Benwing2 (talk) 20:34, 10 March 2024 (UTC)
Wwoww, Jberkel, you're fast. Wanted to cite the same Guardian passage here, and it was already there ... MistaPPPP (talk) 12:55, 19 March 2024 (UTC)
I also need to apologise to you for my simple edit to my archaic paragraph about certain 'etymologies that discredit Wiktionary': it completely disrupted the edit section, including yours. There should really be a mechanism in place to stop this from happening, since any innocent editor could well make a similar mistake which, if not detected as quickly as Surjection and I detected this one, could cause linguistic mayhem! Regards, Andrew Andrew H. Gray 11:40, 29 March 2024 (UTC)
What Doyle said was about this:
https://en.m.wiktionary.org/wiki/arse#English
Here, ass is another way of spelling arse (as in dumb). Lunatone3000 (talk) 22:24, 4 April 2024 (UTC)
You mentioned this in a beer parlour comment about "the reputation system, for good or ill".
The reputation system is for ill.
There are editors like me whose behavior is scrutinized. And people are willing to make inaccurate claims about how many or few productive edits I've made.
Then there are other editors who have almost no ability at all to get along with other editors or admit wrongdoing. But, because they're perceived as being essential to the project, it's unacceptable to question their opinions or behavior. Purplebackpack89 13:46, 5 June 2024 (UTC)
User:Jberkel/lists/wanted hasn't been updated for a while. Can we get it back, please? Denazz (talk) 22:28, 5 June 2024 (UTC)
Many of the various long lists on user subpages of yours seem to have served their purpose and/or to no longer be in active use. Also, the same term often appears on multiple subpages, differing only by when they were compiled. The result of this is that using "&sort=incoming_links_desc" in the searchbox to find entries relatively important to other Wiktionary entries does not give a good list. My user pages have had the same effect. I have consequently used <nowiki> to disable entire subpages. If you are too busy, let me know which pages are important (or what rule to follow to determine importance) so I could disable the right pages, if there are any. You are not the only one with such subpages, but yours are the ones I most notice. DCDuring (talk) 22:23, 17 July 2024 (UTC)
{{...}}
Hello, may I ask why you reverted me here? Regards, RodRabelo7 (talk) 20:15, 22 July 2024 (UTC)
A reminder that the Actualités du Wiktionnaire are still being published, but our announcement system was out of service. We apologise for the inconvenience.
A new issue of the Actualités du Wiktionnaire has just come out!
In this well-stocked summer issue: a press review and a list of videos to improve your sweaty siestas, plus three articles: a dictionary of co-occurrences presented by Trace, a discussion by Noé based on an article about the most looked-up words in dictionaries, and an explanation of enclisis by Àncilu. All wrapped up in topical illustrations.
Read issue 112, July 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 19:59, 14 August 2024 (UTC)
User:Jberkel/lists/Frequency links to v2
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html was released 3 years after your last generation, perhaps you might be interested in updating? Akaibu (talk) 03:53, 22 August 2024 (UTC)
Please don't leave etymologies like that, @Trooper57 maybe you can help? Stríðsdrengur (talk) 14:23, 28 August 2024 (UTC)
This summer issue is packed with news and briefs! The dictionary of the month is presented by Trace and covers expressions, while Noé discusses heritage and innovation in the Wiktionnaire. The illustrations come from the collection of a design museum!
Read issue 113, August 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 13:36, 1 September 2024 (UTC)
An issue featuring slang and the regional languages of France! In addition to the usual briefs, statistics and press review, there are two articles by Lyokoï and Noé, surrounded by illustrations of brick architecture!
Read issue 114, September 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 10:50, 1 October 2024 (UTC)
An issue under the auspices of Greek antiquity! Besides the traditional monthly press review, project news and statistics, there is an article on the evolution of artificial intelligence by Romainbehar and a presentation of the history of slang dictionaries by Lyokoï!
Read issue 115, October 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 10:21, 1 November 2024 (UTC)
Hi, thanks for making these "wanted terms" lists, they are really useful!
Is there any chance I could ask for some parameters to be tweaked? For example, I think this list (and presumably equivalent lists for other languages) would greatly benefit from having Wiktionary:Requested entries (Welsh), Appendix:Celtic word lists and Appendix:Word lists of languages of Europe able to "contribute" - which as far as I'm aware they currently don't?
Btw, if you could also create an equivalent list for Middle Welsh (wlm), I'd be a very happy editor.
Cheers Arafsymudwr (talk) 17:30, 2 November 2024 (UTC)
Lots of discussions this month, along with four columns! Three major releases: the ninth edition of the Dictionnaire de l’Académie française, the Dicothèque for browsing the entries of the dictionaries on Wikisource, and the Dictionnaire du chilleur by Jérôme 50. Framed by reviews of the month's editing drives, a report on the Wikiconvention francophone makes this record-length issue thicker still!
Read issue 116, November 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 13:20, 1 December 2024 (UTC)
There is a group of editors from the Portuguese part of Wiktionary called the Aliança Galego-Portuguesa. I am a member, and I would like to invite you to join too, so that you can work with us on improving the project! Join our Discord server to have a look.
Warm regards, Polomo47 (talk) 04:54, 6 December 2024 (UTC)
The external links on your lists appear to not be working anymore T-T, not a single one. Any chance you could take a look at them? Maybe do a new dump for this month? MedK1 (talk) 23:31, 12 December 2024 (UTC)
Sorry for my rude response, now I realize that I was wrong to you :( Stríðsdrengur (talk) 15:53, 21 December 2024 (UTC)
Could you rerun this code? Just once would be good; it doesn't need to be regular. Was thinking the improvements Wikipedia has seen in the past 9 years could expand the list plenty. Polomo47 (talk) 22:01, 28 December 2024 (UTC)