This page contains technical documentation explaining how the Todo Lists project works.
The Todo Lists project runs on Toolforge. The tool name is wikt-todo.
If you are a maintainer of the project, you can manage the tool by logging into a Toolforge shell using ssh (see the quick start guide) and typing become wikt-todo. All shell commands on this page will only work if you have "become" the tool account.
The code for the Todo Lists project lives at https://gitlab.wikimedia.org/toolforge-repos/wikt-todo.
On Toolforge, there is a copy of this repo in the src directory under the wikt-todo tool account's home directory.
If new commits have been made to the repo, you can update the copy of the repo on Toolforge by running:
cd src
git reset --hard  # erases local changes from any mucking around or testing
git pull origin
Automatic generation of the SQL and custom todo lists is currently scheduled to take place every week, at a time determined by Toolforge. The XML dump todo list script is scheduled to run every day, but the script does nothing unless a new dump file is identified.
Job scheduling is defined in the jobs.yaml file. The format is documented here.
If you make changes to the jobs.yaml file, you must reload it using:
toolforge-jobs load src/jobs.yaml
If you get a failure email regarding a scheduled tool run, inspect the generate-lists-type.err log file in the tool's directory.
If the lists are silently failing to be generated, and upon inspecting the generate-lists-type.err log file you see the word "Killed", it's likely that the job exceeded the memory limit imposed by Kubernetes. You can verify this by looking at the memory graphs on the Grafana dashboard.
The default per-job memory limit is 512 MiB. This should be more than sufficient for most purposes. If a job is running out of memory, the likely cause is a buggy script that is trying to include almost every page on Wiktionary in its result set. You can run a job manually with increased memory by appending, say, --mem 3Gi (for 3 GiB) to the toolforge-jobs run command line below. This should allow the job to complete, and you will hopefully be able to work out what is going wrong by looking at the resulting todo list.
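The check for "Killed" described above is easy to script. The following is only a sketch: the function names are made up for illustration and are not part of the project, and it assumes the .err file is plain text.

```python
from pathlib import Path


def looks_oom_killed(log_text: str) -> bool:
    """Return True if any line of the log contains the word "Killed"."""
    return any("Killed" in line for line in log_text.splitlines())


def check_log(path: str) -> bool:
    """Read a job's .err file (hypothetical helper) and report a likely memory kill."""
    # errors="replace" keeps the check working even if the log has stray bytes.
    return looks_oom_killed(Path(path).read_text(errors="replace"))
```

A True result is only a hint; the memory graphs remain the way to confirm that the limit was actually hit.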
SQL-based todo list pages have an "Update now" button. Clicking it takes the user to https://wikt-todo.toolforge.org/updater/update/Todo list name. This page is served by the Toolforge web service documented at wikitech:Help:Toolforge/Web/Python.
The web server code is in updater_webapp.py. This is symlinked from $HOME/www/python/src/app.py, the mandatory location for Python web applications. Private configuration parameters are in $HOME/www/python/src/config.yaml.
If the website goes down, the web service can be started or restarted using:
webservice restart
Logs for the web service itself are in uwsgi.log, while logs for the generate-lists-* script itself are in ad-hoc-updates.log.
To run a single todo list on an ad hoc basis, run the following command, where type is one of sql, xmldump or custom, and Todo list name is the exact name of the desired todo list:
toolforge-jobs run mytodo --command "~/pyvenv/bin/python src/generate_lists_type.py 'Todo list name'" --image python3.11
Keep an eye on the status of the job using:
toolforge-jobs list
Once the job finishes, it will no longer be present in the list. Any error output will be saved to mytodo.err in the tool account's home directory - run cat mytodo.err to view it. Regular print statement output will be saved to mytodo.out. Consider deleting these output files once done to keep the tool directory clean.
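If you run ad hoc jobs often, the command line above can be assembled programmatically. This is a sketch under assumptions: build_run_command is a hypothetical helper, not something in the repo, and it only reuses the flags that appear on this page (--command, --image and, optionally, --mem).

```python
import shlex


def build_run_command(list_type, list_name, job_name="mytodo", mem=None):
    """Build the argument list for the `toolforge-jobs run` command shown above."""
    if list_type not in ("sql", "xmldump", "custom"):
        raise ValueError(f"unknown todo list type: {list_type!r}")
    # shlex.quote wraps the list name in single quotes so that spaces survive.
    inner = (
        f"~/pyvenv/bin/python src/generate_lists_{list_type}.py "
        f"{shlex.quote(list_name)}"
    )
    argv = ["toolforge-jobs", "run", job_name,
            "--command", inner, "--image", "python3.11"]
    if mem is not None:
        argv += ["--mem", mem]  # e.g. "3Gi" for a memory-hungry run
    return argv
```

The returned list can be passed to subprocess.run() from the tool account; because it is a list, the todo list name needs no further quoting for the local shell.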
Currently there are three ways to generate todo lists: SQL, XML dump and custom. It is intended to add HTML dump as a fourth type of todo list at some stage in the future.
SQL-based todo lists are generated by simply running an SQL query against Wiktionary's database (or, more precisely, the read-only database replicas available through Toolforge) and formatting the output.
SQL is the best option for any todo list that does not require analysis of page content (wikitext). Anything relating to:
can be achieved using pure SQL.
SQL queries run very quickly if written well. Most of the SQL-based todo lists take just a few seconds to generate. However, it can be challenging to write an SQL query that is both correct and fast, especially if you do not have much experience with SQL. The MediaWiki relational database structure diagram and list of special views on Toolforge are essential references.
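To give a feel for the kind of query involved, here is a hypothetical entry written as a Python dictionary. The dictionary name, the todo list name and the entry's shape are all assumptions; the real queries dictionary may store more than a bare SQL string. The table and column names come from the standard MediaWiki page table, and the column aliases follow the formatting-code convention described further down this page.

```python
# Hypothetical example: the real structure of the `queries` dictionary may
# differ (for instance, entries may carry extra metadata besides the SQL).
EXAMPLE_QUERIES = {
    "Short main-namespace pages": """
        SELECT CONCAT(page_namespace, '|', page_title) AS Page_NSTITLE,
               page_len AS Length_RAW
        FROM page
        WHERE page_namespace = 0
          AND page_is_redirect = 0
          AND page_len < 50
        ORDER BY page_len
        LIMIT 500
    """,
}
```

Filtering on indexed columns such as page_namespace and capping the result with LIMIT are the usual first steps towards keeping a query like this fast.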
SQL todo lists are defined in sql/queries.py. To define a new SQL todo list, simply add a new entry to the queries dictionary.
Use the Toolforge SQL command line (sql enwiktionary) to test it.
XML dump-based todo lists are generated by iterating through every line of wikitext (page source code) on every page of the latest Wiktionary XML dump and running Python code to compile the resulting todo list.
These todo lists are a good choice for detecting common misuses of templates or wikitext formatting. The XML dump contains the wikitext of the latest revision of each page, along with the page ID, namespace and title, but little else.
An abstraction layer is provided so that the Python code for each todo list does not need to parse XML. For convenience, the abstraction layer keeps track of a hierarchy of section headings and, in some cases, supplies a parsed version of the line of wikitext alongside the wikitext itself. The Python script is free to run SQL queries or API requests as required.
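The idea of tracking the heading hierarchy while walking through a page's wikitext can be sketched as follows. This is not the project's abstraction layer, whose interface is not shown on this page; lines_with_headings and HEADING_RE are illustrative names, and the regular expression only handles plain ==Heading== lines.

```python
import re

# One to six "=" signs, a title, then the same number of "=" signs.
HEADING_RE = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


def lines_with_headings(wikitext):
    """Yield (heading_path, line) for each non-heading line of a page's wikitext."""
    stack = []  # (level, title) pairs for the headings currently in scope
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
            continue
        yield tuple(title for _, title in stack), line


page = "==English==\n===Noun===\n# a definition\n==French==\n# une définition"
for path, line in lines_with_headings(page):
    print(path, line)
```

With the heading path to hand, a todo list script can, for example, restrict a check to lines that fall under a particular language section.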
XML dump todo lists are defined in the xmldump/ directory. To define a new todo list, make a copy of the !template.py file in the relevant directory and write your code.
To test, download a pages-articles or pages-meta-current XML dump and run the script locally:
python3 generate_lists_xmldump.py 'Todo list name' --dry-run --file /path/to/xml-dump.bz2
The dump file can be either .bz2 or .xml.
Custom todo lists simply involve running a Python script that returns a table of results. In practice, these todo lists typically run an SQL query, then perform some kind of post-processing on the query results to generate the todo list.
Custom todo lists are defined in the custom/ directory.
To test, add the --dry-run parameter to the Python command. Output will be appended to mytodo.out, so make sure to delete that file before running the todo list.
The Wikimedia Enterprise HTML dumps are convenient for certain kinds of analysis where the fully rendered page is required. These dumps, which contain page wikitext and categorisation information alongside the page HTML, are a superset of the XML dump for main namespace pages. However, Wikimedia Enterprise only generates HTML dumps for select namespaces; our custom namespaces, such as Reconstruction, are not included.
The output of each todo list is a list of dictionaries. The SQL generator converts the SQL query result set into this format. For Python-based todo lists, the code must return a list of dictionaries, where every dictionary in the list has exactly the same keys. The column names/key names need to adhere to a special format, as explained below.
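For Python-based todo lists, it can be worth checking this requirement before returning. The helper below is a sketch only; check_rows is a hypothetical name and the project may validate its output differently, or not at all.

```python
def check_rows(rows):
    """Raise ValueError unless `rows` is a list of dicts that all share the same keys."""
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("output must be a list of dictionaries")
    if rows:
        expected = set(rows[0])
        for index, row in enumerate(rows):
            if set(row) != expected:
                raise ValueError(
                    f"row {index} has keys {sorted(row)}, expected {sorted(expected)}"
                )
    return rows
```

Calling this at the end of a script, for example return check_rows(results), turns a malformed table into an immediate and readable error.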
This data is converted to wikitext, using a sortable table format if more than one key is present in the dictionary besides the optional SECTIONHEADING, or a bulleted list format otherwise.
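The rule for choosing between the two layouts can be expressed in a few lines. This is an illustration, not the project's code; choose_layout is a made-up name, and treating an empty result as a list is an assumption.

```python
def choose_layout(rows):
    """Return "table" or "list" following the rule described above."""
    if not rows:
        return "list"  # assumption: nothing to tabulate in an empty result
    # SECTIONHEADING only controls sectioning, so it does not count as a column.
    data_keys = [key for key in rows[0] if key != "SECTIONHEADING"]
    return "table" if len(data_keys) > 1 else "list"
```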
The todo list output can optionally be divided into sections (level 4 headers). This is achieved by adding a special SECTIONHEADING column (key) to the output.
It is critical that the output is sorted first by this section heading key! Otherwise the section headings will be uselessly repeated in random parts of the list.
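In an SQL query this means putting the section heading column first in the ORDER BY clause. In a Python-based todo list, the equivalent could look like the sketch below; group_by_section is an illustrative helper, not part of the project.

```python
from itertools import groupby


def group_by_section(rows):
    """Sort rows by SECTIONHEADING and return [(heading, rows_in_section), ...]."""
    def heading(row):
        return row["SECTIONHEADING"]

    # sorted() is stable, so rows keep their relative order within a section.
    ordered = sorted(rows, key=heading)
    # groupby only merges adjacent rows, which is why the sort must come first.
    return [(key, list(group)) for key, group in groupby(ordered, key=heading)]
```

Without the sort, groupby would emit a separate group every time the heading changed, which is exactly the repeated-heading problem described above.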
Every column name (key name) except SECTIONHEADING must contain at least one formatting code. These are written in ALL CAPS and placed after the displayed column name, set off by underscores, for example, Creation date_RAW or Page_NSTITLE_EDITLINK.
For convenience in SQL queries, underscores will be replaced by spaces in the column name itself, so Creation_date_RAW would work too.
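One way to read such a column name is to peel the ALL CAPS codes off the end and treat whatever remains as the displayed name. The function below is a sketch of that idea only; split_column_name is a hypothetical name, and the real parsing may behave differently, for example for a displayed name that itself ends in an ALL CAPS word.

```python
def split_column_name(column):
    """Split e.g. "Page_NSTITLE_EDITLINK" into ("Page", ["NSTITLE", "EDITLINK"])."""
    if column == "SECTIONHEADING":
        return column, []
    parts = column.split("_")
    codes = []
    # Peel ALL-CAPS tokens off the end, always leaving one token for the name.
    while len(parts) > 1 and parts[-1].isalpha() and parts[-1].isupper():
        codes.insert(0, parts.pop())
    if not codes:
        raise ValueError(f"column {column!r} has no formatting code")
    # Remaining underscores become spaces in the displayed name.
    return " ".join(parts), codes
```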
The formatting codes are defined in the format_data_value function in output_formatting.py. Here is a summary:
pageTitle parameter already includes the namespace name.
0|dictionary, 3|This,_that_and_the_other or 4|Requests for verification, made up of the namespace number, a pipe character, and the page title (either underscores or spaces are fine).
CONCAT(page_namespace, '|', page_title) or equivalent.
{{subst:NS:...}}, or templates, like {{subst:langname|...}}, into the output.
<nowiki> tags.
<code><nowiki> tags.
<code><nowiki> tags, keeping only the first 50 characters and adding ... if more characters are present.
<code><nowiki> tags, keeping only the first 100 characters and adding ... if more characters are present.
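Two of the behaviours summarised above, the namespace|title value format and the truncated <code><nowiki> wrapping, can be illustrated as follows. These helpers are sketches with made-up names; they are not the code in format_data_value, and they do not handle text that itself contains a closing nowiki tag.

```python
def parse_ns_title(value):
    """Split "3|This,_that_and_the_other" into (3, "This, that and the other")."""
    # Split on the first pipe only, in case the title itself contains one.
    namespace, title = value.split("|", 1)
    return int(namespace), title.replace("_", " ")


def code_nowiki(text, limit=None):
    """Wrap text in <code><nowiki> tags, optionally truncating to `limit` characters."""
    if limit is not None and len(text) > limit:
        text = text[:limit] + "..."
    return f"<code><nowiki>{text}</nowiki></code>"
```

For example, code_nowiki(line, limit=50) corresponds to the 50-character variant and code_nowiki(line, limit=100) to the 100-character one.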