User:Kiril kovachev/Bulgarian project endgame

Goals

Originally, this page was nothing more than an introspective essay musing over how long it would take to bring the Bulgarian project on Wiktionary to full maturity. That was partly heartening, because it showed that, with slightly tempered expectations, we can still expect our coverage and quality to progress splendidly in just a few years. I have since found that we can aim for something better: a concrete, live quantification of these goals, tracking the current state of progress with e.g. a program/bot and measuring our goals numerically using statistics from Wiktionary itself. Here are some of those goals.

Desired lemmas: 13,091 / 200,000
(Current total entries: 47,614)
Terms with IPA: 42,874 / 47,614
Terms with audio pronunciations: 0 / 47,614
Pages with usage examples: 2,022 / 13,091*
Pages with quotations: 99 / 13,091*
Desired topic categories: 769 / 2,000
Given names: 159 / 5,000
Surnames: 464 / 5,000

Requests:

Requests for attention: 113
Requests for audio pronunciation: 44
Requests for expansion of etymologies: 37
Requests for gender: 0
Requests for IPA/stress: 1
Requests for translation review: 284
Requests for translation into Bulgarian: 564
Requests for translation of usage examples or quotations: 16

* Note: assuming that only lemmas have usage examples or quotes added to them. It is technically possible for this count to overflow the maximum, since adding quotes to every lemma page would satisfy the total, but there could be further quotes/usage examples on non-lemmas as well.
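As a rough sketch of how such tracking could be automated, the snippet below turns the figures listed above into completion percentages. The names `goals` and `progress` are only illustrative, and the numbers are typed in by hand here; a real bot would have to fetch them from Wiktionary's own statistics, which this sketch does not attempt.

```python
# Figures copied by hand from the goals list above: (current, target).
goals = {
    "Desired lemmas": (13_091, 200_000),
    "Terms with IPA": (42_874, 47_614),
    "Terms with audio pronunciations": (0, 47_614),
    "Pages with usage examples": (2_022, 13_091),
    "Pages with quotations": (99, 13_091),
    "Desired topic categories": (769, 2_000),
    "Given names": (159, 5_000),
    "Surnames": (464, 5_000),
}


def progress(current: int, target: int) -> float:
    """Return completion as a percentage, capped at 100.

    The cap covers the overflow case mentioned in the note above,
    where non-lemma pages push a count past its target.
    """
    return min(100.0, 100.0 * current / target)


report = {name: round(progress(cur, tgt), 1) for name, (cur, tgt) in goals.items()}

for name, pct in report.items():
    print(f"{name}: {pct}%")
```

Keeping each goal as a (current, target) pair means a new goal is one extra line, and the same function handles all of them.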

TODO: wikilink the above and propose methods to close the gaps.

An estimate of man-time

I wonder every day just how many human lifetimes we would need to make Wiktionary a perfect dictionary for Bulgarian (and indeed for other languages, too), but it struck me that quantifying this, in fact, was possible. This page is an effort to calculate the total human labor time that we'll need to finish off different key attributes of the Bulgarian project, with specific figures given for the time until the overall attainment of different stages of completion.

I'll also make references to @Chernorizets's goals of the Bulgarian Lemma Improvement Project, which will provide an excellent standard for what we should strive to achieve!

Also, bear in mind that the figures are fairly arbitrary, so there could well be better estimates for some of them. Please let me know if you think there's a figure to correct.

Words to include

General lemmas. According to this article (citing Institute for Bulgarian Language), there are 200,000 words in the Bulgarian language. Interpreting this as the number of overall lemmas (noting that RBE already codifies ~120,000 as of the release of volume 15, with the voluminous letters С and Т yet to be printed), this gives us a lower-bound figure for the number of lemmas we need to cover at the fullest extent of the language.

Why lower-bound?

Because, in fact, there are many words that the dictionaries do not cover, and likely won't do any time soon, e.g. neologisms (some are covered at Neolex, but undoubtedly not all), transparent derivations like relational adjectives, and rare words that aren't always recorded in every dictionary. Nevertheless, it's unlikely the number of core words we would end up covering would be substantially higher than 200,000, in my estimation.

Names. Речник на личните и фамилни имена у българите (Rečnik na ličnite i familni imena u bǎlgarite) covers the given and family names used in Bulgaria, and probably contains around 30,000 entries (it may well state the number itself, but by my estimate: ~60 names per page over 532 pages of content gives 31,920 names in total). The entries are largely etymologically explained, and the dictionary covers pretty much the whole gamut of names used by people in Bulgaria, but not all names that can be used in the Bulgarian language. As such, we may end up having more like 40,000 actual names if we were to include a number of foreign-language names with significant Bulgarian usage.

Biology and taxonomy. We have many unique and interesting names for wildlife in Bulgaria, and these are so diverse and various that covering them all would be a tremendous undertaking, but still altogether possible, especially compared to the other areas of improvement. This botanical dictionary covers the widespread names of plants in Bulgarian, translating to and from Latin in two sections (like a bilingual dictionary). Another page-count estimate (557-315 = 242 pages, times ~30 names per page) from the Bulgarian-Latin half of the dictionary gives around 7,260 species of plant to name. The number of Wikidata entries with animals named in Bulgarian is supposedly 525, but somehow I find that dubious (there are 131,045 results in English) and there are likely thousands more of those as well; let us estimate 5,000 animal names, because admittedly there are more different species of plant out there.
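The page-count estimates for names and plants above follow the same pattern, so they can be reproduced in a few lines. This is only a sketch: `estimate_entries` is an illustrative helper, and the per-page averages are the rough guesses from the text, not counted values.

```python
def estimate_entries(pages: int, per_page: int) -> int:
    """Rough entry count: content pages times an assumed average per page."""
    return pages * per_page


# Names dictionary: ~60 names per page over 532 pages of content.
names = estimate_entries(pages=532, per_page=60)

# Botanical dictionary, Bulgarian-Latin half: pages 315 to 557, ~30 names per page.
plants = estimate_entries(pages=557 - 315, per_page=30)

# Animal names: a plain guess from the text, not derived from any page count.
animals = 5_000

print(f"names:  {names:,}")
print(f"plants: {plants:,}")
print(f"taxa (plants + animals): {plants + animals:,}")
```

The taxa figure comes out slightly above the rounded 12,000 used in the summary list.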

Scientific and technical terms. This is where we venture into rather more speculative territory, because this concerns itself mostly with Latinisms and Bulgarian glosses of technical terms that originate mostly in other languages; these are the kind that aren't featured in any dictionary, and even the English project would probably struggle to list them all. For example, the botanical dictionary uses фитолингвистика (fitolingvistika), which in English would be phytolinguistics. Indeed, there are some results for both of these, but consider how many endless combinations of Latinate roots there are that could theoretically pass the criteria for inclusion. 3 citations spanning several years? Certainly, with such lax criteria, really anything goes, and reading obscure books to even know what words to include is a probable starting point. I don't expect to figure in the whole range of possible terms in this category, but let's put it at a conservative 10,000 (counting also the more widespread technical language, which is poorly represented in more generalist dictionaries, even for English). In this class I also include obscure terms like агитгрупа (agitgrupa), which may well pass CFI, but are exceptionally rare in actual works and quasi-ad-hoc. (This example was lucky to be featured in a dictionary, {{R:bg:BTR}}.)

Old and dialectal spellings. These are a relatively late-game feature to want to cover, but we already include these kinds of entries to some degree, so I will also comment on them here: we have sources such as old books, {{R:bg:RBE}} (which often states alternative forms), and {{R:bg:Gerov}}. Assuming that 10% of words looked different prior to the 1945 orthography update, and 20% had a yet older spelling, this would add a good few tens of thousands of entries on top of those mentioned above.

To summarize my estimate for the number of words we will cover:

200,000 base lemmas
10,000 rarer, uncovered words, some modern
40,000 names
12,000 taxa
10,000 scientific jargon
75,000 alternative spellings of the above

For a total of 347,000 lemmas. Now imagine the declined forms!

(I won't cover these here, because if we wanted them, we could just generate 99.9% of them via bot anyway.)
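The tally is easy to redo whenever one of the estimates changes. A minimal sketch, using only the six figures in the list above (the dictionary keys are just shorthand labels of mine):

```python
# The six category estimates from the summary list above.
estimates = {
    "base lemmas": 200_000,
    "rarer, uncovered words": 10_000,
    "names": 40_000,
    "taxa": 12_000,
    "scientific jargon": 10_000,
    "alternative spellings": 75_000,
}

total = sum(estimates.values())

# Show each category's share of the whole alongside the grand total.
for label, count in estimates.items():
    print(f"{label}: {count:,} ({100 * count / total:.1f}%)")
print(f"total: {total:,}")
```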

Areas of improvement and their time estimates

Adding etymologies. Not all entries' etymologies are well-known, but we have numerous resources such as {{R:bg:RBE}} and {{R:bg:BER}} for referencing them. For simple words this can be quick, e.g. taking 5 minutes, but for more complex words, noting the exact origin and tracking down cognates, etc., may take up to 15. Let's assume around 7 minutes per etymology, with 1.2 etymologies (homographs) per spelling.

Adding audio. Each audio file is no more than around 2 seconds in length, but checking them afterwards may make it take a bit longer. I can personally produce maybe 300/hour. Note, however, that audio does not need to apply to historical spellings.

Pronunciation updates. As of today, we still have some overhauls to make to the pronunciation module, including finalizing {{bg-pr}}. I estimate another 40 hours of cumulative work, although this may be a bit pessimistic. However, this will discount any other pronunciation-related time sinks, because this will handle rhymes, IPA, hyphenation, and audio, so much work will be automated in this way. The 40 hours also includes a bot to replace existing uses of the old templates with the unified one.

Referencing. Usually this will only take a few minutes, considering you can just copy+paste {{R:bg:RBE}} and {{R:bg:RBE2}}, and then potentially add {{R:bg:BER}}, or whatever other template is pertinent. Let's say 1 minute for the copy+paste, plus another 2 minutes 50% of the time when you need to open some other tab or program to see the reference, for 2 minutes per entry on average.

Translations. The typical translation doesn't need to take very long, often just copy+pasting the term from a Bulgarian entry to its corresponding English entry. However, sometimes an English entry won't have an existing translation table, and in this case, it would take a bit longer, and other times, it will have some tables, but none that make sense for the Bulgarian entry. I say that around 30% of the time, either of these hitches will be incurred, and the cost of this would be about 5-10 minutes. Adding a translation with no hitch would be just 1 minute or even less, but let's round up. With this, the average time would be maybe 2.5 minutes per translation. Also assume that there are 1.5 translations per Bulgarian lemma.
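The averaging above can be checked with a small weighted-mean sketch. I am assuming here that the 5-10 minutes is the total time for a translation that hits a hitch (rather than time added on top of the usual minute); `expected_minutes` is an illustrative helper name.

```python
def expected_minutes(p_hitch: float, hitch_minutes: float,
                     smooth_minutes: float = 1.0) -> float:
    """Average time per translation when a share p_hitch of them hit a hitch."""
    return p_hitch * hitch_minutes + (1 - p_hitch) * smooth_minutes


# 30% of translations hit a hitch costing between 5 and 10 minutes.
low = expected_minutes(0.30, 5)
high = expected_minutes(0.30, 10)

# Per-lemma cost using the rounded 2.5-minute average and 1.5 translations per lemma.
per_lemma = 2.5 * 1.5

print(f"average per translation: {low:.1f} to {high:.1f} minutes")
print(f"per lemma: {per_lemma:.2f} minutes")
```

Under that assumption, the 2.5-minute figure sits towards the lower end of the range, so it is on the optimistic side.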

Interlinks to other languages. Like translations, Bulgarian sometimes appears as a descendant or ancestor of another term in a different language, in which case we should spend around 1-2 minutes adding this. In the case of etymologies in a foreign language, it may end up being a few minutes more because of having to look at sources, but let's say 1.5 minutes per descendant/ancestor, with 30,000 descendants and 500 ancestors (arbitrary figures).

Quotes and usage examples. In order to add quotations, a good deal of time is required to put together all the fields, such as date/year, author, full text, our own translation, text title, and potentially others. I would say 6 or so minutes per quote (although advanced quoters might find it quicker), with at least one quote desired per sense. Let the average Bulgarian entry have 3 senses for each of the aforementioned 1.2 homographs per spelling, so 3.6 senses on average. Usage examples, which are in the same vein, may also be provided; the good thing is that they can be made up on the spot, so let's say they take only 3 minutes instead.
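Put together, these figures give a per-spelling cost, sketched below. The constants come from the paragraph above; the one thing I am adding is the assumption that each sense gets exactly one quote and one usage example.

```python
SENSES_PER_HOMOGRAPH = 3
HOMOGRAPHS_PER_SPELLING = 1.2
MINUTES_PER_QUOTE = 6
MINUTES_PER_USAGE_EXAMPLE = 3

# Average number of senses on one page (spelling).
senses = SENSES_PER_HOMOGRAPH * HOMOGRAPHS_PER_SPELLING

# Assumption: exactly one quote and one usage example per sense.
quote_minutes = senses * MINUTES_PER_QUOTE
usex_minutes = senses * MINUTES_PER_USAGE_EXAMPLE

print(f"senses per spelling: {senses:.1f}")
print(f"quotes: {quote_minutes:.1f} min, usage examples: {usex_minutes:.1f} min")
```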

Collocations. Not hard: maybe 0.5 minutes per collocation. Let one sense per entry have common collocations, with 2-4 each, so around 2 minutes max per entry. Let also only 5% of entries have collocations at all.

Categories and topics. This is hard for me to gauge based on my lack of experience, but using HotCat, this seems to be a relatively short process. The only difficulty would be knowing what category to add for a given entry, which may involve studying the existing categories in use for English. 1.5 minutes per entry.

Derived and related terms. Many entries will be directly related to other terms, which can then be linked. Some of these also have a lot of relations, so can involve copy-pasting quite a few links between pages. As BLIP suggests, we should minimize the duplication where possible, but let's imagine that most entries will be related to at least 1 other term, and that making this relationship will take 0.5 minutes per relation, with on average 4 relations per entry (be that related or derived terms). See-also may be pertinent, but let's exclude that, arriving at 2 minutes per entry. By "most", let's fix 80% of entries as being related to some other term.

Images. In the case where appropriate images exist, it should only take a few minutes to add them, maybe 5. In the case where it's hard to find one on Commons, or there just isn't one, we can spare the time calculation and just assume it'll happen at a later date. Then, not every entry will warrant an image, because many are abstract nouns or actions; take 5% of entries as benefitting from pictures.

Table templates. Translating these from English may take a good deal of time, especially for ones with more obscure character. Let's put it at 15 minutes per table, which may actually be quite optimistic.

Alternate spellings. As listed above, there may end up being quite a lot of these, but creating them should not be that hard, e.g. 1 minute for listing an alternative spelling on the main lemma, and another minute for creating the alternative form itself. Assume each word will have 1 alternative form (averaging out those that have more than 1 and those that have always been the same).

Creating actual content entries. This encompasses all the categories of word I mentioned in the previous section (spelling aside), and is the majority of the work that we have to do. I assume that each part of speech will take quite different amounts of time to make, with verbs occupying the 1st place for longest creation time.
- Verbs: 15 minutes per verb. Some take a very long time because they have many definitions, remembering and checking the inflection classes takes some time, finding adequate translations for the senses takes time, etc.
- Adjectives: 5-10 minutes per adjective; the majority of adjectives are not as complicated as verbs, since verbs usually have the most distinct meanings of any word. Adjectives are usually easier to translate into English, have simpler declensions, etc.
- Nouns: 5 minutes per. The easiest part of speech.
- Proper nouns: like nouns, these are not so hard, and may in fact be easier than nouns, since they usually only refer to one thing or one personal/family name. Nevertheless, with the template use being a little different, let's say it's 5 minutes per as well.

Indeed, there are other parts of speech, but these are the main categories. We are unlikely to spend much time creating new pronouns, interjections, etc., but admittedly there could be some significant time that I'm not accounting for here. I'm going to also assume a breakdown of 10% verbs, 20% adjectives, and 70% (proper) nouns.
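With that breakdown, the average creation time per entry is a weighted mean of the per-part-of-speech times. A minimal sketch, where taking 7.5 minutes for adjectives (the midpoint of the 5-10 range) is my own assumption:

```python
# Minutes to create one entry, by part of speech.
# Adjectives: midpoint of the 5-10 minute range (an assumption of this sketch).
minutes = {"verb": 15.0, "adjective": 7.5, "noun": 5.0}

# Assumed share of each part of speech; "noun" includes proper nouns.
share = {"verb": 0.10, "adjective": 0.20, "noun": 0.70}

# Weighted mean: each time multiplied by how often that part of speech occurs.
weighted = sum(minutes[pos] * share[pos] for pos in minutes)

print(f"average creation time per entry: {weighted:.1f} minutes")
```

Verbs contribute as much to the average as adjectives despite being half as common, because each one takes twice as long.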

Summary

The full time I'm positing here is:

15,128,150 minutes for all other things;
13,000,000 minutes for creating the core entries with definitions;

...making, at last, for 28,128,150 man-minutes, or 468,802.5 man-hours of time. In conversion, that's around 53.5 man-years of real time, which equates to... well, a damn lot of work! The good news is that we can accomplish incredible feats of completion and richness (quality over quantity) without coming anywhere near this level of perfection, and I would guess the work required for a broad, high-coverage dictionary of excellent quality would be an order of magnitude less. With efficient work, i.e. us all getting used to working with every aspect of the editing process, the number would go down further. Moreover, given that we're already well on the way to that goal, with an existing 11,000 lemmas, the original figure would drop to ~50 years; hence in 5 years of good work, we can surely look back with pride at the beautiful products of our effort.
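The unit conversion can be reproduced as follows. Note that "real time" here means round-the-clock years of 24 × 365 hours, which is the assumption that yields 53.5; the variable names are only illustrative.

```python
other_minutes = 15_128_150   # everything except core entry creation
core_minutes = 13_000_000    # creating the core entries with definitions

total_minutes = other_minutes + core_minutes
hours = total_minutes / 60

# "Real time" years: 24 hours a day, 365 days a year, with no breaks.
HOURS_PER_YEAR = 24 * 365
years = hours / HOURS_PER_YEAR

print(f"{total_minutes:,} man-minutes")
print(f"{hours:,.1f} man-hours")
print(f"{years:.1f} man-years of continuous work")
```

Measured in working hours rather than continuous time, the figure in years would of course be several times larger.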
