Wiktionary:Beer parlour/2006/August

This is an archive page that has been kept for historical purposes. The conversations on this page are no longer live.

Swadesh lists

I note that has begun moving the Swadesh lists out of the Appendix and into the main namespace. Is this something that we have agreed upon? I don’t remember seeing a discussion about it anywhere. —Stephen 06:23, 1 August 2006 (UTC)

Absolutely not. I've reverted. — Vildricianus 07:55, 1 August 2006 (UTC)

Templating Min Nan, Mandarin, Cantonese

(see long and productive discussions above)

I've created several templates for use in Chinese/CJK language entries. The names follow a common pattern: the ISO 639-3 language code, the script form, and the part of speech. So nan-poj-noun, for example, is Min Nan, where the headword is in poj, and the entry is a noun. These are used on the "inflection" line, following the POS.

Likewise nan-poj-adj and nan-tra-noun. Using tra and sim for Traditional Chinese and Simplified Chinese.

The parameters are poj=, pojn=, tra= and sim=. So (remembering that the headword here is "Beer Parlour"):

See sû-tián, 辭典 (Min Nan entry), and lâng for examples. The templates add the words to the appropriate categories. We can then have (e.g.) cmn-tra-noun for Mandarin, adding a pin= parameter for the Pinyin form, and so on. Robert Ullmann 11:02, 2 August 2006 (UTC)

I think the templates look good. A couple of comments. There are some cases where a word will not have a simplified form. It is best to have this type of word listed in both traditional and simplified categories. Can we do something like {{nan-t&s-noun|poj=kang-hu|t&s=]}} where t&s (or something else that you think is appropriate) would indicate that the word should be included in both Category:nan-tw:Nouns and Category:nan-cn:Nouns and would say simplified and traditional on the headword line? Also, back to my earlier question about changing zh to cmn: should we change all zh tags to cmn? If so, does someone have a bot that can do this? Also, if we decide to go with level two headers for Mandarin, Min Nan and Cantonese, can someone run a bot to change that as well (the bot would have to look for ==Chinese== and change it to ==Mandarin==, unless a level three header like ===Mandarin=== exists, in which case the bot would delete ==Chinese== and change ===Mandarin=== to ==Mandarin==)?

A-cai 11:30, 2 August 2006 (UTC)
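
The header rule described above could be sketched roughly as follows. This is only an illustration in Python, not an existing bot: the function name promote_mandarin_header is hypothetical, it works on the wikitext of a single entry, and it does not adjust the levels of any headers nested under ===Mandarin===.

```python
import re

# Level-two "Chinese" header and level-three "Mandarin" header, each on its own line.
CHINESE_L2 = re.compile(r"==[ \t]*Chinese[ \t]*==[ \t]*")
MANDARIN_L3 = re.compile(r"===[ \t]*Mandarin[ \t]*===[ \t]*")


def promote_mandarin_header(wikitext):
    """Hypothetical helper: apply the proposed header rule to one entry."""
    lines = wikitext.split("\n")
    has_mandarin_l3 = any(MANDARIN_L3.fullmatch(line) for line in lines)
    result = []
    for line in lines:
        if CHINESE_L2.fullmatch(line):
            if has_mandarin_l3:
                # ===Mandarin=== exists, so ==Chinese== is deleted outright.
                continue
            # No ===Mandarin=== header: rename ==Chinese== to ==Mandarin==.
            result.append("==Mandarin==")
        elif MANDARIN_L3.fullmatch(line):
            # Promote the level-three header to level two.
            result.append("==Mandarin==")
        else:
            result.append(line)
    return "\n".join(result)
```

A real run would still need a human check of each entry, since the sketch has no way of telling whether the content under ==Chinese== really is Mandarin.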

That is an example of a case where having named parameters is useful. If you use nan-tra-noun, and don't provide a sim= parameter, we can make it add to both categories. I haven't (yet) written all the conditional code that the templates will need; wanted to keep it very simple for now, while we consider if this abstraction is what we want. We don't need a bot to change zh- category tags; with the templates they will never appear in entries, only in the templates. So we can change them when and if desired. On your other point, I'd be very concerned about a bot that considered anything now marked Chinese to be Mandarin. From what I've seen there is too much variation; which is what we are trying to deal with. And see next comment. Robert Ullmann 12:59, 2 August 2006 (UTC)

Explaining the bit about the zh- and -cn, -tw categories: when there are sub-categories to "noun"; using 唐氏綜合症 as an example, we could use {{cmn-tra-noun|pin=Táng shì zònghé zhèng|pinn=tang2shi4zong4he2zheng4|sim=唐氏综合症|cat=Diseases}} and it can generate the cat tag(s) correctly, and can be changed however we like later. Robert Ullmann 13:20, 2 August 2006 (UTC)
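
As a rough illustration of the conditional logic discussed here (add the entry to both categories when a traditional-script template is given no sim= parameter), the following Python sketch computes the category names. It is not the template code itself: the function categories_for and both lookup tables are assumptions, and only the nan-tw and nan-cn category names come from this discussion.

```python
# Region suffixes follow the category names mentioned in this thread
# (Category:nan-tw:Nouns, Category:nan-cn:Nouns). Everything else is an
# illustrative assumption, not the real template logic.
SCRIPT_REGION = {"tra": "tw", "sim": "cn"}
POS_CATEGORY = {"noun": "Nouns", "adj": "Adjectives"}


def categories_for(lang, script, pos, sim=None):
    """Return the category names for an entry whose headword is in `script`."""
    regions = [SCRIPT_REGION[script]]
    if script == "tra" and sim is None:
        # No separate simplified form was given, so the same spelling
        # serves both scripts and belongs in both categories.
        regions.append(SCRIPT_REGION["sim"])
    return [f"Category:{lang}-{region}:{POS_CATEGORY[pos]}" for region in regions]
```

For example, categories_for("nan", "tra", "noun") yields both the nan-tw and nan-cn noun categories, while passing a sim= value yields only the nan-tw one.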

I want to point out (as Connel has): we are considering this; we need to take some time before making wholesale changes. The direction is very good, but it is premature to be inventing bots. Right now we are just using sû-tián, 辭典, and lâng as examples. Robert Ullmann 12:59, 2 August 2006 (UTC)

I agree that it is premature to start writing bots. However, I want to make sure that I don't end up having to change everything by hand if we DO decide to go with the new way (which I also think is heading in the right direction). There is a catch-22 for me: the more entries I make, the more I understand what I am looking for in terms of format. However, once I decide I like a new format, I then am forced to go back and change all of the entries that I have made previously if I want a consistent look and feel across all of the entries. I'm not sure how or if that can be avoided. I think of Wiktionary as basic trial and error, followed by beer parlor debates.

This is precisely what templates are for: so we can change the look and feel later (or even per-user), and you can concentrate on the content. Yes, right at this very moment it is a bit uncertain, but we are working on just that. Robert Ullmann 13:58, 2 August 2006 (UTC)

BTW, I hope nobody minds that I have been focusing more on Mandarin than Min Nan. This has been deliberate on my part, since far more English speakers study Mandarin than Min Nan. Someday, if Wiktionary starts to attract a lot more Mandarin contributors, I might feel freer to focus on Min Nan but until then, my primary focus is to put in Mandarin words and phrases that are commonly used by native speakers, but rarely found in other dictionaries. As one person, I think this approach offers the best bang for the buck.

A-cai 13:54, 2 August 2006 (UTC)

WOTD guidelines

I added the material that has been sitting at Wiktionary:Word of the day/Nominations (comment)/guidelines for discussion to Wiktionary:Word of the day/Nominations. RJFJR 13:28, 2 August 2006 (UTC)

Who writes this stuff?

Generally I'm impressed with the breadth of our coverage, but as others have pointed out before, just because we've got entries for a bunch of words doesn't mean they're all good entries. Just now I came across "poker: n. 1. thing one can poke (a fire) with... 3. a type of card game". And that led me to "coals: n. Plural of coal" which isn't just lame, it's wrong; coal is a mass noun that generally has no plural, while coals is a different mass noun which has to do with burning wood, not coal. Sheesh.

Anyway, what projects do we have for systematically reviewing our coverage of basic words? I've been happily delving into esoterica such as po-faced, micturient, and gasogene, but I'd be glad to pitch in on double-checking the "boring", basic words and fixing those that need it, too.

—scs 21:51, 3 August 2006 (UTC)

Yes, that is really all we can do - fix them as we find them. You could go to Special:allpages and work your way through - but it will take a while. Good hunting! SemperBlotto 21:55, 3 August 2006 (UTC) p.s. In British English coal can mean a lump of coal. An old-fashioned warming-pan was filled with "hot coals" - probably not wood.

By the way, that usage is identical in American English. coals can be multiple lumps of coal (to put in an old-fashioned coal stove) or multiple lumps of charcoal for a BBQ. --Connel MacKenzie 09:50, 5 August 2006 (UTC)

Well, what we could do more than that would be to be somewhat systematic about it: start not with all of Special:allwords but rather just a list of "basic" words (we've got several such lists), and provide some way for multiple participating editors to share the work but avoid duplicating effort, by marking off which ones have been reviewed so far. I thought I remembered seeing a couple mentions of such efforts, but if not, I'll throw something together. —scs 22:24, 3 August 2006 (UTC)

There is also WT:PCBEW if you don't feel like a quick 80,000+ edits, though. --Connel MacKenzie 00:22, 4 August 2006 (UTC)

Thanks, I'll check that out; it's probably what I had in mind. —scs 00:48, 4 August 2006 (UTC)

It would be nice to be able to use Special:Short pages and Special:Uncategorized pages for this purpose. In the case of Chinese entries, since all languages are lumped together and Chinese is at the end, I've never been able to make this work. If these two features of Wiki were improved upon, it would help single out the incomplete entries (which tend to be the entries with fewer edits and more errors).

Another possibility would be to develop a Special page that lists entries only edited by a single editor, or alternatively entries with less than X number of edits (pick your number). A-cai 00:59, 4 August 2006 (UTC)

SemperBlotto requested something like that a while back. Very recently, Brion pointed me to http://download.wikimedia.org/enwiktionary/latest/enwiktionary-latest-stub-meta-history.xml.gz so I should be able to churn out something useful soon. (Perhaps later this weekend.) --Connel MacKenzie 09:53, 5 August 2006 (UTC)
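
A list of entries with fewer than X edits could be produced from such a dump along these lines. This is a minimal Python sketch under stated assumptions, not the script referred to above: the function names are hypothetical, and it relies only on the general shape of a MediaWiki XML export (page elements containing a title and one revision element per edit). It covers the revision count only; finding entries edited by a single editor would additionally need the contributor data.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def rarely_edited_pages(xml_source, max_revisions):
    """Yield (title, revision_count) for pages with at most max_revisions edits.

    xml_source is a filename or a binary file object holding a MediaWiki
    XML export; for a .gz dump, open it with gzip.open first.
    """
    title = None
    revisions = 0
    # Stream the file so that a large dump never has to fit in memory.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # free the revision's children once counted
        elif name == "page":
            if revisions <= max_revisions:
                yield title, revisions
            title, revisions = None, 0
            elem.clear()
```

The threshold is left as a parameter because the right value of X is exactly what is still open in this discussion.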

Use a pocket dictionary and go through it. I've got one with the 25,000 most basic English words. And take your time for it - cleaning up articles and writing good definitions for basic words while taking the existing translations into account takes ages. Apart from that, randompaging English entries could help.

A-cai, perhaps this special page is something for you. — Vildricianus 08:17, 4 August 2006 (UTC)

Thanks, I'll work on these.

A-cai 10:46, 4 August 2006 (UTC)

Just one comment. If people do change existing defs - except for tosh of course, - please leave a note on the talk page explaining the reason. ALSO A GENERAL PLEA. IF WORDS ARE DISCUSSED IN THE TEA ROOM OR RFC, PLEASE PUT THE DISCUSSION ON THE TALK PAGE OF THE ARTICLE. Andrew massyn 10:09, 5 August 2006 (UTC)

Hrm...in relation to your plea, would it be better to just put a wiki-link to the discussion page under each header when it's added to RFC/Tea Room/RFV? Perhaps that should even become policy? If not, who is the responsible party for the move? (I'm thinking RFV/RFC here) The person who initially added the rftag? Or the administrator who removes it? Jeffqyzt 15:46, 10 August 2006 (UTC)

For rfv I generally add the discussion to the talk page as administrator once a decision has been made. I like the idea of having a link from the discussion rooms to the talk page of the article, and think it would be a good policy. Andrew massyn

relationship between Asian languages

I was just now wikifying the word 親切. Someone had put in the word without a language header and wrote something like "shin-setsu" "kind action" as the definition. I decided to tackle this word, and it reminded me of something I've been trying to document in various ways, but haven't been quite satisfied. This is the issue of the relationship between Asian languages. Now that it seems that we are moving away from marking entries with a generic ==Chinese== header, a new dilemma has arisen. I want to convey that the Japanese word shinsetsu is descended from Chinese. If I put:

==Mandarin==
===Adjective===
親切 (Pinyin qīnqiè, qin1qie4)

kind, cordial

===Descendants===

Japanese: 親切 (しんせつ, shinsetsu)

This implies that the Japanese word shinsetsu is descended from Mandarin. The fact that the word appears in both the Dream of the Red Chamber and Romance of the Three Kingdoms tends to corroborate the fact that shinsetsu descended from Chinese, since relatively few words went from Japanese to Chinese until much later. But that does not prove that the word is descended specifically from Mandarin, it could have easily been any number of other Chinese dialects. This was not a problem when we were labeling the above entry ==Chinese==. Now, it seems a bit awkward. Perhaps ===Relatives=== or ===Cognates=== would be a better header. I don't want to do ===Synonyms===, because shinsetsu is not used in Mandarin. I could do ===Translations===, but that would not drive home the point that shinsetsu and qīnqiè are relatives. Are there any linguistics experts around that can offer advice? A-cai 11:54, 4 August 2006 (UTC)

BTW, this problem becomes more pronounced when you look at the simplified entry (亲切):

==Mandarin==
===Adjective===
亲切 (Pinyin qīnqiè, qin1qie4)

kind, cordial

===Descendants===

Japanese: 親切 (しんせつ, shinsetsu)

I'm thinking of doing this:

==Mandarin==
===Adjective===
亲切 (Pinyin qīnqiè, qin1qie4)

kind, cordial

===Cognates===

Japanese: 親切 (しんせつ, shinsetsu)
Korean: 親切 (친절, ch'in ch'e)

After reading the Wikipedia description of cognate, I am beginning to favor ===Cognates===. I'm curious if anyone has any opinions. A-cai 11:57, 4 August 2006 (UTC)

I can verify that shinsetsu is an ON-reading, meaning that it’s borrowed from Chinese. I think the Chinese article should list it as a cognate, and then the Japanese article can either say it’s from Chinese or simply that it’s an on'yomi. I don’t think there is an on'yomi category, but one might be useful. —Stephen 12:30, 4 August 2006 (UTC)

Thanks for the input. There's just one more issue. Where should I put the ===Cognates=== header? Right now, I'm listing it under the Mandarin header, but should I copy it into each language section (Cantonese, Min Nan etc.) or should I just pick one such as Mandarin (the language with the most number of speakers in this case) to put the header under? Thanks.

A-cai 22:35, 4 August 2006 (UTC)

So far, and we have discussed this before, cognates have always gone in the etymology section. Also so far we've never had a separate section for it but I don't think it's a bad idea. I would suggest that it remain part of the etymology section though so it seems logical to make it a subsection of it, just as the Homophones section is a subsection of the Pronunciation section. — Hippietrail 00:47, 5 August 2006 (UTC)

Thanks again. That sounds good to me. Question, am I correct in assuming that only one etymology section need exist per page if the word has the same etymology across multiple languages (in the case of Asian languages, the likely maximum would be: Cantonese, Japanese, Korean, Mandarin, Min Nan, Vietnamese)?

A-cai 01:16, 5 August 2006 (UTC)

Please strive for one etymology section per language section. Having entries for several languages per page should be seen more as a bug than as a feature and it's one of our goals to eventually provide an option for users to see only the languages they are interested in. Also the etymology sections will be different if Chinese borrowed from Japanese or Japanese borrowed from Chinese for instance. — Hippietrail 01:25, 5 August 2006 (UTC)

I think I now understand. So it should look like this, correct? Here is the original word that started this.

A-cai 01:36, 5 August 2006 (UTC)

User:Widsith grudgingly gave up on the Cognates heading not too long ago. I forget now, some of the main arguments I (and others) had against it. I imagine he will be delighted to hear that =Derivatives= may go away, in deference to =Cognates=. But I still have reservations about it. It is a very uncommon term in English...perhaps it should be (gasp) =Cognates=? Or dare I say, (shudder) ={{cognate}}=? --Connel MacKenzie 10:01, 5 August 2006 (UTC)

Properly speaking, the term "cognates" only refers to words inherited from a common ancestor. It doesn't include common loanwords. --Ptcamn 10:06, 5 August 2006 (UTC)

For now, I will use:

===Etymology===
blah blah
====Cognates====

If someone comes up with a better way that we all agree upon, we can revisit the issue.

Ptcamn, I'm not sure I understand your point about common loanwords. Can you give an example of something you consider to be a cognate vs. common loanword? A-cai 13:06, 5 August 2006 (UTC)

Cognates can be traced to a common source through inheritance. For example, English brother was inherited from Old English broþor, which was inherited from Proto-Germanic *brōthar-, which was inherited from Proto-Indo-European *brāther-. Latin frater is also inherited from PIE *brāther-, so brother and frater can be said to be cognates. English is sort of like Latin's niece.

However, English fraternity and French fraternité can't be said to be cognates. French fraternité is inherited from Latin fraternitas, but English fraternity is not—it can't be, because English is not a daughter of Latin. English fraternity is a loanword.

Japanese and Korean didn't inherit shinsetsu and ch'in ch'e from the same source that Chinese inherited qin1qie4. They aren't even known to share a common ancestor at all. Japanese and Korean just borrowed the word. --Ptcamn 20:06, 5 August 2006 (UTC)

Actually, most Sino-Japanese and Sino-Korean words are thought to be borrowed from Middle Chinese. Mandarin is descended from Middle Chinese. Since shinsetsu, ch'in ch'e and qīnqiè all descended from Middle Chinese, it feels wrong to me put:

==Mandarin==
===Etymology===
blah blah
====Loanwords====

Japanese: 親切 (しんせつ, shinsetsu)

This would imply that shinsetsu is borrowed from Mandarin, which cannot be proven definitively. Mandarin did not come into its own until the Yuan Dynasty, but most Sino-Japanese words came into Japanese much earlier. This word is probably not the best example, because the earliest use of the word shinsetsu that I can find in Japanese (from a quick search on-line) is from Gakumon no Susume ("An Encouragement of Learning"), written by Fukuzawa Yukichi between 1872 and 1876.

If the label ====Cognate==== turns out to be problematic, what about ====Cousins==== or ====Relatives====? A-cai 00:39, 6 August 2006 (UTC)

How about just:

==Mandarin==
===Etymology===
From Middle Chinese (whatever), from Old Chinese (whatever). Compare Japanese (whatever) and Korean (whatever).

? I think putting each on a separate, bulleted line would be a bad idea, because with some widely-borrowed words it could get very, very long:

==French==
===Etymology===
From Old French (something), from Latin caballus (originally "nag", later any horse).
====Cognates====

Apalaí: kawaru
Asturian: caballu
Catalan: cavall
Cebuano: kabayo
Tumbalá Chol: cawayu'
Esperanto: ĉevalo
Friulian: cjaval
Galician: cabalo
Mbyá Guaraní: kavaju
Ido: kavalo
Interlingua: cavallo
Irish Gaelic: capall
Italian: cavallo
Karipúna Creole French: xuval
Ladin: ciaval
Yosondúa Mixtec: kuayu
Classical Nahuatl: cahuayoh
Isthmus-Mecayapan Nahuatl: cahua̱yoj
Novial: kavale
Occitan: caval
O'odham: kaviyu
Sayula Popoluca: cawa̱yu
Portuguese: cavalo
Romanian: cal
Romansh: chaval
Sardinian: cabaddu
Seri: caay
Spanish: caballo
Tagalog: kabayo
Welsh: ceffyl
Xavánte: awaru
Yatzachi Zapotec: cabey
Zoogocho Zapotec: cabayw

===Noun===
cheval m (plural chevaux)

horse

And this list is not nearly complete, and I'm not even counting derivatives like English cavalry, chivalry, cavalier, cavalcade... --Ptcamn 13:32, 6 August 2006 (UTC)

User:CORNELIUSSEON

This user has contributed for some time now. All his definitions seem to be copy/pasted from various US military handbooks. This may or may not be copyvio but it seems to me to be against the spirit (ethos?) of Wiki. What do you think? Also, when a word already exists he just adds his definition to the end even though it is the same as the first definition but using different words - I find this confusing but am not confident to merge them. Παρατηρητής

I don't see any problem with importing public domain sources. In fact, I think it should be aggressively encouraged (with proper attribution, of course.) The US Government texts are, of course, public domain sources. But they also have sometimes peculiar subtle wordings. I would think that most of the time, the separate line definition is therefore appropriate. Can you point to some specific examples, please? --Connel MacKenzie 09:38, 5 August 2006 (UTC)

unnecessary adjective senses?

Oftentimes, a noun seems to function as an adjective:

a Barbary pirate
a Mafia lawyer
a San Francisco native
a Wikipedia project

I remember Steven Pinker talking about these in The Language Instinct. He made fun of elementary school grammar teachers for claiming that these words therefore were adjectives, believing as they did that a Noun Phrase was always <article, noun> or <article, adjective, noun>.

If I remember correctly, a better explanation (i.e., a better way of parsing these examples) is that a Noun Phrase can also be <article, noun1, noun2>, where noun1 merely acts adjectivally but doesn't take on the full mantle of actually being an adjective. (Otherwise, virtually every noun in the language would have to have an adjective sense listed.)

Anyway, the reason I bring this up is that I just noticed that we have an adjective sense explicitly listed at sister, as in

a sister city

I believe this is unnecessary, per the arguments above. I think it's not necessary to add adjective senses for cases like these, and that it's probably even worth deleting those that exist, to reduce clutter. (For sister cities, it's probably worth adding an entry for the derived term sister city.) What do other people think?

—scs 13:07, 5 August 2006 (UTC)

Unless I'm extremely confused, nouns are said to be used attributively in such cases. This of course doesn't make them adjectives. As such, there shouldn't be any mentioning of these in our entries. — Vildricianus 13:51, 5 August 2006 (UTC)

Tell that to the editor repeatedly attacking Bermuda, please. --Connel MacKenzie 20:32, 12 August 2006 (UTC)

I agree, and note that we have many entries with such overspecification.

<gripe>If we allowed use of the categories Category:English nouns and Category:English adjectives, it would be easy to browse through Category:English adjectives to spot such instances. By denying use of those categories, we make such review and correction much more difficult.</gripe>

Rod (A. Smith)

So, since there is now a demonstrable use of those categories, would anyone object to my adding them to {{en-noun}} and {{en-adj}}? Rod (A. Smith) 15:53, 6 August 2006 (UTC)

Yes. Please first review all entries in these categories. All hard-coded references to them should be removed, afterwards populating them only via the template. Or, that's what would be easiest. — Vildricianus 18:56, 6 August 2006 (UTC)

That particular difficulty is an unavoidable consequence, I think, of the fact that our data structure is too loose to adequately support some of the things we're trying to do (if, indeed, it can be called a "data structure" at all).

The fact that a particular word has noun, verb, and/or adjective senses is already mentioned on a page -- but it's not specified in a way that makes it particularly easy to automatically parse. Explicitly adding words to part-of-speech categories would make the automatic processing easier, but would make page maintenance harder. Furthermore, explicitly adding words to part-of-speech categories would mean that the part of speech would always be specified in two different ways in each entry (i.e. by the part-of-speech header and the presence of the category), and from a data structuring perspective, such redundancy is always suspect.

(This is merely an observation, not an attempt to reopen what I suspect may be an old argument.)

—scs 16:08, 6 August 2006 (UTC)

My suggestion (which was reverted way back when) is to add Category:English nouns to the one and only (eventually) ubiquitous English noun template: {{en-noun}} and to do the same for English verbs, adjectives, and adverbs. Doing so will not increase maintenance efforts per entry, because all English noun lemma entries should have {{en-noun}} anyway. Rod (A. Smith) 16:20, 6 August 2006 (UTC)

<gripe> As a programmer, it seems silly to me that we have both part of speech headers and a template indicating part of speech; if there were an object that represented an entry's part of speech, surely we could modify the base object to provide feedback as to P.O.S. Requiring the same data to be specified multiple times violates good object oriented design. But I haven't looked (and don't plan to) into our codebase, so I'll shut up. </gripe>

If it is the intent to add these templates to all words for which they are appropriate, perhaps it would make sense to note that in either Help:How_to_edit_a_page or Wiktionary:Entry_layout_explained? Jeffqyzt 18:38, 6 August 2006 (UTC)

Yes, preferable to duplicating the language and POS info would be to move the headers into the language-pos-headword-inflection templates, but MediaWiki 1.8, upon which en.wikt is based, lacks certain features that we'd need to maintain section editing with such a solution. In this style of Wiktionary, such duplication is necessary. :-(

(unindenting for space and visibility...)

(...unindented for space and visibility) So, working with the above constraints, I'll write something to strip Category:English nouns from everywhere, request bot status to execute it, execute it, add the cat to {{en-noun}}, and reiterate for the other main English POS unless anyone objects in the meantime. (wishing we had wikidata) Rod (A. Smith) 20:55, 6 August 2006 (UTC)
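
The stripping step could look something like the sketch below. It is written in Python purely as an illustration and is not the bot itself: the function name strip_category is hypothetical, it assumes the categories appear as plain [[Category:...]] links (optionally with a sort key), and fetching and saving pages is left out entirely.

```python
import re


def strip_category(wikitext, category):
    """Remove every hard-coded [[Category:<category>]] link from one entry."""
    pattern = re.compile(
        r"\[\[\s*[Cc]ategory\s*:\s*"   # opening brackets and namespace
        + re.escape(category)          # exact category name
        + r"\s*(\|[^\]]*)?\]\]"        # optional sort key, closing brackets
        + r"\n?"                       # swallow the line break left behind
    )
    return pattern.sub("", wikitext)
```

Once every hard-coded link has been removed this way, adding the category to {{en-noun}} repopulates it from the template alone, which is the order of operations proposed above.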

If interested, see the proposed approach documented at WT:GP#Moving POS categories into POS templates. Rod (A. Smith) 23:16, 6 August 2006 (UTC)

I suggest including these where it is clear that the noun is used attributively as well as in its usual form (for example, for many well-known place names, such as New York or London : "the New York skyline"; "a London bus"). For myself, I keep attributive senses under the noun section, and use the label (as a modifier) or (attributively). The latter is the practice of some other dictionaries (such as the OED). Claiming that these are adjectives blurs the meaning of "adjective" and claims that such words have a grammatical function that they do not actually have. — Paul G 09:23, 8 August 2006 (UTC)

In my opinion, there should be great scrutiny of any adjective that does not have comparative or superlative forms, purportedly. Among the ones so listed are periodic, mental, ubiquitous, native... what the heck? These are obvious errors, and many others are questionable cases. Given that a word is an adjective, by default we should assume that more/most are acceptable modifiers, and as with the attestation process the burden of proof should be on "no comparative or superlative forms". DAVilla 03:38, 14 August 2006 (UTC)

Hmmm... unique can not (properly) be used with superlatives, yet is clearly an adjective. However, I believe things can be more or less ubiquitous. bd2412 T 15:27, 17 August 2006 (UTC)

I'm not trying to get them all thrown out. There are a couple of mathematical terms for which more/most makes no sense at all. I'm just saying there should be greater scrutiny. If it's claimed that an adjective has no comparative or superlative forms, more likely the claim is wrong, or the word isn't an adjective (in the English language sense) to begin with. DAVilla 22:18, 18 August 2006 (UTC)

presumptious

Why does this page redirect to presumptuous (the correct spelling)? Either it is a common misspelling (Google has 147,000 hits, which suggests it probably is) or it doesn't belong here.

As I seem to have answered my own question, I'm going to make this a misspelling page. — Paul G 06:50, 6 August 2006 (UTC)

OK, we have no objection to stream of consciousness ramblings! SemperBlotto 07:16, 6 August 2006 (UTC)

Ouch. Have I been misspelling it my entire life? --Connel MacKenzie 20:40, 12 August 2006 (UTC)

User:Dangherous

Moved to WT:A#Desysop request started

History

When we compare histories and hit the button, the list of versions should still show up. 68.148.165.213 19:24, 7 August 2006 (UTC)

That is a fine suggestion. For heavily edited entries, there might be some layout problems with that, where only the ten nearest are shown, perhaps. Have you filed anything on bugzilla: about it yet? --Connel MacKenzie 04:46, 8 August 2006 (UTC)

CheckUser run on all sysops

Discussion moved to Wiktionary:Votes/2006-08/CheckUser run on all sysops.

Rhyme formatting

Many of the existing Rhymes: links, and the new ones the RhymeBot had been adding up until this morning, looked like this:

Pronunciation

enPR: pâr
IPA: /peə(r)/
Template:X-SAMPA
Homophones: pair, pare

Rhymes: -ɛə(r)

Paul pointed out, however, that WT:ELE does not specify that extra indentation, such that they should really look like this:

Pronunciation

enPR: pâr
IPA: /peə(r)/
Template:X-SAMPA
Homophones: pair, pare
Rhymes: -ɛə(r)

Do people have opinions on this either way? Please comment on:

The extra indent for Homophones and Rhymes links is poor and should not be used.
The extra indent for Homophones and Rhymes links looks good. WT:ELE should be revised to specify it.
I don't care either way.
This is important enough to standardize that the RhymeBot should go back and fix all the ones that don't conform to whatever we decide here.

—scs 13:28, 8 August 2006 (UTC)

It was me who started indenting homophones and rhymes. The reason I did this was because many words have several pronunciations. Each pronunciation can be represented in various schemes which are all equivalent and each has its own homophones and rhymes. Without the indenting it's not possible to tell if there is a structure or if the various pieces have just accumulated over time in no particular order. This reasoning is not immediately apparent on words with a single pronunciation but some type of structure is essential for those with several. PS when I came up with this design I also put all pronunciation schemes together on a single line but others felt this was wrong and undid that work in most cases. The result is an ugly format that nobody likes, and not just for rhymes but for the whole pronunciation section. Sooner or later we'll have to give it a good solid redesign. — Hippietrail 13:55, 8 August 2006 (UTC)

Rewriting bot policy

Wiktionary talk:Bots

Comments or a simple support appreciated. — Vildricianus 20:19, 8 August 2006 (UTC)

Limited sysop access

Hello. I'm one of several users maintaining the MetaProject on open proxies (see local chapter), which coordinates the blocking of open proxies and their unblocking upon being closed. Wiktionary is somewhat behind other projects in doing so; as of now, there are 80 open proxies listed for blocking on Wiktionary (see MetaProject and Blacklist). I'd like to have administrator access to synchronise Wiktionary with the MetaProject. This involves blocking new proxies as they are discovered and unblocking those that are subsequently closed; any other administrator tool will be left unused. My administrator access on Meta-Wiki, Wikisource, and Wikipedia (see confirmation) may be a reference to my good behaviour. :)

However, I'd like to discuss the prospect before requesting it. I'm not aware of any precedent on Wiktionary; a current Wiktionary administrator may decide to clear the backlog and regularly maintain synchronisation instead. // Pathoschild (_editor / ^talk) 04:17, 9 August 2006 (UTC)

Hi. Yes, we block open proxies here. I try to check WT:OP here. I ran my script to auto-block the open proxies and tor exit points several months ago, so I'm not surprised to hear that we are now out of date. As of right now, I haven't found the time to sync the list with the meta additions. Nor even the Wikipedia updates.

Wiktionary has traditionally discouraged giving sysop status only for vandal-fighting type activities. But if I see you listed on WT:A you'll get my vote. (Don't you have developer access, anyway?) --Connel MacKenzie 17:17, 12 August 2006 (UTC)

I have neither developer nor steward access; I'm an administrator on a few sister projects, but that doesn't affect my access here. :) // Pathoschild (_editor / ^talk) 02:37, 13 August 2006 (UTC)

Maintenance templates list

I'm not exactly sure where to put this, but, eh? Anyway, I put together a list with in-action examples of all the maintenance templates in Category:Maintenance templates. Any ideas, thoughts, etc. would be greatly appreciated. Foxjwill 23:04, 9 August 2006 (UTC)

We used to try to just keep WT:I2T up to date. Categories were avoided early on, as they confuse the template "code" a bit. But our simplicity guidelines kind of went out the window when we got parser functions.

So, yes, any cleanup you feel like doing should be helpful, and is appreciated. --Connel MacKenzie 18:37, 12 August 2006 (UTC)

Ok, so I'll put a link to it on the top of the page. Foxjwill 22:43, 12 August 2006 (UTC)

Is Wonderfool a part of the Anonymex Wikipedia vandal gang?

After all of the chaos Wonderfool caused here, I remembered something on Wikipedia that filled me with a sense of dread. I did some searching and found a dossier for a possible vandal gang which had a rumor that it was going to try to do the same thing to Wikipedia. Here is the dossier for the Anonymex vandal gang on Wikipedia. The rumor listed reminded me of what happened here. Could Wonderfool be a member of this gang, and if so, could someone place a warning on w:WP:AN or w:WP:AN/I about what happened here, so that we could foil this attack before it starts on Wikipedia? He may be using Wiktionary as a practice run for learning how to attack Wikipedia in the same manner. It might also be a good thing to ask for a CheckUser on Wikipedia at w:WP:RFCU on Wonderfool's IP address to see if he has any accounts there, and to have him banned before he tries the same thing there. I am not an administrator here (nor Wikipedia for that matter), so I only saw the damage done, not what happened on the inside. Therefore, I do not feel that I would leave comments detailed enough for administrators on Wikipedia to know what to do. Jesse Viviano 03:12, 10 August 2006 (UTC)

If this was a coordinated effort, the only key thing he/they did, was time the festivities to be near the height of WikiMania. Offhand, I can't think of anything (outside of normal sysopping) that any Wikipedia sysop could do, even if they knew in advance something was about to happen. The average Wikipedia response time to any incident is tremendously better than en.wikt:'s. Where we had to goof around, the Wikipedia cabal has direct phone numbers to various key people.

That said, I will try to get a notice onto w:WP:AN today or tomorrow. --Connel MacKenzie 04:33, 10 August 2006 (UTC)

I will try to get something coherent posted on w:WP:AN soon. But having just read the Anonymex link above, I'd have to say that Wonderfool is certainly not related in any way to that vandal, from what I can see. --Connel MacKenzie 08:13, 11 August 2006 (UTC)

Minnan-ascii-bot (revival)

(To Mr. Mackenzie, Mr. Ullmann, Mr. Blotto, etc.)

I'd like to reopen this issue, since it has not been discussed for a long time. Almost everybody has agreed with it, but two comments have already been made about using templates for searching:

A-giâu: Actually they might be right, though it still needs to be empirically proved. Even if that's true, I can still see a bot inserting ASCII into articles that don't have them.
A yao: Templates are longer by one step. If you use Minnan-ascii-bot, you just type the Min Nan ASCII in the search box and there you go! But if you use templates, you still need to type the word and then click the entry.

Please check with A-cai for his opinion. I will even ask the members of the Min Nan wiki whether they prefer templates or Minnan-ascii-bot. But in my opinion, Minnan-ascii-bot is the quickest way to search. A yao 14:13, 10 August 2006 (UTC)

First, it has not been anything like a long time. But mostly: I'm sorry, I simply do not comprehend what can possibly be wrong with entering sû-tián in the search box if that is what you are looking for? Surely if you write in Bân-lâm-gú every day you have no trouble keying sû-tián? I mean, how do you write email? Send IM messages? Write filenames and edit text? Doesn't your computer let you do that? (I know Windows used to be impossible, but haven't they fixed that in the decade since UTF-8 became the standard? I'm using a borrowed Windows XP laptop right now, but Ubuntu Linux runs UTF-8 native.) On WinXP: Move mouse to search box. Left click. Type on keyboard s ^ u - t i ' a n. Search box reads "sû-tián". Move mouse to "Go" button. Left click. See entry for sû-tián in Bân-lâm-gú. Wonder yet again what the problem can possibly be? Robert Ullmann 21:14, 10 August 2006 (UTC)

Robert, I responded to the issue of why people would want to use numbers and letters in the original discussion. You had offered to work on an indexing scheme to accommodate this need. If such an index were in fact built (and worked), I think it might make the Min Nan ascii bot unnecessary. As you know, I tried one of your suggestions (about typing the letters and numbers into individual entries). What a nightmare, far too much work on the part of the person making the entry! That is definitely NOT the solution!

A-cai 23:00, 10 August 2006 (UTC)

I still don't get it. Why would you or anyone need or want to look up "su5-tian2" when you can easily go directly to sû-tián or 辭典? (in the latter case, for example, type c i 2 d i a n 3 , with the Pinyin IME turned on). If it is too much trouble to put "su5-tian2" in the entry, why would anyone type it? (You do work with the appropriate Input Method Editor don't you? If anyone else wonders what this is, look at, e.g., this WinXP doc page.) Robert Ullmann 09:10, 11 August 2006 (UTC)

I am leaning towards agreeing with User:Robert Ullmann that ay5 and other 'words' do not belong in the main namespace. However, I also agree with User:A-cai that it is highly useful to have redirects for them (somewhere; not necessarily the main namespace, though), to direct users to the actual entries. This may be entirely unworkable, but one idea is: could you make a robot to add the indexing or whatever that Robert Ullmann is suggesting, automatically? It would (to my understanding) mean no more work for humans than the operation of the current robot would. Beobach972 02:53, 11 August 2006 (UTC)

Ok, I will use a less obvious example. If I'm a beginning student of Min Nan, would I know which keystrokes would reproduce seⁿ-ji̍t (seN-jit8 生日: birthday)? I'll answer my own question: I don't even know (I'm good at cut-and-paste:), and I speak Min Nan! I repeat, Min Nan is NOT generally written (with the exception of a handful of scholars and some Wikipedia enthusiasts). Furthermore, I'm not even certain how many non-academic Min Nan speakers have ever even heard of POJ. Am I correct in pointing out that dictionaries are not generally for people who already know stuff? :) If I were a native speaker, I doubt I would ever have reason to look up a basic word like birthday.

But that is really beside the point, other on-line Mandarin and Min Nan dictionaries offer this service. We need to accommodate the needs of more than just the specialist who probably doesn't have a need to look up the information in the first place. I'm not trying to advocate any particular solution. I'm saying that this is a legitimate requirement.

A-cai 09:44, 11 August 2006 (UTC)

Actually, never mind. I'm not going to bother. (you can read deleted comment in the history if you care to.) Robert Ullmann 21:59, 11 August 2006 (UTC)

Robert, I think we may need to table this issue until we get feedback from a lot more contributors. If you look at the history tabs of just about every word in Category:Min Nan, you'll note a single user (yours truly). I'm not sure we can even have a rational debate with only ONE main contributor! What I can tell you is my own habits when searching for a Min Nan term in an online dictionary. Like many non-native speakers of Min Nan who also speak another dialect (usually Mandarin), I either look up the term by the Chinese characters (生日), or phonetically without the tones (seN-jit). Only if I'm absolutely certain of the tones would I type seN-jit8 or seN1-jit8 (Min Nan tones are far more difficult to master than Mandarin tones). If I ran across something online that I could copy and paste, I would copy seⁿ-ji̍t and put it into the search box. In fact, one of the reasons that POJ fell out of favor among Taiwanese academics was the difficulty of typing it on the computer. In Taiwan, TLPA has slowly been gaining ground. For readers who are seriously interested in Min Nan, take a look at 華台英詞彙句式對照集 (A comparison of vocabularies and sentence patterns of Mandarin, Taiwanese Min Nan and English; →ISBN). It was published in 2004, and uses TLPA exclusively. IMO, this is not a fact we should ignore since most of the best recent scholarly works on Taiwanese Min Nan have used TLPA (senn¹ jit⁸). Unfortunately, TLPA is only a few years old, and has not been embraced by Min Nan speakers outside of Taiwan. I personally like Revised TLPA (seN¹ jit⁸), but it is even less well known than TLPA.

A-cai 23:35, 11 August 2006 (UTC)

BTW, IPA would be a great solution for Min Nan (take a look at 生日 now!). However, it took a lot of extra work for me to put all of that in there (compare the time stamp for this post to my last; that's about how long it took). I put (hybrid) for pronunciations that mix and match Zhangzhou and Quanzhou accents. This is a common occurrence in Xiamen, Taiwan and Southeast Asia. I'm not sure (hybrid) is correct for this situation; if anyone has a better suggestion, let me know.

A-cai 01:10, 12 August 2006 (UTC)

At zh-min-nan.wikt, we do not have as clear a guideline as it is here. The redirect thing was just done since someone pointed out the case error thing. I am just trying to clean up the problem. Since I found out here, that there may be future problems with lines after the redirect line, I removed the redirects (I am not finished yet though). I apologize if we are becoming a little annoying. We just feel that there is a need for beginners to have an easier way to search Min Nan words. I am a cut-and-paste person myself like A-cai, hope that in the future, POJ can be typed easier on computers.

As of today, three options I see:

Redirect on a different namespace. Which I initially wanted.
Current search method. Which has been suggested.
modified search method. Maybe there is a way that the search result will display a link to POJ or Pinyin.

example I search for su5-tian2 and the search will look for something like  or  and then display the search result for both the POJ and Pinyin. If only one exist, then redirect directly to the word, which in this case is sû-tián. The  tag can be added at the Template:nan-poj-something as .

I know that you have a lot of things to think about here at en.wikt and such can be stressful, so again I apologize. We just want to know if there is a better solution for this problem.

-- Hiòng-êng 15:55, 13 August 2006 (UTC)

Hiòng-êng, try typing bo5kang5 or bo5-kang5 into the search box, and then hit the search button. You should see:

bô-kâng
Relevance: 69.8% - 0.4 KB (40 words) - 04:10, 13 August 2006
bô-kâng-khóan
Relevance: 61.1% - 0.1 KB (7 words) - 04:19, 25 June 2006
bô-kāng-khóan
Relevance: 61.1% - 0.1 KB (7 words) - 04:20, 25 June 2006

The reason that this works is that I use something like ] for sort purposes within each category. However, we still need a solution for when we type bokang or bo-kang. Interesting side effect, if I type bo1kang1 or bo1-kang1, I still get the above results (so the number is ignored except as a placeholder by the search box) A-cai 23:21, 13 August 2006 (UTC)
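For readers wondering what the "Min Nan ASCII" forms discussed in this thread look like mechanically, here is a minimal sketch in Python of turning a POJ headword into its tone-numbered ASCII form (sû-tián to su5-tian2). It is not the code of Minnan-ascii-bot or of any template mentioned here. The function names are hypothetical, the mapping from diacritics to tone numbers is an assumption based on the examples quoted in this discussion, unmarked tones (1 and 4) are left without a number, and the dotted o͘ is not handled.

```python
import unicodedata

# Assumed mapping: combining diacritic -> POJ tone number.
# Tones 1 and 4 carry no diacritic, so they get no number here.
TONE_MARKS = {
    "\u0301": "2",  # acute accent
    "\u0300": "3",  # grave accent
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}


def poj_syllable_to_ascii(syllable):
    """Hypothetical helper: strip the tone mark and append its number."""
    tone = ""
    letters = []
    # NFD splits a precomposed letter such as "û" into "u" + combining mark.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        elif ch == "\u207f":  # superscript n (nasalisation), written N in ASCII
            letters.append("N")
        else:
            letters.append(ch)
    return "".join(letters) + tone


def poj_to_ascii(word):
    """Convert a hyphenated POJ word syllable by syllable."""
    return "-".join(poj_syllable_to_ascii(s) for s in word.split("-"))


for headword in ["sû-tián", "bô-kâng", "seⁿ-ji̍t"]:
    print(headword, "->", poj_to_ascii(headword))
```

A form generated this way could serve as a sort key or a redirect title; which of those is appropriate is exactly what is being debated above.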

User Problem

User:Connel MacKenzie has been rolling back my changes to articles, claiming they're copyvios. He hasn't said where he thinks they're from, and they aren't copyvios. The etymology of a is straight from Wikipedia and another is just a quote from Shakespeare. Shouldn't there be some sort of proof provided before he destroys my hard work? Thanks.—Aaaaaaaah! 02:17, 11 August 2006 (UTC)

If you had the opportunity to spend time on vandal patrol, you'd understand why it's hard to strike the perfect balance between keeping bad edits (copyvios, inaccurate translations, and sometimes subtle vandalism) out and allowing all good faith edits in. Copyright violations are a serious threat to WikiMedia projects and have even recently forced an entire WikiMedia project to be shut down. Exhaustively researching each edit can be prohibitively time-consuming when reviewing several hundred edits, so editors often must rely on gut feelings. Sometimes, unfortunately, those gut feelings are bound to fail and either let bad edits in or (temporarily) revert good edits. In your situation, your huge edits with excellent structure and grammar understandably appeared to be likely copyvio.

Anyway, it's frustrating to have your work reverted, but since your edits remain in the entries' histories, your version should be easy to recover. When you revert to your version (open the old version from the entry's history and click "edit"), please note your source for huge edits to help reviewers understand that your contribution is legitimate. If your explanation requires more space than the edit summary allows, feel free to use the entry's talk page. Rod (A. Smith) 05:08, 11 August 2006 (UTC)

Last I checked, Wikipedia was overwhelmed by Primetime copyvios too. --Connel MacKenzie 08:28, 11 August 2006 (UTC)

Last time I checked, Primetime was still creating lots of sockpuppets too. User:Tthh = User:Coldfru355 = User:Aaaaaaaah!. --Connel MacKenzie 15:21, 11 August 2006 (UTC)

when you say = do you mean CheckUser = ? Or is there some other basis? Is there a reference to this particular on wikipedia? more curious than anything else Robert Ullmann 20:15, 11 August 2006 (UTC)

Yes, CheckUser. --Connel MacKenzie 17:07, 12 August 2006 (UTC)

Who is Primetime?
What is the name of the user who told you I was Primetime?
Thanks.—Aaaaaaaah! 18:47, 13 August 2006 (UTC)

I reverted yet another attempt by this user to put the alleged copyvio text into a. He put it back, with comment "wait until MacKenzie answers my question". (Side comment to "Aaa..." or whatever you call yourself today: look Ricky, this is getting really boring. I understand that the etymology of "a" in the wikipedia was originally from the 1911 Britannica, and has been edited many, many times. Do you understand that your repeated copyvios discredit every single thing you contribute?) I really don't need to be part of this. Robert Ullmann 20:08, 13 August 2006 (UTC)

(Ricky steps in to prove my point ...) Hi Ricky, you just added to a:

In English the letter a (which, as a letter, is pronounced /ā/, except in a technical linguistic context, in which it is pronounced as the cardinal vowel /ä/) has a wide gamut of phonetic realizations. These range from front half-close vowels over the central ("mute") vowel /ə/ to back low vowels; compare the values of the letter a in the words many ( or, with a more open vowel, ), name (), care (), add ( or ), and cat (), all with front vowels; husband () and career (), with the central vowel /ə/; sofa (American English , but British English ; and all and wash (British ), with back vowels.

The letter a also occurs in combinations of vowel signs, namely aa (only in names or words of foreign origin, such as Spaak, aardvark, and kraal), ae (in words of foreign origin), ai, ao, au, and ay. Such combinations may have various phonetic values. For example, au is pronounced /ä/ or /ô:/ in austere; /ä:/—or, in the United States, /ǎ/—in aunt; /ô:/ in gaunt; and /ā/ in gauge. The combination ae has a different realization in words containing the element aer- (aerial, aerodynamics, aerospace), in which ae is pronounced /ā/, and in words allowing the alternation with the simple vowel sign e, such as aetiology/etiology or encyclopaedia/encyclopedia, in which ae (or simple e) is pronounced /ē:/. In combinations of vowel signs in which a is the second letter, the function of the a is to mark the length of the vowel corresponding to the first vowel sign, as in cease (, or to constitute a specific glide together with the preceding element, as in boat ().

Anyone care to guess the print source from which that was copied? (Anyone who can write "What is the name of the user who told you that?" is -- in my opinion -- utterly incapable of composing the prior two paragraphs!) Robert Ullmann 20:36, 13 August 2006 (UTC)

New words for inclusion in Wiktionary

I think we should mine the following for neologisms that have recently been included in print dictionaries:

Note that it's likely that not all of them would pass our CFI.

By "mine" I mean the words themselves, not, of course, the definitions given there. For now, those words not already in Wiktionary can be added to the Requested articles:English page. — Paul G 10:35, 11 August 2006 (UTC)

See also and many more pages obtained by Googling on the phrase "new words" (as if we haven't got enough to do already!) SemperBlotto 11:02, 11 August 2006 (UTC)
- I have compiled them, for now, at User:BD2412/new word list. Feel free to pluck and cull. bd2412 T 15:00, 11 August 2006 (UTC)

Thanks, SemperBlotto and BD2412. I'll copy these to Requested articles. — Paul G 09:00, 12 August 2006 (UTC)

Added. — Paul G 09:14, 12 August 2006 (UTC)

Half-truth.

Moved to WT:TR#half-truth. Rod (A. Smith) 17:19, 12 August 2006 (UTC)

Logo logo

meta:Wiktionary/logo

Since you forgot... --Connel MacKenzie 20:34, 12 August 2006 (UTC)

Policy request: Etymology

I've noticed that etymologies in articles span a large range of amounts of detail, abbreviations, etc. Since most people will not be familiar with everyone's methods of writing etymologies, why not create a policy standard for, among other things,

length/detail and
abbreviations (i.e. < vs. f. vs. fr. vs. from).

Foxjwill 22:41, 12 August 2006 (UTC)

The accepted practice is to use the "short" format, newest to oldest. If you'd like to take a stab at writing the first draft I'd appreciate it. It might be better named a "guideline" around here, though. Or even, a "proposed tentative guideline recommendation"? :-) --Connel MacKenzie 22:54, 12 August 2006 (UTC)

Here's the first draft you asked for: User:Foxjwill/Etymology guidelines proposal. Foxjwill 00:02, 13 August 2006 (UTC)

This isn't a paper dictionary so what's the need for abbreviation? Visual clarity in the translations section is fine, but f. or fr. could easily cause confusion. DAVilla 23:56, 13 August 2006 (UTC)

Note that there is already a guideline page at Wiktionary:Etymology. It didn't yet have any policy status, so I tagged it just now as a draft policy. Please, if possible, incorporate your suggestions into that document. Rod (A. Smith) 00:35, 14 August 2006 (UTC)

Community portal to do list

I just finished putting together a Wiktionary to do list for the Community Portal based off Wikipedia's. I sincerely urge you to give your thoughts and ideas on it 'cause Rome wasn't built by one person. Foxjwill 22:52, 12 August 2006 (UTC)

Another one! We must have over a dozen floating around now. As much as I tend towards reinventing the wheel, we should probably think of the right place to agglomerate these. If the various notices aren't prominent enough, we should think of where (better) they belong. --Connel MacKenzie 22:58, 12 August 2006 (UTC)

Wow. I didn't realize there were that many. I agree with having a place to conglomerate these types of things, though. Foxjwill 23:30, 12 August 2006 (UTC)

Languages without ISO 639 codes

The criteria for inclusion page reads:

Uncommon languages are acceptable as long as they are (or were) used for everyday communication by some identifiable, natural population of humans. If the language lacks an ISO 639 language code, it is almost surely not acceptable.

But there is a large number of languages that are or were used for everyday communication by an identifiable, natural population of humans that lack ISO 639 codes, mainly due to ISO 639-3's focussing on living languages and excluding many dead languages.

Eclecticology commented on the talk page:

The importance for having this is to avoid treating local dialects as a separate language when the proponents insist that it is a separate language.

Actually, ISO 639-3 is based on Ethnologue, which treats many of what most linguists regard as dialects as separate languages. For example, there are no less than 35 different codes for variants of Quechua! So having an ISO 639 code does not seem like a wise requirement if that's your goal.

Things like Template:nav require a code. Just make one up for languages that lack one, or what? --Ptcamn 06:39, 13 August 2006 (UTC)

I wasn't aware that ISO 639-3 was so focused on only living languages. It includes, for example, "enm" for Middle English. To help explain the problem with using ISO 639-3 as a guideline, please list some languages that lack an ISO 639-3 code that you think we should include in Wiktionary. Rod (A. Smith) 09:15, 13 August 2006 (UTC)

Egyptian has a code (egy). And we have some words here. See (e.g.) pr-aA (they have to be under their pronunciation code; there isn't any effective Unicode support for the hieroglyphs yet, although this has been discussed for a long time (>15 years)). A number of linguists avoid the term "dialect" because of the variant uses; using language group, language, and language variant. The Quechua language group has a lot of codes because they are mutually incomprehensible. The Chinese language group has more than are coded (14); some of this is political: the reason Mandarin, Cantonese, Min Nan and others were forced into one code (zh) in IS 639-1 is primarily that the PRC rapporteur insisted that there was only one China, with one language: PRC Standard Mandarin, in Standard (simplified) Script. Thus we have zh-tw ... So what languages are needed? Very curious. Robert Ullmann 11:24, 13 August 2006 (UTC)

Among Australian languages, there's at least Andjingith, Aritinngithigh, Barranbinja, Bigambal, Bungandik, Cheangwa, Dappil, Darkinjung, Dharuk, Dhudhuroa, Djirringanj, Gabi-Gabi, Giya, Gudang, Gundungurra, Guwa, Guwar, Karlamay, Kaurna, Kermain, Kok Narr, Kok Thaw, Kolakngat, Kuurn-Kopan-Noot, Luthigh, Maljangapa, Manbara, Marrett River language, Mbiywom, Midhaga, Minkin, Mirning, Muk-thang, Nana-karti, Natingero, Ngardi, Ngarigo, Ngaro, Ngayawang, Ngaygungu, Ngkoth, Nhanta, Ogh-Undjan, Pallanganmiddang, Pirriya, Popham Bay language, Takalak, Thawa, Unwinjmil, Wadha-wurrung, Walangama, Wangka-yutjuru, Warndarrang, Wemba-Wemba, Witjaari, Wuthati, Wuy-wurrung, Yabala-Yabala, Yanda, Yangara, Yaygirr, Yinwum, Yirandhali, Yitha-Yitha, Yota-Yota, and Yuyu. I don't know about other continents. --Ptcamn 00:38, 14 August 2006 (UTC)

Wow. And here I thought Ethnologue was excessively comprehensive. (Where'd you get all those?) —scs 13:02, 14 August 2006 (UTC)

Yes, it's exhaustive. It includes many languages with fewer than 10 speakers. For any that we choose to include in Wiktionary, we can assign an ISO 639-3 local use code, i.e. any of the 520 codes from "qaa" to "qtz", inclusive. Rod (A. Smith) 16:49, 24 August 2006 (UTC)
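As a quick check on the figure of 520, the local-use range can be enumerated directly: the first letter is fixed at q, the second runs from a to t (20 letters), and the third from a to z (26 letters). The short Python sketch below does only that; the function name is made up for illustration.

```python
from string import ascii_lowercase


def local_use_codes():
    """List every three-letter code from "qaa" to "qtz", inclusive."""
    # Second letter: a..t (20 choices); third letter: a..z (26 choices).
    second_letters = ascii_lowercase[: ascii_lowercase.index("t") + 1]
    return ["q" + b + c for b in second_letters for c in ascii_lowercase]


codes = local_use_codes()
print(len(codes), codes[0], codes[-1])  # 520 qaa qtz
```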

List of Common Chinese Character used at zh.wikipedia

Can I create a list of Common Chinese Characters (zi) that are used in zh.wikipedia? Something like this ...

No.	stats	Character
1	1,420,772	的
2	987,530	年
3	460,732	一

My purpose is to find out which Chinese characters we are supposed to concentrate on, adding entries for any that have not yet been created. If this is permitted, which namespace can I put it in?

I used a perl script that searched for all the Chinese Characters used in the Chinese Wikipedia, using the data dump "zhwiki-20060808-pages-articles.xml". Such data was processed using Linux's sort and uniq commands, and by another perl script that searched for the han characters in the list. The han list is based on "Unihan.txt" of the Unicode Website, from the list of KBigFive, KGB0 and KGB1. -- Hiòng-êng 16:06, 13 August 2006 (UTC)
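For anyone who wants to reproduce a count like this, here is a rough sketch of the same idea in Python. It is not the perl script described above. As a stated simplification, it treats any character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as a Han character instead of checking against the kBigFive, kGB0 and kGB1 lists from Unihan.txt, and the function names are invented for the example.

```python
from collections import Counter


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block is counted.
    return "\u4e00" <= ch <= "\u9fff"


def count_han(lines):
    """Tally Han characters across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if is_han(ch))
    return counts


def format_table(counts, limit=3):
    """Render the top entries in the No. / stats / Character layout."""
    rows = ["No.\tstats\tCharacter"]
    for rank, (ch, n) in enumerate(counts.most_common(limit), start=1):
        rows.append(f"{rank}\t{n:,}\t{ch}")
    return "\n".join(rows)


# Tiny made-up sample; a real run would stream the dump file line by line.
sample = ["的的年 abc", "的一"]
print(format_table(count_han(sample)))
```

To run it on the dump, one would open the XML file with encoding="utf-8" and pass the file object to count_han, which reads it one line at a time and so avoids loading the whole dump into memory.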

This would be a very useful Appendix (or perhaps even Concordance? Is zh:wikt a "work of literature"? ;-)). I'd suggest you go look at those. How big is it? (Theoretically, about 15-20 thousand lines, too big to be one entry. In practice?) I don't think you are going to find very many entries to add, there are 17,791 apparently already there, but there are several reasons this may be more useful than you think. If you can break it into reasonable pieces, I don't think anyone will object if it appears as Appendix:Chinese Wikipedia Frequency (something); the worst that can happen is that we decide to delete it. Robert Ullmann 20:23, 13 August 2006 (UTC)

Actually, I used zh:w, so that those who want to study the words here and are interested in reading zh:w would be able to do so. Yes, it has shortcomings: the list does not separate trad and simp characters. This is because zh:w, I think, uses both forms depending on which country contributors belong to, or which character set they use. -- Hiòng-êng 00:29, 14 August 2006 (UTC)

Hiòng-êng, if I may offer an alternative: Please take a look at Wiktionary:Useful links/Chinese. On this page, I provide a link to the HSK list of 8840 words along with instructions for how to retrieve the entire list of 8840 words. The HSK (Template:zh-ts) list is maintained by the HSK committee, and is used to create the Chinese proficiency test that is used to rate non-native speakers who wish to study at Chinese universities. It divides each character or word (both Template:zh-ts and 字) into beginning, elementary, intermediate and advanced (甲乙丙丁). The HSK committee has based these ratings on several years of statistical analysis of usage frequencies of Chinese words (similar to the list you want to create). I have created a category on English Wiktionary called Category:Chinese by difficulty level so that words on the list may be easily identified. I think your idea is a good one. However, if you just want a list of words so that you know which words need to be worked on first, I would suggest the HSK list. By the way, the HSK list ranks 的, 年 and 一 as Beginning. If they are not already, they will eventually be listed in Category:zh:Beginning Mandarin (Pinyin), Category:zh-tw:Beginning Mandarin (traditional) and Category:zh-cn:Beginning Mandarin (simplified). I had been considering putting the HSK list into a Wiktionary appendix. I would have no objections if somebody did it for me:)

A-cai 23:03, 13 August 2006 (UTC)

One last note, if you're interested in a printed dictionary for HSK words, I recommend HSK汉语水平考试词典 (→ISBN). It also lists part of speech (a rarity for Chinese dictionaries) and provides lots of example sentences. A-cai 23:11, 13 August 2006 (UTC)

Can we use the HSK list without incurring a copyvio? -- Hiòng-êng 00:29, 14 August 2006 (UTC)

I'm not a lawyer, so I will only give you my own opinion. The HSK list has been used in numerous books, dictionaries etc., and is also available in its entirety online. My guess is the list per se is not copyrighted, but perhaps specific versions of the list might be (for example, if you copied it from a pretty table with certain definitions or something). I think it should not be a problem to simply have a list like:

Beginning
一

年

的

etc.
Elementary
etc.
Intermediate
etc.
Advanced
etc.

or possibly split it into four separate lists. Are there any copyright lawyers reading this that can help us out? A-cai 10:21, 14 August 2006 (UTC)

Hi, I'm a copyright lawyer. :-D I happen to have a book on intellectual property in China... somewhere. Need to find it. However, here's the Copyright Law of P.R. China, which says:

Article 5　This law shall not be applicable to:

(1) laws; regulations; resolutions, decisions and orders of state organs; other documents of legislative, administrative and judicial nature; and their official translations;

(2) news on current affairs; and

(3) calendars, numerical tables, forms of general use and formulas.

Since the HSK is ultimately a government document, I'd call it safe, but it can't hurt to splice in some words from other sources for a more "comprehensive" (and clearly transformative) end product. bd2412 T 16:26, 15 August 2006 (UTC)

BD2412, thanks for responding. One way we could distance ourselves from the original HSK list is by including a regional comparison of words in our version. Since the HSK committee is PRC based, the list tends to (understandably) favor PRC Mandarin. Nothing surprising there. However, we could add regional variations, for example: 自行车 (bicycle) is rated by the HSK list as beginning Mandarin, but 腳踏車 (bicycle) is not even on the list. This is because 自行车 is standard Mandarin, but 腳踏車 is only used in Taiwan. So in our Wiktionary version of the list, we could do something like:

Beginning
- 星期 (Taiwan: 禮拜)
- 自行车 (Taiwan: 腳踏車)
Intermediate
- 礼拜

That was just one idea I had. Do you think something like that would distance it enough? Of course, we would have to start out with the raw list, then add the regional variations by hand. A-cai 22:29, 16 August 2006 (UTC)

Please take a look at Appendix:HSK list of Mandarin words. Hiòng-êng, would something like this be useful for you? BD2412, what do you think?

A-cai 13:56, 17 August 2006 (UTC)

Excellent! I like the list, but it makes for a huge page - I suggest following through with your earlier proposal to break it into four sub-pages. I've taken the liberty of adding the public domain rationale to the template. bd2412 T 14:48, 17 August 2006 (UTC)

No problem :) I will work on getting it into four separate lists. I will also try to take some time every now and again to more thoroughly proof the list (I noticed an error already). It is a large list, so it will take some time to go all the way through it, but I think it will be worth it to make sure it is correct. I'm guessing from what I've looked at so far, that there should only be a few minor errors here and there. On the whole, it should be fairly accurate.

If there are any other fluent Mandarin speakers out there that can help with the proofing, please go ahead. I used Dr. Eye to convert the Simplified into Traditional. So, I'm mainly looking to make sure that it did it in a way that will be helpful to Wiktionary users. For example, I noticed it converted 计划 to 計畫. Not technically incorrect, but I believe 計劃 is better, because it is easier to see the connection between 计划 and 計劃. I guess it's a little subjective on my part.

A-cai 15:18, 17 August 2006 (UTC)

Hiòng-êng: would you please add your list as an Appendix: this and the HSK list are not mutually exclusive alternatives; and I'd like to see your list. Note that it needs to be broken up somehow into reasonable parts; maybe you might start with the first 2000 or so as one entry? Would be very interesting. Robert Ullmann 19:23, 17 August 2006 (UTC)

I have split the Appendix:HSK list of Mandarin words into four separate pages. I also added section headers for each letter.

A-cai 00:10, 18 August 2006 (UTC)

Will do. I am currently figuring out my format at zh-min-nan.wikt. For now I have broken it up into pages of 1,500 words. The whole list would be about 15,000 words, but about half of these are used fewer than 100 times, and most of those are used only once. I will lump the characters that occur only once into, say, two pages; I still need to think about those in between. The file at nan is zh-min-nan:Sek-ín:Hôa-gú Wikipedia Siông Ēng ê Jī -- Hiòng-êng 05:46, 18 August 2006 (UTC)

Very interesting. Do you plan to update the list periodically as people continue to contribute to Wikipedia?

For non-Min Nan speakers, the English translation to the link that Hiòng-êng provided: ] (])

A-cai 06:17, 18 August 2006 (UTC)

Probably yearly. I am using dial-up, and zh.w takes many hours to download. I'm also running pygmylinux (upgraded with some Slackware 7.1 components and a GUI) on a Cyrix 200-something (PII compatible, I think), which (I'm guessing here) slows down my perl script, so it takes about as much time again to process the dump and create my list.

I would like to suggest the use of columns in the HSK list. It has the advantage of making the file look shorter... at least in GUI browsers.

-- Hiòng-êng 06:40, 18 August 2006 (UTC)

Good suggestion about the HSK list. I will put it on my list of things to do (unless somebody beats me to the punch)!

A-cai 12:36, 18 August 2006 (UTC)

Beat you to the punch! :-) bd2412 T 15:33, 19 August 2006 (UTC)

I have moved the file from Index to Appendix here. When I am finished with it, I will create the Appendix here. -- Hiòng-êng 03:45, 19 August 2006 (UTC)

This is an interesting development, though my inclination would be to suggest that the zh-wikisource material might be more appropriate for this. As the process becomes more sophisticated this could give a basis for the statistical analysis of texts, that reflects such things as the use of language in a given historical period.

Frequency data should be normalised. This makes the results independent of the size of the corpus. Thus "occurrences of the word x 100,000 / total words in text" will give the frequency of the word per 100,000 words.

The copyright issue does not mean only that the material is in the public domain, but that it is uncopyrightable. This may have as much to do with the fact that we are dealing with the information rather than the way in which it is expressed. Unfortunately, the HSK article in Wikipedia does not give any historical background to the issue.

Is there a reason why the lists are in Roman alphabetical order and not traditional Chinese character order? Eclecticology 23:17, 19 August 2006 (UTC)

Glad you asked. Yes, the HSK list is maintained by an official organization in the People's Republic of China. Pinyin is the official Romanization for Mandarin in the PRC. The default sort order for Mandarin words in the PRC is Pinyin sort order. However, if somebody were ambitious enough to create a separate page based on radical/stroke order, I would have no objections. If there are any enterprising PHP script programmers out there, perhaps a script could be written to allow the user to dynamically choose the sort order?

A-cai 01:12, 20 August 2006 (UTC)
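The idea of letting the reader choose the sort order can be sketched as follows. This is only an illustration in Python rather than PHP: the `Entry` type, the `SORT_KEYS` table and the sample words are made up for the example, sorting numbered Pinyin as plain strings is a crude stand-in for real Pinyin order, and Unicode code point order is used as a rough stand-in for radical/stroke order (a proper radical/stroke table would be needed for the real thing).

```python
from typing import NamedTuple


class Entry(NamedTuple):
    hanzi: str            # headword in Simplified Chinese
    pinyin_numbered: str  # Pinyin with tone numbers, e.g. "tian1ti3"


# Illustrative sample only, not taken from the HSK list itself.
ENTRIES = [
    Entry("天体", "tian1ti3"),
    Entry("计划", "ji4hua4"),
    Entry("词典", "ci2dian3"),
    Entry("图书馆", "tu2shu1guan3"),
]

# Each sort order is just a different key function over the same entries.
SORT_KEYS = {
    "pinyin": lambda e: e.pinyin_numbered,
    # Assumption: code point order only loosely follows radical/stroke order.
    "codepoint": lambda e: [ord(ch) for ch in e.hanzi],
}


def sort_entries(entries, order="pinyin"):
    """Return the entries sorted by the named order."""
    return sorted(entries, key=SORT_KEYS[order])


for order in SORT_KEYS:
    print(order, [e.hanzi for e in sort_entries(ENTRIES, order)])
```

Adding another order would then only mean adding another key function, without touching the list itself.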

Frequency List

I think using Wikisource would be better, but I do not know how to separate the material by period. If anyone can separate the data for me, or at least provide a list of articles/pages per historical period, then I will try to adjust my Perl script to do the task at hand.

Unfortunately I'm not much help when it comes to technical problems; somebody more technically minded may need to experiment with this. At the beginning one might explore single articles by stripping out the meta material, and doing a sort and count on what's left. I suspect that this may even be easier for Chinese characters than it is for English because of the limited number of characters available. Extending this for 2 character "words" could be a later step.
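A rough sketch of the "sort and count" step, in Python rather than the Perl script mentioned above. It assumes the meta material has already been stripped, and that a "Chinese character" means anything in the main CJK Unified Ideographs block; the function names and the sample sentence are invented for the illustration.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def char_counts(text: str) -> Counter:
    """Count each Chinese character, ignoring punctuation, Latin text, etc."""
    return Counter(ch for ch in text if is_cjk(ch))


def pair_counts(text: str) -> Counter:
    """Count adjacent two-character sequences made only of Chinese characters."""
    return Counter(a + b for a, b in zip(text, text[1:]) if is_cjk(a) and is_cjk(b))


# Made-up sample text, already free of markup.
sample = "图书馆里有词典,图书馆里也有计划。"

# most_common() does the "sort" part: highest counts first.
for ch, n in char_counts(sample).most_common(3):
    print(ch, n)
for pair, n in pair_counts(sample).most_common(3):
    print(pair, n)
```

Counting pairs this way does not know where real word boundaries are, so it would over-count sequences that merely happen to sit next to each other.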

Regarding normalization: can you give an example? I am not familiar with this concept. Sorry.
-- Hiòng-êng 01:22, 21 August 2006 (UTC)

Normalisation is a technique for comparing word frequency in texts of different length. A word that occurs 1532 times per hundred thousand words can be expected statistically to appear 306 times in a text of 20,000 words or 18,958 times in a text that is 1,237,451 words long. To calculate it you need to know the total number of words in your text. Eclecticology 04:36, 21 August 2006 (UTC)
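The calculation described here can be written out in a few lines. This is a minimal sketch in Python; the function names are invented for the illustration, and the only inputs are the raw count and the total number of words in the text.

```python
def per_100k(occurrences: int, total_words: int) -> float:
    """Normalised frequency: occurrences per 100,000 words of text."""
    return occurrences * 100_000 / total_words


def expected_count(rate_per_100k: float, text_length: int) -> float:
    """Expected number of occurrences in a text of the given length."""
    return rate_per_100k * text_length / 100_000


# A word at 1532 per 100,000 words, in a text of 20,000 words:
print(round(expected_count(1532, 20_000)))

# Going the other way: 306 occurrences in 20,000 words, as a rate per 100,000:
print(per_100k(306, 20_000))
```

Because both directions divide by the corpus size, counts taken from Wikipedia and from a much smaller text can be compared directly once they are expressed per 100,000 words.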

HSK List

Regarding the public domain question: I think it can be used in any way, commercially or in open source. Am I correct here?
-- Hiòng-êng 01:22, 21 August 2006 (UTC)

You're probably right, but it would be nice to have enough historical background to be prepared for any argument that might come along. Eclecticology 04:17, 21 August 2006 (UTC)

Diagrams for kinship terms

I think it would be useful to have diagrams of family trees, with a particular link highlighted, to illustrate the meaning of kinship terms, especially those that have no simple equivalent in English. But I don't have a means to produce these myself.

Would anyone like to create these, and upload them to Commons? --Ptcamn 08:26, 14 August 2006 (UTC)

The appropriate category there would be Category:Family trees. --EncycloPetey 01:23, 20 August 2006 (UTC)

Redirects

Discussion moved to Wiktionary:Votes/2006-08/Redirects.

Wonderfool's sockpuppets on Wikipedia

Could someone familiar with the Wonderfool/Dangherous rampage please file a sockpuppet report on Wikipedia at w:WP:SSP or w:WP:RFCU? It seems that this user has some active sockpuppets there as well. One is w:User:Dangherous. Another possible sock could be w:User:Brandnewuser. By the way, this user's main username on Wikipedia changed from Wonderfool to w:User:Thewayforward. I do not know what happened in the rampage except for the deletion of the main page, so I do not think that I would be able to write a report there that will make an administrator know why these socks should be blocked. Vildricianus added the text "]" to the suspected sockpuppets' user pages and made a report at w:WP:AN/I, but those were not effective methods of reporting sockpuppets. Please do not post sockpuppet accusations in w:WP:AN or w:WP:AN/I, because those locations are for problems that no other place exists to handle, while places like w:WP:SSP or w:WP:RFCU exist for that purpose. Filing a report at w:WP:SSP can be lengthy, though. I have done this before, and this works. Thank you. Jesse Viviano 17:58, 16 August 2006 (UTC)

Done. w:Wikipedia:Suspected sock puppets/Wonderfool. --Connel MacKenzie 00:56, 17 August 2006 (UTC)

I changed the link above to make a permanent link to the sock puppet case on Wikipedia. All administrators who know about the Dangherous rampage and Wonderfool case in general, please add evidence to this page. Jesse Viviano 22:01, 19 August 2006 (UTC)

Newbie question

I'm new here. Am I to understand correctly that wiktionary has separate entries for every form of a word (e.g. dance, dancing, danced, etc.)? What is the purpose of this? Isn't this system pretty redundant? --Fang Aili 20:50, 16 August 2006 (UTC)

But these words may also mean things in languages other than English. Widsith 20:54, 16 August 2006 (UTC)

That doesn't make sense to me. Could you possibly direct me to the appropriate Wiktionary policy page about this? --Fang Aili 21:07, 16 August 2006 (UTC)

The policy is ‘all words in all languages’. Look at the page for sang. This is a word in English (the past tense of sing). But it also means ‘blood’ in French and Catalan, and ‘song’ in Danish and Norwegian. All of these belong on the one page, which is designed to show what this specific collocation of letters represents in every language. You may not think that danced has any meaning beyond that explained on the dance page, but we can't rule it out. Widsith 21:18, 16 August 2006 (UTC)

But by not posting danced on the dance page, are you not leaving out information? Or should the dance page, in theory, supply "see also" links to each form of the word? How about creating one dance page with all its derivatives (with redirects pointing to "dance"), and if "danced" means something else in another language, then that word gets its own page? This system seems to create mounds of work where it is not necessary. But I guess I should stop talking about it because it's not like it's going to change. --Fang Aili 21:24, 16 August 2006 (UTC)

The rules/guidelines can be found at Entry layout explained. The "inflection line" following the part-of-speech heading is wikified to the other forms of the word. "No redirects" is a pretty good rule of thumb when it comes to this type of entry. --Connel MacKenzie 21:38, 16 August 2006 (UTC)

Unfortunately, Entry layout explained doesn't explain how to write the "inflection line". The simple example uses '''bed''' instead of {{en-noun}} for example. SemperBlotto 21:45, 16 August 2006 (UTC)

I've had a go at updating the simple example to show use of the en-noun template. I've also added a line encouraging links from the key words in a definition. --EncycloPetey 01:19, 20 August 2006 (UTC)

When a simple example begins playing around with templates it is no longer simple. It should be made clear that plain language is always acceptable for this.

As for the rationale behind having separate pages for the different inflections, one also has to assume that the reader has very little knowledge of the language that he's looking up. This is less of a problem in English than it is with more highly inflected languages. I have often looked up words in a paper German dictionary, and couldn't find them very easily because I was unfamiliar with the inflections. Eclecticology 03:58, 20 August 2006 (UTC)

I've no objection to that, but we might consider having an advanced layout page that addresses the issues of templates and categories, at the least.

Nor do I disagree with the rationale behind separate pages for the inflections. However, I think we're approaching a point where we'll need to have a major discussion about format for non-lemmata entries in highly inflected languages (like Latin, Greek, and such). I'm not quite ready to open that discussion however. --EncycloPetey 03:12, 21 August 2006 (UTC)

Fair enough. Major questions should always be open to revision to prevent people from becoming fixated on one single way of doing things. As for the other languages, what is often most needed to develop those ideas is someone with a good understanding of the language in question. Eclecticology 03:25, 21 August 2006 (UTC)

CJKV templating

We now have (after much work with A-Cai, thank you!) POS templates for Min Nan and Mandarin, to be used under the POS headers in the Min Nan and Mandarin language sections of pages.

They are nan-(noun, adj, verb), and cmn-(noun, adj, verb, idiom). They categorize in the existing nan*: and zh*: categories, as well as in categories consistent with the structure for all other languages.

Cantonese, Japanese and Korean will follow soon. The Han character templates (Han char, Han ref) need a bit of work, but can be used in the Translingual section (the work will just change the way optional parameters are used, or not used). Sorry if this note is a bit terse, it is after midnight here in Nairobi. Robert Ullmann 21:45, 17 August 2006 (UTC)

I checked the file 天体. It uses cmn-noun template -- {{cmn-noun|s|pin=tiāntǐ|pint=tian1ti3|tra=天體|sim=天体|rs=大01}} which produced 天体 (simplified, Pinyin tiāntǐ, traditional 天體).

My problem is the pint option. I tried to search for tian1ti3 but it was not found; maybe the file is just new, so it cannot be found by the search engine yet. Or maybe there is a flaw in the template? The display does not show the pint, nor does the HTML source (or at least I did not find it). If it is not supposed to be displayed, maybe it needs to be enclosed with  html tags?

-- Hiòng-êng 02:18, 25 August 2006 (UTC)
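To make the parameter question concrete, here is a small sketch of how the template call quoted above breaks down into positional and named parameters, and of a headword line that shows only some of them. It is written in Python purely as an illustration and is not how the real template is implemented: `parse_template` and `headword_line` are invented names, the parser assumes no nested templates or piped links, and reading the positional `s` as "headword is simplified" is an assumption based on the quoted output.

```python
def parse_template(call: str) -> tuple[str, list[str], dict[str, str]]:
    """Split '{{name|a|k=v}}' into (name, positional args, named args)."""
    inner = call.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template call")
    name, *parts = inner[2:-2].split("|")
    positional, named = [], {}
    for part in parts:
        key, sep, value = part.partition("=")
        if sep:
            named[key.strip()] = value.strip()
        else:
            positional.append(part.strip())
    return name.strip(), positional, named


def headword_line(call: str) -> str:
    """Hypothetical rendering that mirrors the output quoted above."""
    _, positional, named = parse_template(call)
    if positional[:1] != ["s"]:
        raise NotImplementedError("only the simplified-headword case is sketched")
    # pint= and rs= are deliberately not shown; they stay in the wikitext only.
    return f"{named['sim']} (simplified, Pinyin {named['pin']}, traditional {named['tra']})"


call = "{{cmn-noun|s|pin=tiāntǐ|pint=tian1ti3|tra=天體|sim=天体|rs=大01}}"
print(headword_line(call))
```

In this sketch the numbered Pinyin is still present in the page source (the `call` string) even though the rendered line never shows it.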

I think the search engine will not find it yet because I entered that word today, so it is too new. The pint option is for Pinyin sorting of categories; it was not intended that it should be shown in the entry itself. However, if you would also like to see the Pinyin with numbers in the entry, we could discuss where in the entry it should be located. I definitely would not want it just below the part of speech with the rest. The reason is that once you get longer entries (ex. 先天下之憂而憂後天下之樂而樂), it becomes distracting. Try typing ci dian or ci2dian3 in the search box. It should find something. Note, if you try searching for cidian, it will return too many false hits. Also, if you type the wrong tones, it doesn't seem to matter. Search on ci1dian1, and you should get the same thing as ci2dian3. Perhaps, the wiki software people can fix this in the future. As far as the  tags, I will let Robert address that issue, since he wrote the template.

A-cai 06:29, 25 August 2006 (UTC)

I think it is just right that the numbered pinyin is not displayed. Let me re-phrase my question. Will the search engine find pint=tian1ti3 even if it's not displayed in html form? If the search engine looks at the wiki source itself, then there is no need to use the  tags.

-- Hiòng-êng 07:08, 25 August 2006 (UTC)

I think it should find it, but I'm not sure. I'll let Robert answer. A-cai 09:34, 25 August 2006 (UTC)

The search function works on the wikitext, not on what is displayed. So, for example, searching on "simplified" will find pages that have that in the text, not every page that uses (e.g.) zh-forms (which would be useless). So the pint= option should make the tone forms available to the search. However, the numbers have an odd effect, as A-cai notes, and I haven't figured out exactly what the MediaWiki software does. Seems to me that tian1ti3 should be one word? I am looking into it. There may be MediaWiki options not set correctly for en.wikt. Robert Ullmann 12:10, 25 August 2006 (UTC)

I asked a more technical version of the question on WT:GP. The en.wikt is not using the default behavior of MySQL FULLTEXT, which would work fine here. Trying to find out why and how. Robert Ullmann 12:52, 25 August 2006 (UTC)

Allow me to be bold and attempt to predict what users may be looking for with respect to Pinyin searches. I'm guessing that the vast majority of Westerners would prefer to look up words in Pinyin without using tones, since many Westerners have trouble remembering tones. Native Mandarin speakers may be more comfortable searching with tones included because it is less vague. However, typing the diacritics is a pain, so they would probably want to use numbers to represent tones. In the case of the Mandarin word for dictionary, a native speaker might prefer to type ci2dian3 and hit search (actually, who am I kidding, they would probably just type the Chinese characters ;-), whereas a Westerner would most likely prefer to type cidian when searching. It is nice that Wiktionary allows users to search for ci dian and find the word. Ultimately, I think we will need to provide users with a way to type cidian and find the word. Would this involve a modification to the template or would it involve a modification to wiki software?

A-cai 13:19, 25 August 2006 (UTC)
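One way to picture the "type cidian and find the word" idea is a single toneless search key to which every Pinyin spelling is reduced. The sketch below is Python and purely illustrative: `search_key` is an invented name, and nothing here describes what the wiki software or the templates actually do. Note that it also folds ü into u, which a real implementation might not want.

```python
import re
import unicodedata


def search_key(pinyin: str) -> str:
    """Reduce a Pinyin spelling to lowercase letters without tones or spaces."""
    # Decompose accented letters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFD", pinyin)
    # ... then drop the combining marks (tone diacritics, and the dots of ü).
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Finally drop tone numbers, spaces and anything else that is not a-z.
    return re.sub(r"[^a-z]", "", no_marks.lower())


for spelling in ["cídiǎn", "ci2dian3", "ci1dian1", "ci dian", "cidian"]:
    print(spelling, "->", search_key(spelling))
```

All five spellings collapse to the same key, so a lookup on that key would behave the same whichever form the reader typed, at the cost of returning every word that shares the toneless spelling.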

Wiktionary Romanization standards for Japanese

Since Japanese entries do not specifically mention the method of Romanization, would it be appropriate to come to a consensus as to which Romanization should be standard? I think this would be good for the sake of consistency. Take the word 道場 (どうじょう) for example. Some people on Wiktionary will Romanize it as doujou, and some as dōjō.

I prefer dōjō (Hepburn), because that is probably the most widely accepted standard. I don't see many books that spell it doujou, despite the fact that doujou more closely matches the Japanese kana.

Am I being too nitpicky? Does anyone else have an opinion on the matter?

A-cai 14:28, 18 August 2006 (UTC)

I agree with the Hepburn system. Using 'ō' instead of 'ou' is like using 'o' instead of 'wo', 'shu' instead of 'shyu' etc, in that it matches the actual pronunciation. Someone new to Japanese might not realize that an elongated 'o' isn't pronounced as if it were two different vowels. Ric | opiaterein 14:41, 18 August 2006 (UTC)

Very apropos, as I was just about to start working on ja-noun (note, not Template:janoun which is the more common right now.) Revised Hepburn would seem to be the preferred system. It would be helpful if anyone who notes a use of the janoun template that is not in a shinjitai entry (i.e. is in a kyūjitai entry), or doesn't have hiragana as the first parameter and romaji as the second, would tell me. Robert Ullmann 15:02, 18 August 2006 (UTC)

In my opinion, we shouldn't pick a preferred romanization system at all, any more than we should pick between American and British English. I say list all the romanization systems in use, in a ===Alternative spellings=== subsection. --Ptcamn 05:15, 19 August 2006 (UTC)

I've replaced Template:ja-noun with a completely new version. (!) It is/was only used in three places so far: 場, 時間, タイムアタック. The new version allows specifying the form of the headword and doing some things based on that. I think janoun might be redefined in terms of ja-noun, but it depends on how many entries are not "shinjitai (hiragana, romaji)" which is the usual case. Tell me if you see any. Comments? Robert Ullmann 15:53, 18 August 2006 (UTC) I still need to do work in the category structure, e.g. this template puts a hiragana noun in category:Japanese nouns (sorted by hiragana/katakana) but not yet in category:Hiragana.

Robert, the word for library is a good example:

shinjitai	図書館
Simplified Chinese	图书馆
kyujitai/Traditional Chinese	圖書館

A-cai 00:58, 20 August 2006 (UTC)

Ptcamn, I agree that we should list all valid Romanizations. However, I believe that it is important that we either label the system of Romanization OR agree on a "default" Romanization system. Yes, in English we have color/colour. However, we also tell the reader that color is the U.S. spelling, and colour is the British spelling. If we didn't, non-native English speakers might get the impression that either may be used regardless of the circumstances (which I don't think is true).

I have always thought that:

東京 (とうきょう, Tōkyō)

is not a good idea. I think:

東京 (hiragana とうきょう, Hepburn Tōkyō)

or

東京 (hiragana とうきょう, Wāpuro rōmaji Toukyou)

would be better. We should not assume that everyone in the world that might come across a Japanese entry would automatically know that とうきょう is hiragana. For that matter, we should not assume that the average reader even knows what hiragana is (which is why I linked it to a definition page in the above example). I favor Hepburn as a default because, as has been pointed out in the Wāpuro rōmaji article:

One problem with the wāpuro-style representation of long vowels is that the distinction between 'ou' and 'oo' is not supported by Japanese pronunciation. The wāpuro-style distinction is based on kana usage, which represents the same sound (long 'o') as おう or おお, depending on the word. These different kana spellings are simply an isolated survival of historical usage. On the other hand, the kana spelling おう is pronounced in two different ways: as 'ō' in the meaning 'king' (王), and as 'ou' in the meaning 'to chase' (追う). Being based on hiragana, wāpuro style writes both these words as 'ou', ignoring the difference in pronunciation.

It's not a question of which is correct. It's a question of which we think should be the default. If we can't agree on a default, can we at least label which one we are using so that others may have a frame of reference? A-cai 15:19, 19 August 2006 (UTC)

We need a default so that we can category sort; as we use Pinyin for Mandarin, POJ for Min Nan, Jyutping for Cantonese, we should be consistently using (Revised) Hepburn for Japanese. Alternate spellings as well in the romaji entries are of course fine. Robert Ullmann 15:44, 19 August 2006 (UTC)

That's not much of an argument in favour of Hepburn, A-cai. Wāpuro may not distinguish 王 and 追う, but Hepburn doesn't distinguish 刀 (とう) and 十 (とお). Wāpuro ignores a difference in pronunciation, Hepburn ignores a distinction in writing. Further, neither of them represents pitch accent. If you really want to know how words are pronounced, put it in IPA in the ===Pronunciation=== section. --Ptcamn 16:11, 19 August 2006 (UTC)
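The trade-off between the two systems shows up clearly if one tries to convert wāpuro spellings to Hepburn mechanically. The following is a deliberately naive Python sketch, not a real romanization converter: the function name and the three-entry mapping are invented for the illustration, and it handles long o and u only.

```python
# Naive mapping of wāpuro long-vowel spellings to Hepburn macrons.
LONG_VOWELS = {"ou": "ō", "oo": "ō", "uu": "ū"}


def wapuro_to_hepburn_naive(romaji: str) -> str:
    """Replace wāpuro long-vowel digraphs with macron vowels, blindly."""
    for digraph, long_vowel in LONG_VOWELS.items():
        romaji = romaji.replace(digraph, long_vowel)
    return romaji


for word in ["doujou", "Toukyou", "tou", "too", "ou"]:
    print(word, "->", wapuro_to_hepburn_naive(word))
```

The output for `tou` and `too` is identical, which is the written distinction that Hepburn drops. The bare `ou` always becomes a long vowel, so the sketch cannot tell 王 from 追う: a spelling-based conversion has no access to the pronunciation difference, and a correct converter would need a word list or the reading of each entry.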

Ptcamn, touché! However, the above quote from the Wāpuro rōmaji article was only a minor supporting point. A better argument would be that Hepburn is and has been the de facto standard for Romanizing Japanese in much of the English speaking world since before WWII. Therefore, it makes sense to have it as the default.

A-cai 23:58, 19 August 2006 (UTC)

My inclination after reading all this is to make the Revised Hepburn Wiki-canonical for Romaji. Romanization is not pronunciation despite the affinity which those two concepts may have. This should not prevent "see" or "see also" references from other known variations of romanization. The ideal is to make information available to the greatest number of people with the least amount of difficulty. This involves trying to find a balance between the standards adopted by those most accustomed to dealing with the language, and making the language accessible to those who know nothing about it.

The only serious difficulty for the newcomer is the use of macrons. The compromise of replacing macrons with circumflex accents to put the Romaji within ISO 8859-1 does not solve the problem for those who have a complete phobia about diacritics. This may lead to some kind of disambiguation-like references on all pages which have variant accent patterns, whether for Romaji macrons or for any other language or transliteration. Eclecticology 20:31, 19 August 2006 (UTC)

I am strongly in favor of Revised Hepburn. The fact that it does not mirror the native spelling as well as Wāpuro is only a factor where the original script is not provided. Here we provide the original kanji and kana writing, and the Romanization is only for pronunciation. Anyone who needs to know whether a long ō is with う or お has merely to look at the original writing provided. Besides that, I have almost never seen Wāpuro used anywhere in my life; the Hepburn system is virtually universal. The only variations are whether to write ō or ô, and whether to indicate the accent. To me, the most useful dictionaries show the accent: e.g., 酒 saké (rice wine) vs. 鮭 sáke (salmon); 花 haná (flower) vs. 鼻 hána (nose). —Stephen 21:40, 19 August 2006 (UTC)

Comment: Please make sure that this discussion is copied to Wiktionary talk:About Japanese when it's completed! This is useful discussion to refer to. --EncycloPetey 01:15, 20 August 2006 (UTC)

Language standards for categories

Another consistency thing. We are currently all over the map with respect to categories and languages. Some categories are predominantly "spelled out language + category name" (ex. Category:Nouns by language), others predominantly use ISO-639 codes (ex. Category:Anatomy) and still others mix and match (Category:Proverbs). I had spent a lot of time converting all of the categories to use the ISO codes because I was under the impression that Wiktionary was moving in that direction.

In the case of Mandarin, we now find ourselves with three choices:

Category:zh:...
Category:Mandarin:...
both Category:zh:... and Category:Mandarin:... but each with slightly modified organization.

Before we get too much further along, I wanted to see if anybody was brave or foolish enough to attempt to formulate a policy about which way to go. I realize this issue may have been raised in the past. However, I think it might be time to bring it up again. Any suggestions? A-cai 03:04, 19 August 2006 (UTC)

See discussion in WT:GP. What we are moving toward (and are mostly at; this isn't completely new) is that POS categories (Nouns, etc) use the language name (Category:Mandarin nouns), and topic categories use the code prefix when not English (Category:ja:Horses, a subcategory of Category:Horses). One reason for the code prefix is that "Japanese horses" wouldn't make sense; it would have to be "Japanese words about horses", so ja:Horses is easier. As is pointed out in the GP discussion, "it:Nouns" should be a topical category, containing Italian words about nouns ;-) Robert Ullmann 12:09, 19 August 2006 (UTC)
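The convention described here is simple enough to state as two small rules. The Python sketch below is only an illustration of those rules as given in this comment; the function names are invented, and treating `en` as the unprefixed default is an assumption drawn from "when not English".

```python
def pos_category(language_name: str, pos_plural: str) -> str:
    """Part-of-speech categories use the spelled-out language name."""
    return f"Category:{language_name} {pos_plural}"


def topic_category(topic: str, lang_code: str = "en") -> str:
    """Topic categories use a language-code prefix, except for English."""
    if lang_code == "en":
        return f"Category:{topic}"
    return f"Category:{lang_code}:{topic}"


print(pos_category("Mandarin", "nouns"))
print(topic_category("Horses", "ja"))
print(topic_category("Horses"))
```

Under these rules `topic_category("Nouns", "it")` would yield Category:it:Nouns, which is the "Italian words about nouns" reading joked about above.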

I don’t like the language-code formats except where they are logically required. In Russian, for example, {{Category:Russian cities}} means cities inside Russia, while {{Category:ru:Cities}} is for cities all over the world in the Russian alphabet. I recently saw someone changing all instances of {{Category:Dutch prepositions}} to {{Category:nl:Prepositions}}, and that’s what makes it so confusing. The prepositions really are Dutch prepositions, and they should be called Dutch prepositions. {{Category:Japanese horses}} would only make sense as a category for Japanese breeds; a category of horses that would include Tennessee walking horses and Arabian breeds would be {{Category:ja:Horses}}. Unfortunately, some people have been creating all sorts of categories in the lang-code format, which makes it so confusing that nobody can guess what name may be in use, so either they throw up their hands and avoid categories altogether or they create redundant and competing categories. It’s very easy to know whether to use "Dutch" or "nl" for any given category, but the issue is clouded by a few who haven’t bothered to think and are just creating lang-code categories because they look high-tech or something. —Stephen 22:14, 19 August 2006 (UTC)

Allow me to try to step back and examine why people use the codes (besides wanting it to look "cool" :-). The purpose of the language codes is to tell the reader that all words in that category are in a given language. The advantage is that the codes conform to an international standard and nearly always refer to the same thing. The disadvantage is that the codes are not plain English, and are therefore difficult to understand by some. The problem with plain English is lack of precision. The word Japanese can refer to either the language or to things related to the country or culture of Japan. It is because of that lack of precision that some people opt for ja:Horses instead of Japanese horses.

I think perhaps the solution might be to find a way to use plain English to designate the language of the words in a given category that is precise, but not overly wordy. For example, instead of ] and ], what if it were ] and ]? That way, you could have ] (Japanese words for horses that are indigenous to Japan).

The only remaining question would be how to indicate which script the category uses. Some people seem to feel that the script is irrelevant, opting to place everything into the same category (ex. romaji, kanji, hiragana, katakana all in the same category). However, there are distinct advantages to separating the categories by script. For one thing, it will reduce the size of the category. It also makes for a cleaner table of contents. In the case of Mandarin, I would favor:

]

etc. Similarly, for Japanese:

]

in other words, the ISO codes more or less spelled out in plain English. I think this would give us not only precision, but also the plain English preferred by Stephen and others. I'm guessing that resistance to this may stem from the length of the category names, but there are always going to be tradeoffs. Comments?

A-cai 23:42, 19 August 2006 (UTC)

I can see the rationale for retaining the plain language category name when the category is "about" the language, rather than merely "in" the language. In the Dutch preposition example, arguments could go either way. I think though that distinguishing between "Japanese horses" and "Horses in Japan" has the potential to become even more confusing.

Using the language codes in this way is also related to sorting, and to developing categories that can be exactly paralleled for words in each language. You will note that the language codes are in lower case, while the categories themselves begin with an upper-case letter. This was intentional because it forces them to be sorted separately apart from the categories as used for words in English or in other languages. Whether we are talking about Horses, ja:Horses, or fi:Horses we know that we are talking about the same thing in different languages. Eclecticology 00:29, 20 August 2006 (UTC)
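The sorting effect of the lower-case prefix can be seen with a plain code point sort, in which every upper-case ASCII letter comes before every lower-case one. This is a Python sketch of that ordering only; whether the wiki software collates category names in exactly this way is an assumption, and the category names are just samples from this discussion.

```python
categories = ["ja:Horses", "Horses", "fi:Horses", "Anatomy", "nl:Time"]

# Python compares strings by code point, so "A".."Z" sort before "a".."z".
ordered = sorted(categories)
print(ordered)
```

The unprefixed (English) categories come out first as a block, followed by the code-prefixed ones grouped by language code.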

Understood, but what about my suggestion of using ] instead of ] (which would also distinguish the language)? Are you saying that you would rather stick with the ISO codes?

A-cai 00:52, 20 August 2006 (UTC)

The ISO codes are still shorter. For your idea to work it would need to be Category:japanese:Horses, i.e. with a lower case j. Eclecticology 01:01, 20 August 2006 (UTC)

The additional reason for the language codes is that they follow an accepted standard. There are languages that have more than one name in English, such as Slovene versus Slovenian. In any case, both Category:Dutch prepositions and Category:nl:Time sort into the same super-category of Category:Dutch language, so anyone perusing the category will see both and make the mental connection. It is also possible to look up the ISO codes on Wikipedia to find out what they stand for. --EncycloPetey 01:13, 20 August 2006 (UTC)

This may sound like a silly suggestion, but I think it might have merit, so I will throw it out there. The main sticking point is part of speech categories such as ] (instead of ]). Based on the above arguments, both are correct ... and complementary. In other words, the words in Category:Japanese nouns are Japanese nouns written in Japanese. I think what we really mean to say is ]. This would satisfy the need for plain English, but also allow for standards-based language classification. My reason for wanting the language code is that all categories are supposedly derived from Category:*Topics (Category:*Topics -> Category:Linguistics -> Category:Grammar -> Category:Parts of speech -> Category:Nouns). I think it looks odd to have Category:zh:*Topics -> Category:zh:Linguistics -> Category:zh:Grammar -> Category:zh:Parts of speech -> Category:Mandarin nouns. However, some people may want to retain the word Mandarin in the category name, so why not ] (instead of Category:zh:Nouns)? Would this be an acceptable solution?

A-cai 02:06, 20 August 2006 (UTC)

Up to now, the top level category for each language has been something like Category:Latin language or Category:Dutch language. The parts of speech are indexed under this category as is the *Topics category. Are you suggesting a complete restructure of the entire hierarchy?

I think you're missing the point that the Nouns category includes words that are nouns, whereas a category titled zh:Nouns would include Mandarin words about nouns. How does your suggestion tackle this problem? --EncycloPetey 02:13, 20 August 2006 (UTC)

I'm not sure I concur with your logic; I believe it to be circular. If words in the Nouns category are nouns, why would the zh:Nouns category be about nouns (and not: are nouns which are written in Mandarin)? For that matter, why wouldn't the Nouns category be about Nouns, in which case we should have a category called ] or ]which are nouns?

My answer to your second question is no, I'm not suggesting a restructuring of the entire hierarchy (unless the result of our discussion points us in that direction). I'm merely requesting a clarification. What do we mean by placing the ISO code in front of a category name? I was under the impression that it meant that the words in zh are written in Mandarin, or ja written in Japanese. I was not aware that it should mean anything else other than that (ex. Category:Computing lists English words which are computing terms, but Category:zh-cn:Computing lists simplified Mandarin words about computing terms? ... that doesn't even make sense).

In other words, we need category names that are not ambiguous. We need category names whose meaning is precisely understood. ] has the virtue of being unambiguous. From that name, we should be able to discern that:

the words are all written in Mandarin
the words are all written in Simplified Chinese script
the words are all Mandarin nouns

I think we need this type of precision, we also need consistency. I will close by pointing out one more thing. If the categories had intuitive names (as I think you're implying they already do), I doubt we would be debating about the meaning of nouns vs. Japanese nouns vs. ja:nouns :-)

A-cai 03:47, 20 August 2006 (UTC)

I agree that precision and consistency are paramount, but it can be difficult to achieve this and maintain scalability at the same time. If I were developing this concept now rather than over two years ago I would probably use the three-letter codes from ISO 639-3 instead of the two-letter codes. Category:Dutch language is just fine as a linking category to categories for Dutch words; Category:nl:*Topics is an essential sub-category of that.

I find that "Nouns" as a category is thoroughly useless, as is "English nouns", as is "Mandarin nouns". Replacing these categories with ISO codes doesn't change that. What concerns me about this with Mandarin is the extent to which we may be imposing the Indo-European concept of "noun" on a language that thinks quite differently about the grammatical function of its terms. How did the Chinese conceive of grammar before the arrival of European influences? Eclecticology 04:38, 20 August 2006 (UTC)

While the merit of applying the "noun" label to Chinese languages is certainly questionable, the implication of EC's above objection to a "Nouns" category and to Category:English nouns in particular is that there should be no such category. If that topic is to be discussed, it should be its own Beer parlour section, in which it will be important to note that any given editor's inability to use a given category does not constitute a ban on such a category. I.e., a category only needs to be useful to a significant population within a WikiMedia project. Rod (A. Smith) 07:43, 20 August 2006 (UTC)

If you want to start such a thread, go ahead. Also, I said nothing about banning; the attitudes over such a category have been mixed for a long time. I simply don't waste my time adding it to any article, and occasionally I remove it. If you think it's useful then you should be able to show what a person gets out of looking up the category. Eclecticology 23:41, 20 August 2006 (UTC)

Whew! I thought you were hinting toward a policy against large categories. In the interest of avoiding a debate, I'm glad you aren't. Rod (A. Smith) 01:42, 21 August 2006 (UTC)

I would still encourage efforts to break down all big categories into manageable bites. Eclecticology 04:12, 21 August 2006 (UTC)

It's quite true that for a category to be useful to people it should be neither too big nor too small, but it's worth noting that there are, potentially, more uses for categories than presenting them to people as useful categories.

In particular, there seem to be several efforts to use them to impose additional bits of structure, and enable additional bits of automated processing, beyond what MediaWiki natively supports. In the case of nouns, for example, tagging them with ] is a somewhat more definitive way of indicating that a word is a noun than putting a ==Noun== (or maybe ==noun==, or ==]==, or =={{noun}}==...) header somewhere within its article. See Wiktionary:Grease pit#Moving POS categories into POS templates for some more discussion and examples.

(I'm not taking a position here on whether these "mechanical" categories do or don't make sense or are or aren't reasonable or useful, merely that they exist.) —scs 12:57, 21 August 2006 (UTC)

The POS categories have absolutely no business being in the templates. Those templates are for the inflections only. If someone wants a list of nouns putting "===Noun===" in the search box will have the same effect. What takes place on Grease Pit is technical discussions. Non-technical people are not likely to follow the discussions there so any agreements there should be viewed as among techies only, and should not be imposed on the general community. Eclecticology 09:12, 22 August 2006 (UTC)

That is a blindingly false, duplicitous statement. Categories have numerous other uses than direct navigation. To say that something is "imposed" falsely, in order to impose your own POV (while damaging the entire Wiktionary project) is not conducive to harboring cooperation. What reason do you have for disliking the addition of useful categories? Your objection is unclear. Perhaps if you explained why you object to Category:English nouns, a discussion could ensue. --Connel MacKenzie 16:17, 22 August 2006 (UTC)

Whoah, Connel, dude, chill! I don't agree with Ec's position, either, but I'm not gonna call him a duplicitous wheel warrior just for holding it. Perhaps if you avoided antagonizing people quite so much, they might be more willing to have that discussion you're hoping for. —scs 01:01, 23 August 2006 (UTC)

The statement he made was what I was name-calling, not he, himself. The action of making numerous anti-community edits first then looking for a discussion is not the sort of thing I'd expect from a bureaucrat, even (or especially) after a long absence. Making template changes that clog the job-queue, while ignoring developments from the past few months, is more than a little antagonistic. The pointless wheel warring is evidence of itself. I don't hope for such a conversation, I expect it, as does any other member in the Wiktionary community. But before changes, not after, --Connel MacKenzie 06:37, 23 August 2006 (UTC)

(unindenting for space) On what authority does EC control the purpose of the headword/pos/inflection templates?! English nouns are no longer listed in a category because EC reverted a perfectly harmless and useful edit to {{en-noun}}. Putting "===Noun===" in the search box will not have the same effect because that search yields non-English results. Rod (A. Smith) 14:53, 22 August 2006 (UTC)

Clearly not the Wiktionary community. While in the past, compelling arguments were raised against this type of useful category, those arguments have since been rendered invalid. I strongly recommend Ec reads the archives before engaging in a wheel war. --Connel MacKenzie 16:10, 22 August 2006 (UTC)
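(Editorial illustration.) Rod's point above, that a bare search for "===Noun===" also matches non-English sections, can be sketched in a few lines of Python. This is only a rough sketch: it assumes entries use level-two language headers and level-three part-of-speech headers, and the function name and sample wikitext are hypothetical, not part of any Wiktionary tool.

```python
import re

def has_pos_in_language(wikitext, language, pos):
    """Return True if a ===pos=== header occurs inside the ==language== section."""
    current_language = None
    for raw_line in wikitext.splitlines():
        line = raw_line.strip()
        # Level-two header such as ==English== (the [^=] rejects ===Noun===).
        level_two = re.fullmatch(r"==([^=].*?)==", line)
        if level_two:
            current_language = level_two.group(1).strip()
            continue
        # Level-three header such as ===Noun=== (rejects ====Translations====).
        level_three = re.fullmatch(r"===([^=].*?)===", line)
        if (level_three and current_language == language
                and level_three.group(1).strip() == pos):
            return True
    return False

# Hypothetical entry: an English verb and a Japanese noun on the same page.
SAMPLE = """==English==
===Verb===
# to do something
==Japanese==
===Noun===
# a thing
"""

plain_search_hit = "===Noun===" in SAMPLE            # what a text search sees
english_noun = has_pos_in_language(SAMPLE, "English", "Noun")
japanese_noun = has_pos_in_language(SAMPLE, "Japanese", "Noun")
```

On this sample the plain text search reports a hit even though the page has no English noun, which is the ambiguity a language-specific category would avoid.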

Response

I agree with your observation about nouns. There are a number of words in ancient texts that can work as multiple parts of speech. The most famous example is from Laozi:

道	可	道	非	常	道,
the way/doctrine	that can be	talked about/made a doctrine/be quantified	is not	a constant	way/path/doctrine,

名	可	名	非	常	名
the name	that can	be named	is not	a constant	name

道 and 名 can be interpreted in many different ways, which is exactly Laozi's point. Once you start talking about something as mysterious as "the way," all attempts to verbally describe it will ultimately fail (and yet, like us, he still tried :0).

The above is obviously an extreme example. Nevertheless, modern Chinese is a much more contextual language than English. English is an extremely verbose language, when compared to Chinese. I do think there is value in stating that a given Chinese word is a noun, but I agree that a noun category may not be that helpful. This is why I have made an effort to mirror as many of the English categories for the Mandarin entries (ex. Category:Flowers, Category:zh-cn:Flowers etc). I was using nouns as an example because, ironically, it is the most used category.

A-cai 06:24, 20 August 2006 (UTC)

It's this "most used" characteristic that makes it useless. To be most useful a category must be neither too big nor too small. In any broad category having too many sub-categories with only one or two items probably means that they are too small. A category is probably too big if it has to be spread over more than one 200 item page. Your example from Laozi is very good, and may even be reflected in the tendency in English to use nouns as verbs, as with "to google". Perhaps the part of speech is more a quality of usage than a characteristic inherent in the word itself. In terms of verbosity and contextuality English is closer to Chinese than other Indo-European languages. Many of them, particularly the Romance languages depend far more on small connective words or complex inflections. When an English text and its French equivalent are put side-by-side, the French version is almost invariably longer.

When it comes to something like Category:Flowers we are dealing with a topical category. These are, of course, not free from their own problems, but they are far less concerned with lexical and grammatical structures. Eclecticology 23:41, 20 August 2006 (UTC)

I can tell you how I have been using the nouns category so far. The vast majority of new words being entered into Wiktionary are in fact nouns. Most people who enter these new words are putting the new words into the noun category (if they assign a category at all). I can then go through the nouns category and separate out words according to topic (ex. this is a flower, this is a computing term etc). Wiktionary currently lacks a good alternative to this.

There is no obligation to put a word into any category, and putting something into the noun category just for the sake of having a category is an utter waste of energy. I readily admit that the categorization needs to become more developed, especially when dealing with more abstract concepts, and perhaps the Wikisaurus concept could be used to great advantage for this. Eclecticology 07:39, 22 August 2006 (UTC)

With respect to your second point, I'm not sure I agree with your 200 limit philosophy. I agree that limitations in the current version of Wikimedia software make working with categories over 200 words difficult. However, that doesn't automatically mean that large categories are inherently bad! We should strive to make the software meet the needs of contributors, not the other way around.

That's exactly why these large categories are bad. When you search using Google, and it gives you a large number of results, how many of them do you really look at? It's the same with large categories; no-one is going to want to go through thousands of items in the hope that he will find what he's looking for. Breaking the category down into digestible pieces helps this process. Eclecticology 07:39, 22 August 2006 (UTC)

The reality is that Wiktionary is going to grow, and you WILL start to have categories that contain several thousand words. I own a dictionary that contains over 8,000 four-character idioms alone (which is a modest number of idioms, remember Chinese is one of the oldest living languages)! In order to make the current Mandarin idioms category accommodate such numbers, we would have to have categories such as for each Pinyin syllable! You may say, well what's wrong with that? Nothing, unless you then want to sort the idioms by radical/stroke order! The current solution would be to slap another category called idioms by radical/stroke and split it up by radical/stroke categories! I'm not saying this couldn't be done, even with Wikimedia software in its current state, but let's think about this a second. The only reason for splitting it up this way is that Wikimedia software can't handle large amounts of data.

Not at all. It has to do with accommodating how people look for things. Each pinyin syllable should be in an article, either alone or with identically spelled words in other languages. If these are properly organized it should make those categories redundant. In some cases an index may be more appropriate, and maybe categories is just not the best technique. Eclecticology 07:39, 22 August 2006 (UTC)

I'm not as afraid for Wiktionary because it is a smallish project when compared to Wikisource or Wikipedia. What happens when someone decides they want to put all 2500+ pages worth of Zuo Zhuan onto Wikisource (already been done for Chinese, I'm just waiting for someone to throw the English translation onto the English Wikisource)? Wikimedia software has some serious challenges ahead of it, if it really wants to do what we all eventually want it to do!

A-cai 11:00, 21 August 2006 (UTC)

Absolutely, but the challenges for Wiktionary and Wikisource will remain different. Assuming that there are no copyright issues regarding the English version of the Zuo Zhuan I would be perfectly delighted to have it in Wikisource. That, plus it should have a concordance which interlinks with Wiktionary. But the fact that I can envision such possibilities does not mean that I believe that they are immediate possibilities. Eclecticology 07:39, 22 August 2006 (UTC)

New List

I started a new list of words based on a regular article called Technically Speaking in the magazine Spectrum published by the IEEE. I've paraphrased the words from the article. It's currently at User:RJFJR/Spectrum technically speaking but should probably be in the appendix space. Opinion on moving? RJFJR 20:08, 20 August 2006 (UTC)

Removing script-form headings from Japanese entries

(refer to WT:AJ) Most of the Japanese entries have script form headers entered as if they are parts of speech. This often leads to entries (particularly for hiragana and romaji) lacking POS headings entirely. Even the examples in WT:AJ show very non-standard entry formats. (There is even a heading furigana which is not even a script form.)

It (WT:AJ) starts out defining new formats for Kanji, Romaji, and Hiragana entries, not in WT:ELE style, then has a very good description of the POS forms for the Japanese language, then mentions again that the script forms should be POS-equivalent headings, which they clearly should not be!

Compare the original form of okashi with the current version. Look at the original first. Imagine you are a user of the English wiktionary who knows nothing of Japanese (what are all those squiggles?) that's why you are looking it up. Ask yourself:

Would you know that this is four different words, that all happen to be spelled "okashi"?
Would you know that (3) is a noun, the noun stem of a verb form?
Would you have any idea which kind of squiggle is which script?

(Note that User:WereCarrot followed the WT:AJ form, and this is a fine entry, no criticism there!)

Now look at okashi, which is in WT:ELE standard form, with (e.g.) the ja-noun template.

Can we please remove the non-standard things from WT:AJ while keeping all of the excellent POS structure and conjugation templates and so forth? Robert Ullmann 12:05, 21 August 2006 (UTC)

I prefer the format you suggest, but editors promoting the script headers and the abbreviated format expressed the desire to minimize repetition between the romaji, kana, and kanji entries. Lacking a proper separation of data and presentation, Wiktionary currently has no alternative but to suffer some such repetition. Rod (A. Smith) 22:56, 23 August 2006 (UTC)

Look at kansen (please...) This uses the POS header and POS template, and is otherwise consistent with the format that those headers preferred/prefer. Tell me what you think. If the romaji or hiragana was for several different POS's, there would be several sections. Robert Ullmann 11:48, 26 August 2006 (UTC)

The operative word being repetition ... Robert's templates do minimize the pain of repetition, but they do not eliminate the need for it. I would personally favor someone writing a bot that would automatically go in and create appropriate sister entries for the same word. The problem is that one entry would need to provide ALL relevant information for each entry. I'll use a Mandarin example to illustrate the problem:

質量 - {{cmn-noun|t|pin=zhíliàng|pint=zhi2liang4|tra=質量|sim=质量|rs=貝08}}
质量 - {{cmn-noun|s|pin=zhìliàng|pint=zhi4liang4|tra=質量|sim=质量|rs=贝04}}

Note, that the radical/stroke changes for the head character. Additionally, the tone of the first syllable changes for the traditional character. This is due to differences in standards between the PRC (simplified) and Taiwan (traditional), similar to what we see between British and American English. In order to capture all of this in one template, you might need to do something like:

{{cmn-noun|s|pincn=zhìliàng|pintcn=zhi4liang4|pintw=zhíliàng|pinttw=zhi2liang4|tra=質量|sim=质量|rst=貝08|rss=贝04}}

Unless all of that were in an entry, regardless of whether it were Pinyin, traditional or simplified, the bot would not have the information it needs to create appropriate sister entries. This is why I end up having to edit between entries, even when it is essentially the same word. At the same time, a template that includes all the needed info would be very cumbersome to use. We need a gui for this stuff!!! :) A-cai 11:24, 27 August 2006 (UTC)
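(Editorial illustration.) The bot idea above can be sketched as follows: given one combined record carrying all the fields from the long template call, derive the two per-script calls shown earlier. This is a minimal Python sketch, not an existing bot. The combined parameter names (pincn, pintcn, pintw, pinttw, rst, rss) are taken from A-cai's hypothetical template, and the function names are invented for illustration.

```python
def build_call(script, pin, pint, tra, sim, rs):
    """Assemble one {{cmn-noun}} call for a single script form."""
    fields = [script, f"pin={pin}", f"pint={pint}",
              f"tra={tra}", f"sim={sim}", f"rs={rs}"]
    return "{{cmn-noun|" + "|".join(fields) + "}}"

def sister_entries(record):
    """Map each headword (traditional, simplified) to its own template call.

    Assumes the traditional and simplified forms differ; if they were
    identical the two dictionary keys would collide.
    """
    return {
        # Traditional entry uses the Taiwan reading and traditional radical.
        record["tra"]: build_call("t", record["pintw"], record["pinttw"],
                                  record["tra"], record["sim"], record["rst"]),
        # Simplified entry uses the PRC reading and simplified radical.
        record["sim"]: build_call("s", record["pincn"], record["pintcn"],
                                  record["tra"], record["sim"], record["rss"]),
    }

# The combined record from the example above.
RECORD = {
    "pincn": "zhìliàng", "pintcn": "zhi4liang4",
    "pintw": "zhíliàng", "pinttw": "zhi2liang4",
    "tra": "質量", "sim": "质量",
    "rst": "貝08", "rss": "贝04",
}

calls = sister_entries(RECORD)
```

With this record, the generated calls reproduce the two separate lines given for 質量 and 质量, which suggests the combined form does carry enough information for a bot, at the cost of the cumbersome input A-cai describes.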

Dealing with Wonderfool

After Wonderfool's latest sock DanOfDublin (talk • contribs) was blocked - I ask the question - how are we meant to deal with this user/vandal??

I've noticed three things recently: 1. He's become fixated with briefs. 2. He's making personal attacks. 3. He's posting patent nonsense.

How do we deal with this problem?? --SilverPools 13:43, 21 August 2006 (UTC)

I don't think this is Wonderfool at all; IIRC he showed up when we reverted changes to Northumbria. (Yes; look at history.) Just another mindless vandal. Wonderfool is much more intelligent, if often annoying. Robert Ullmann 14:57, 21 August 2006 (UTC)

I agree. Widsith 18:21, 21 August 2006 (UTC)

I agree with Robert Ullmann and Widsith; these give every indication of being simple copycat vandalism entries. As such, they are routinely dealt with just as any other routine vandalism. --Connel MacKenzie 20:53, 21 August 2006 (UTC)

Category:Japanese romaji and Category:Romaji

I think these categories might have to be merged. 19:07, 21 August 2006 (UTC) — This unsigned comment was added by Felka (talk • contribs). Please use four tildes.

The first only has nine items, and appears to be obsolete. Eclecticology 08:08, 22 August 2006 (UTC)

Category:Japanese romaji is new, re-naming the other, that should have had "Japanese" in its name. You can't just move a category. Robert Ullmann 11:25, 22 August 2006 (UTC)

I thought categories beginning with "Japanese" were to be terms about Japanese topics (e.g. English words about Japanese culture) and that categories beginning with "ja:" were to be used for Japanese terms. I'm not sure I have that right, though, so I welcome correction. In any event, when we settle on the right category name, it would certainly be in order to bot-migrate from Category:Romaji to its new name. Rod (A. Smith) 23:02, 23 August 2006 (UTC)

Seems to me these (hiragana as well) should follow the POS categories (Japanese nouns) rather than the topical categories (ja:horses); these words are romaji, not about romaji. (kanji is well established as "Japanese kanji") But in any case, we should be able to change the name if desired by editing a small (5+) number of templates. Probably won't need a bot. Most use Template:romaji now, although there are a number of redundant categorizations. Robert Ullmann 11:40, 24 August 2006 (UTC)

Featured word candidates reformat

I've put together a proposed reformat for Wiktionary:Word of the day/Nominations at Wiktionary:Word of the day/Nominations/Format Proposal. It's basically the current one, but more aesthetically pleasing and with a bit more finesse. Any ideas, questions, or productive, angry rants are welcome. Foxjwill 16:19, 13 August 2006 (UTC)

Final Solution-Wonderfool

Declarations;

I am not a wonderfool sockpuppet.

Solution;

Immediate appointment of a temp Checkuser. Sockpuppet checks on all officials.

Geo.plrd 20:41, 22 August 2006 (UTC)

See #CheckUser run on all sysops above. --Connel MacKenzie 19:09, 23 August 2006 (UTC)

I suppose I should have my run at that, then, while my admin nomination is pending. bd2412 T 19:14, 23 August 2006 (UTC)

Well, I don't know that the meta: stewards will permit it, even if the final vote does approve CheckUsering all sysops. I expect you will be a sysop long before that issue is decided. --Connel MacKenzie 20:07, 23 August 2006 (UTC)

I've asked Kelly Martin to run one already. I don't imagine anyone is on edge in my case, but if the community wants this safeguard in place, I'll be glad to set the precedent. I think they'll permit it for me. bd2412 T 20:44, 23 August 2006 (UTC)

Excellent idea. I'm requesting her to check me now, also. --Connel MacKenzie 22:28, 23 August 2006 (UTC)

She is more than welcome to check me as well. —Stephen 22:34, 23 August 2006 (UTC)

Both of these editors have "sockpuppets", but I suspect you all already knew about their respective bots. Otherwise, I see no cause of suspicion. Kelly Martin 22:39, 23 August 2006 (UTC)

Pluto, planet, dwarf planet

Moved to Tea Room; very interesting subject. Robert Ullmann 20:09, 24 August 2006 (UTC)

w:Wikipedia:AutoWikiBrowser

It's been noted that my use of the AutoWikiBrowser might raise some concerns, as it has some functionality similar to programs for which permission is usually required on this project. I'd like to take a minute to sell this tool - it's nothing more or less than a tool, and like all tools can be used for good work or mischief (intentional or no). With a little experience and an attentive user, it's actually quite a good tool, and I have used it to great effect on Wikipedia for a wide variety of purposes, such as category changes, fixing common spelling errors, and disambiguation fixes.

The AWB is not a bot - the user has to view and approve each change, but if the user is moving fast and approves a change they ought not have, they should catch it and can fix it right away. Frankly, Wikipedia is reaching a scale where it would be effectively hamstrung without a large number of users using this tool to carry out a variety of tasks which require hundreds or thousands of edits to achieve consistency or fix common problems. It is a matter of time before Wiktionary reaches that scale (maybe it's there already), so I encourage this community to embrace the functionality of the AWB, at least in the hands of those who understand how to use it. Cheers! bd2412 T 03:57, 26 August 2006 (UTC)

I've tried it out; it really is too bad it is so huge. It was written just for Windows, and has a 12MB download (twice the size of the entire Firefox download!). It requires IE, and pulls in the entire .NET framework just to get access to the regex calls (even though there is plenty of GPL regex code). Pretty horrible. It should be about 1MB, and not Windows specific. Against that, it does actually work ....

One note is that Wikipedia has some check magic so that only users with bot flags can use it in bot mode (yes, it can also be a bot); I think we lack that magic? Maybe we should see? (I don't think we need AWB in bot mode here at all; we have a number of people who can do Python with proper care.) Robert Ullmann 11:41, 26 August 2006 (UTC)

I wholly agree that AWB should not be useable as a bot without specific community permission for that, but I have no intent of using it in any kind of bot mode myself. I like the control of checking each edit as it is made. The 'huge'ness is a matter of personal preference - it has not bothered me thus far. bd2412 T 14:08, 26 August 2006 (UTC)

category/tag for railroading terms?

Do we have a standard tag for railroading terms? I found one use of {{cattag|railways}} at points, but that's only one data point. I'm not sure whether the tag should be railroad, railroads, railroading, railway, or railways, but of course it would be good to be consistent. —scs 18:59, 26 August 2006 (UTC)

How about just "rail", which is a broad term for railways, transport by train, etc? — Paul G 09:22, 27 August 2006 (UTC)

This is probably best for avoiding the US/Commonwealth difference between "railroad" and "railway". Eclecticology 09:27, 28 August 2006 (UTC)

I started to set this up in Category:Rail transportation to bypass ambiguities with other possible uses of the word "rail", but now there also seems to be an undescribed difference between the noun transport and transportation. This may be a difference between North American and British usage, but the dictionaries that I have looked at so far do not clarify this. Eclecticology 19:34, 28 August 2006 (UTC)

The way to be consistent is to have the other templates redirect to whatever is chosen. DAVilla 09:30, 27 August 2006 (UTC)

At which point no-one corrects any of it. We want all elements of the set to show up, and a wrong category to at least show up in red. Eclecticology 09:27, 28 August 2006 (UTC)

EC, if the templates redirect to the canonical form, there won't be any "wrong category". Or inconsistent tag presentation. And there is nothing to correct. Sigh. Robert Ullmann 14:57, 28 August 2006 (UTC)

It's not an issue of templates. Jumping in with templates at this early stage of the subject only makes the categories inflexible, and subdivision more difficult. Eclecticology 18:43, 28 August 2006 (UTC)

It is and isn't an issue of templates -- there are two (or three) different issues here.

Clearly we want one consistent master name for the category/tag. But if we can't decide what to call it, or, more significantly, if we suspect that everyday editors won't always remember what to call it, then "linking" to it from multiple differently-spelled templates maximizes both consistency and convenience. That is, rather than forcing each editor to constantly refer to some master list of single approved per-category template names, we allow them to use either (say) {{railways}}, {{railroading}}, or {{trains}}, making it the more likely that they can simply use the one they remember.

Also, this isn't necessarily a category issue at all; I asked about categories only because many of our tagging templates are linked to categories. My real question was just about the name of the template (if any) to be used, and the spelling of the resulting italicized tag (if any) to appear on a definition line. (Personally, I'm not sure how useful the corresponding categories for tags like these actually are.) —scs 13:27, 30 August 2006 (UTC)
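(Editorial illustration.) The redirect idea discussed in this thread, several remembered spellings all resolving to one master tag, can be sketched in Python as a simple alias lookup. This is a hedged sketch only: the canonical name "rail" is Paul G's suggestion and was not settled in the discussion, and the alias set and function name are hypothetical.

```python
# Placeholder canonical tag; the thread had not agreed on a final name.
CANONICAL_TAG = "rail"

# Spellings an editor might remember, all pointing at the canonical tag.
ALIASES = {"railways", "railway", "railroads", "railroad",
           "railroading", "trains"}

def resolve_tag(template_name):
    """Return the canonical tag for a known spelling, or None if unknown."""
    name = template_name.strip().lower()
    if name == CANONICAL_TAG or name in ALIASES:
        return CANONICAL_TAG
    return None

resolved = [resolve_tag(n) for n in ("railways", "railroading", "trains")]
unknown = resolve_tag("horses")
```

Every known spelling yields the same displayed tag, so presentation stays consistent, while an unrecognised name returns None and can still be flagged, which speaks to Eclecticology's wish that wrong names remain visible.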

Flapped "t" in US pronunciations

I don't know who entered it, but the US pronunciation for "metaphor" had /d/ representing the "t".

I appreciate that this was an attempt to represent flapped "t" (see flapping in Wikipedia), but /d/ is not the right IPA symbol for this phoneme - rather, it is /ɾ/. Either we use this consistently for all intervocalic "t"s (and "d"s, for that matter) in US pronunciations, or we stick with the more broadly representative /t/ (which I think is the better option). Using /d/ for flapped "t" is incorrect and misleading. — Paul G 09:21, 27 August 2006 (UTC)

I am not sure that describing a small pocket in the Boston area as being representative of all en-us speakers is fair. If a flapped version is entered, it should be separate from US, identifying that tiny area. (But this minor note is probably more appropriate in the tea room.) --Connel MacKenzie 17:50, 27 August 2006 (UTC)

How is this an issue of Boston? And what the heck is /ər/? DAVilla 18:18, 27 August 2006 (UTC)

I know that I don't pronounce it 'flapped' and don't recall hearing it. But the Wikipedia article Paul linked above refers to a "small pocket in the Boston area." --Connel MacKenzie 18:30, 27 August 2006 (UTC)

Oh, I see the reference. No, that doesn't apply to this case, since the t isn't followed by a syllabic n. I wasn't able to find the absence of flapping among any of the w:American English regional differences, so I think it must be a pretty universal characteristic. DAVilla 19:56, 27 August 2006 (UTC)

Maybe I'm understanding what you call "flapping" incorrectly? The US audio file does not change the "t" to a "d", or does this term mean something else? (Note: Dvortygirl indicated that she was unaware of this conversation, when she recorded it.) --Connel MacKenzie 20:21, 27 August 2006 (UTC)

Yeah, that's flapping, the American tap somewhere in between [t] and [d]. If you were to actually say [t] in "metaphor" then it would sound strange, either too proper or forced in enunciation. DAVilla 16:04, 28 August 2006 (UTC)

Firstly, I was under the impression that the vast majority of US speakers merged /t/ and /d/ in this position. Certainly a significant number of Australians do.

Secondly, Paul G needs to learn what phonemes are - [ɾ] is not a phoneme in English. If you are going to use the symbol ɾ, then you must put it in [brackets], not /slashes/. --Ptcamn 05:52, 28 August 2006 (UTC)

Quite, though it's hardly Paul's fault since he's just trying to deal with the originally-entered (phonetic) . The point is that there is a long-established convention here that our IPA transcriptions should be phonemic and not narrowly phonetic, therefore, /t/ is the appropriate character in this case. Widsith 18:46, 28 August 2006 (UTC)

I agree that flaps are subphonemic in American English and shouldn't be indicated in a dictionary. Nevertheless, there seems to be some desire among British lexicographers to show the flaps if they include American pronunciations. For example, the Longman Pronunciation Dictionary by British phonetician J. C. Wells consistently shows flaps in its transliteration of General American. And an American linguist I know who works in Britain was approached by a publisher preparing a dictionary that was to include American pronunciations as well as British ones; they wanted to use /d/ to indicate the flap and he had to work very hard (including an e-mail writing campaign from American linguists) to talk them out of it. Angr 09:58, 29 August 2006 (UTC)

I don't see why narrow transcriptions shouldn't be allowed. In fact I consider such broad transcriptions to be misuse in a multilingual environment such as this. IPA was designed to distinguish very narrowly between these phones. If the majority wants to use /r/ then fine, I suppose that's neutral enough. But nothing's wrong with while is just incorrect. Now don't blame me for bringing up the "r" issue because it's a lot broader than that. I merely want to exemplify that all of these things are tied together. And no, this won't go away until we get people here with authority on the matter. Unfortunately that excludes Paul and myself and a lot of other people who would like to chime in. DAVilla 10:58, 29 August 2006 (UTC)

If we used narrow transcription we would be distinguishing between every tiny dialectal difference; we would need literally hundreds of IPAs for the UK alone. It isn't practical, and if it were done it would be too unwieldy to be useful. The whole point of phonemic transcription is that it is applicable to the widest number of speakers. However, the real argument which we need to thrash out (and which you seem to argue for above) is what exactly the phonemes of English are. Since we already distinguish between UK and US pronunciations, there is definitely a case for using /ɹ/ and /ɻ/ respectively, since /r/ is a bit misleading when almost no dialects of English use the sound, and when other languages which may appear on the same page do. These details should be established because at the moment the conventional practice is not in tune with Wiktionary:English pronunciation key, which is linked to from every IPA template we use. Widsith 18:51, 29 August 2006 (UTC)

Am I missing something here? As DAVilla pointed out just above, flaps are subphonemic, and as far as I know, the versus versus distinction is, too. So if we were attempting a phonemic transcription, it seems to me there would definitely not be a case for "using /ɹ/ and /ɻ/ respectively". If we were attempting a phonemic transcription, it seems to me we would, yes, just use /r/, where this means "the r sound, however you pronounce it" (that is, whether it's , , , or sometimes ).

I think the real problem here is that we're never really sure whether we're attempting phonemic or phonetic transcriptions, or both, or something in between. If we could somehow decide on just one or the other, many of these eternal discussions would finally find definitive answers. But of course we can't decide on just one or the other (nor do I think we necessarily should), and as long as both of them are in the picture, we can wander endlessly between the two poles, confusing our phones with our phonemes and never really being sure what we're talking about. —scs 19:14, 7 September 2006 (UTC)

There is only one w:Received Pronunciation and one w:Standard Midwestern, and if people are inclined to add others then more power to them. As it stands the minor differences in dialects are swept under the rug. DAVilla 21:46, 29 August 2006 (UTC)

Request for sysopship

May I please be made a sysop temporarily in order to add an image of a barrel nut? I will give up my sysopship immediately after adding the image. Raifʻhār Doremítzwr 20:00, 29 August 2006 (UTC)

You don't need to be a sysop to upload an image, simply head to Commons and upload it there. We are no longer storing images on this project but using commons images. - TheDaveRoss 20:05, 29 August 2006 (UTC)

OK, thanks. It is done. Sorry for my cluelessness. Raifʻhār Doremítzwr 20:20, 29 August 2006 (UTC)

Han characters

In doing a bit of work on templating CJKV languages to get entries much closer to our standard style, I ran into a large mess ... you may know of ... the pages for Han characters are completely different, due mostly to a user "Nanshu" who really wanted the Wiktionary to be an XML database ...

See 詞. Which is just barely a dictionary definition of the Han character for "word". I'd like to see these become real pages, while retaining the useful information (but see note below on "Morobashi").

I think the entry for (e.g.) 詞 should have a Translingual header, under that Han Character (like Symbol is used), with some of the information (radical/stroke number, etc), References, with some of the dictionary information, code points in Unicode, etc. This should (IMHO) be in two templates so we can mess with the formatting and categorization.

Then language sections for the 17 languages that use Han characters and either use the instant one as a word or as a combining form with definitions. Each would include language specific templates with romanizations and readings for that language.

Each entry will be categorized (by the template) in Category:Han characters in proper radical/stroke sort key order. Then also (by the templates) in language categories as established for each language.

Note: the information loaded by "NanshuBot" is apparently useful. But one wonders when one of the dictionaries referred to is named "Morobashi". The dictionary in question is the Dai Kanwa Jiten, compiled by Morohashi ...

Comments? Robert Ullmann 15:46, 9 August 2006 (UTC)

I will wait to see an example. It is easier for me to react if I can see something concrete. In general however, it sounds like a worthy cause, if somewhat daunting. I am a little concerned about the translingual header because I'm not sure we can count on individual characters always meaning exactly the same thing across languages. For example, one meaning of 空 (Mandarin, Pinyin: kòng) is "free time" or "leisure time" but it is not used this way in Japanese or Min Nan so far as I know.

A-cai 13:04, 10 August 2006 (UTC)

I'm going to set up an example presently. The character should be defined under the language headers; the only thing that might be part of the Translingual section is the "common meaning" listed now, which is not a definition. If we want to make it go away later, we can just remove it from the template ... Robert Ullmann 16:04, 10 August 2006 (UTC)

I've set up the promised example. It is -- of course -- 字. (What else?!) The character info is in the templates, but I haven't sorted out the language sections or tried to add proper definitions. Japanese is already fairly good. Cantonese and Mandarin need to be separated, and Min Nan added (presumably). I don't know enough to separate out the "compounds"; and we should decide what to call them; it is usually just "Related terms" I think. Anyway, plenty to look at. I carefully made sure the templates are close enough in structure to the NanshuBot stuff so that they could be modded in by bot (this point is critical; we aren't doing this by hand!) Then we can go from there to wherever we want. The language definitions of course have to be edited, no help for that, they all need definitions. Robert Ullmann 14:59, 11 August 2006 (UTC)

I will work on the Chinese sections in the next day or so. BTW, I have not liked the compounds section in individual entries for a while. It seems like excessive work to put compounds in there. Don't get me wrong, we need that information, but wouldn't Special:Allpages/字 suffice? If we are serious about putting compounds into individual character entries, prepare for me to dump 453 Mandarin compounds into 字 (based on 國語辭典 on-line Chinese dictionary)!

A-cai 15:43, 11 August 2006 (UTC)

Robert, I am finished with my initial stab at 字. Please take a look, and let me know what you think. There are several format points that I'm unsure of. Where exactly to put the archaic meanings? I couldn't decide so I put them under both Translingual and Mandarin for the time being. When I tried to do the Bronze script and Seal script images as thumbnails, they were too big, so I shrunk both down to 100px each. Unfortunately, we lose the caption because of this. Is there a way to make the images both small and have the caption?

A-cai 05:19, 12 August 2006 (UTC)

I think the archaic meaning should just be under Mandarin? As you noted, even the "common meaning" bit is confusing; anything that is a definition should be under a language header. The Etymology is in just the right place. I took out the IPAfont template; IMHO in Wiktionary (unlike the 'pedia) IPA occurs all over, so it isn't necessary to keep mentioning it. Besides: if you've loaded East Asian character support, I think by that time you have IPA?

Notice what happens to the images when you click on "hide" in the contents box. Wiki just doesn't handle images very well when there are more images than text. Are there images like this for lots of characters? Perhaps we go add bronze=, seal=, and something for stroke order (I used so=, but I could change that) to the Han char template (as optional). I still need to learn a bit more about what we can do with images. Looks good. Robert Ullmann 11:07, 12 August 2006 (UTC)

Look at it now. In general, you only want to use something small like the zh-forms template(s) (or, e.g. interwiktionary, etc.) at the very top of the page, else it looks terrible with the contents box collapsed (hide). I put the image references right after the translingual header. Good place I think. ~~Left stroke order at default width~~, forced the bronze and seal scripts down to 70px. Robert Ullmann 11:23, 12 August 2006 (UTC) Inlined stroke order in the template. Robert Ullmann 12:01, 12 August 2006 (UTC)

The image files are part of a Wikimedia project to create a complete set of SVG images depicting ancient Chinese characters. Several dozen have already been scanned in. It would be terrific if we could eventually include images in each individual Han character entry that would be representative of at least the main styles of calligraphy. These include:

Of course, all this will take many years, but you have to start somewhere. Someday, when we have much faster computers and networks, we may be able to turn Wiktionary into a calligraphy dictionary by adding images of all major calligraphy variations of a given glyph.

Question, given that the root meaning of 字 does not closely match the modern day common meaning, do you think that the etymology section is sufficiently detailed? Should we be more explicit in tracing the evolution of its meaning over the centuries or are the provided archaic meanings sufficient for this purpose?

More etymology would be very good! Robert Ullmann 11:36, 13 August 2006 (UTC)

Other than that, I think the only other major thing we're missing are the sound files for pronunciation. Again it will take quite some time to get them all in, but well worth it. A-cai 14:56, 12 August 2006 (UTC)

Comments from the peanut gallery:

I plan to 'bot replace the Morobashi/Morohashi mess at some point. Is this becoming a priority?
Yes, please add audio. I mean, PRETTY PLEASE add audio. The current technique is spelled out at Help:Audio pronunciations.

--Connel MacKenzie 17:22, 12 August 2006 (UTC)

I'm working on something like a bot to move all the NanshuBot stuff into templates as in this example. The template (Han ref) at the moment uses Dai Kanwa Jiten instead of "Morobashi". I'm saying "something like a bot" because I don't think it should run on its own; there are too many little variations to tweak. I'm thinking I can run some Python code on each entry, then check it by hand. But that's just the present line of thinking. Robert Ullmann 11:36, 13 August 2006 (UTC)

Um, the problem being that there are 17,970 of these entries (down from 17,971 ;-). (Oh, Connel: please don't fix Morobashi! This is a very useful flag for these entries. Although there are other ways to figure out what to feed a hungry bot ;-) So I'm going to have to have the "bot" code know enough to identify the troublesome ones ... Robert Ullmann 11:55, 13 August 2006 (UTC)

I don't want to seem ignorant, but what exactly does Morobashi refer to? Is that a dictionary or something? Also, what are the numbers for? (ex. Hanyu Da Zidian: 21010.020) Are those a page number or something? I have an abridged version of this dictionary, but those numbers don't seem to apply to my copy, or if they do, I can't figure out how.

A-cai 12:10, 13 August 2006 (UTC)

"Morobashi" is a bad Kanji reading of the name of the compiler of the Dai Kanwa Jiten: Morohashi Tetsuji 諸橋轍次; it was introduced by user "Nanshu" the author of "NanshuBot" that loaded all of these entries into the wiktionary; it occurs nowhere else on the web according to google. The number is the position in the unabridged dictionary, e.g. Hanyu Da Zidian: 21010.020 means volume 2, page 1010, line 02, 0 means it is the character on that line, 1 would mean a character that would appear between that line and the next, but isn't in the dictionary. The other dictionaries have similar numbering schemes. (These were not invented by "Nanshu", they are the standard reference numbering for these dictionaries.) Robert Ullmann 15:32, 13 August 2006 (UTC)

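The numbering scheme described above can be sketched in code. This is only an illustration of the layout as explained in this thread (one volume digit, four page digits, a dot, two line digits, one trailing flag digit); the names `HanyuPosition` and `parse_hanyu_position` are made up for the example and are not part of any template or bot here.

```python
import re
from typing import NamedTuple


class HanyuPosition(NamedTuple):
    """Hypothetical container for a Hanyu Da Zidian position."""
    volume: int
    page: int
    line: int
    in_dictionary: bool  # True when the trailing flag digit is 0


# Assumed layout: V PPPP . LL F  (e.g. 21010.020)
_POSITION = re.compile(r"^(\d)(\d{4})\.(\d{2})(\d)$")


def parse_hanyu_position(ref: str) -> HanyuPosition:
    """Split a reference such as '21010.020' into its parts."""
    match = _POSITION.match(ref.strip())
    if match is None:
        raise ValueError(f"not a recognised position: {ref!r}")
    volume, page, line, flag = match.groups()
    # Flag 0: the character itself is on that line. A non-zero flag
    # marks a character that would sit after that line but is not
    # actually in the dictionary.
    return HanyuPosition(int(volume), int(page), int(line), flag == "0")


print(parse_hanyu_position("21010.020"))
```

Read this way, 21010.020 comes out as volume 2, page 1010, line 2, with the character present in the dictionary.
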
I have a number of observations.

While Nanshu's work may be subject to valid criticism, it would not be proper to disparage his efforts. At the time he was the only one working on this kind of material, and the most common complaint was why there should be so much Chinese on the English Wiktionary. The term "translingual" was not yet invented in a Wiktionary context.
I'm glad to see that you have not overworked the templates to serve multiple purposes as has happened in English.
The reference to the radical should show the standard radical number. Many dictionaries are organized that way and that will make it easier for the person looking up material.
Although we know that 宀 is not the radical for 字 we should find some way to accommodate those people who find this intuitive.
"Dumping" the 453 compounds of 字 into that article (or perhaps into a sub-page) is still a worthwhile exercise. This will show a series of red links for things that still need work.
I agree with removing the IPA font template for the reasons given. I know that Steve likes these things, but I don't find that they accomplish much.
Etymology and core meanings in the translingual section are still important, as long as people take them for what they are. Elsewhere, every time we tag something as "archaic" we are expressing a point of view that should be substantiated. In the short term this is highly impractical, but ideally we should eventually have quotations with dates for each meaning.
The Mandarin entries should have the pinyin wikified. This can be the basis for a lot of cross referencing.
The Mandarin entries should show the Wade-Giles romanization, and probably some other old ones. It is now obsolete but a lot of books from the past used it.
Is it necessary to have separate categories for simplified and traditional characters when there is no difference?
The Mandarin section currently divides the meanings into "Noun" and "Verb"; should we consider whether this is appropriate?

Eclecticology 22:31, 29 August 2006 (UTC)

Special:Allpages/字 ?

Would it be better if we use Special:Prefixindex/字 instead of Special:Allpages/字?

Template:zh-hanzi-box

-- Hiòng-êng 02:09, 29 August 2006 (UTC)

Apparently yes; there's no point to listing a lot of irrelevant material. Eclecticology 21:38, 29 August 2006 (UTC)
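
To make the difference concrete, here is a toy sketch of the two listings. It assumes that an "all pages from" listing returns every title sorting at or after the starting title, while a prefix listing returns only titles that begin with it; the sample titles and both function names are invented for illustration, and plain code-point ordering stands in for whatever ordering the wiki really uses.

```python
from typing import Iterable, List


def all_pages_from(titles: Iterable[str], start: str) -> List[str]:
    """Every title that sorts at or after `start` (Allpages-style)."""
    return [t for t in sorted(titles) if t >= start]


def prefix_index(titles: Iterable[str], prefix: str) -> List[str]:
    """Only the titles that begin with `prefix` (Prefixindex-style)."""
    return [t for t in sorted(titles) if t.startswith(prefix)]


# Invented sample of page titles.
sample = ["字", "字典", "字母", "学", "孩子"]

print(all_pages_from(sample, "字"))  # also lists unrelated later titles
print(prefix_index(sample, "字"))    # only the 字 compounds
```

The first listing keeps running into titles that have nothing to do with 字, which is the irrelevant material mentioned above.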

Eclecticology: In general, I agree with your observations. With respect to your question about characters that are the same in both traditional and simplified. Yes, these should be placed in both categories. There are several good reasons to do so. The most basic reason is that a character which is the same in simplified and traditional belongs to both the traditional character set and the simplified character set. This is particularly important for people who only know one or another script. If I only know simplified, I should be able to go to the simplified category and find all my words there. This way, I would not have to wonder if any traditional characters have been slipped in, which I would not want to see. I would also not have to worry about a word not being in the simplified category because it happens to be identical in traditional script.

Hiòng-êng, I like the Special:Prefixindex, I would have suggested it myself had I known about it.

A-cai 08:22, 30 August 2006 (UTC)

Ec, one more factoid for you to help frame the issue. One of the most common characters, 人, has the following in Guoyu Cidian Chinese dictionary:

2,662 - number of words and phrases which contain 人
465 - number of words and phrases that start with 人

A-cai 08:30, 30 August 2006 (UTC)

I edited Template:zh-lookup to use Special:Prefixindex instead of Special:Allpages. Hope no one objects.
-- Hiòng-êng 01:11, 4 September 2006 (UTC)

Category:English irregular plurals

See Category talk:English irregular plurals#Category name. bd2412 T 02:23, 14 August 2006 (UTC)

Since no one has popped up on the page linked above, I'll copy my proposal here: Let's move Category:English irregular plurals (which currently contains the singular of words that have an irregular plural) to Category:English nouns with irregular plurals and make this Category:English irregular plurals a category for the irregular plurals themselves, with subcats for Category:English irregular plurals ending in "-i", Category:English irregular plurals ending in "-ae", Category:English irregular plurals ending in "-en", Category:English irregular plurals ending in "-ves". bd2412 T 18:21, 16 August 2006 (UTC)
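
As a rough sketch of how a bot might sort plurals into the proposed subcategories: the function below only looks at the ending of a word that is already known to be an irregular plural (it does not decide irregularity itself), and the function name and ending list are assumptions made for this example.

```python
BASE_CATEGORY = "Category:English irregular plurals"

# Endings from the proposal above, checked in this order.
SUBCATEGORY_ENDINGS = ("ae", "ves", "en", "i")


def plural_category(plural: str) -> str:
    """Pick the proposed subcategory for a known irregular plural."""
    word = plural.lower()
    for ending in SUBCATEGORY_ENDINGS:
        if word.endswith(ending):
            return f'{BASE_CATEGORY} ending in "-{ending}"'
    # No matching ending: stay in the parent category.
    return BASE_CATEGORY


for word in ("cacti", "formulae", "oxen", "knives", "cherubim"):
    print(word, "->", plural_category(word))
```

Anything without one of the four endings simply stays in the parent category.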

It seems to me a good starting point. My question as a newbie is: who decides these moves and how are they effected? I ask because I observed that the category Grammar had subcategories for a number of things, including "Part of speech", which in turn hosts many things, among them "Conjunctions", but

in "English conjunctions" I couldn't find the expected branch for "English subordinating conjunctions" (hosting for example "in so far as"), and
in "French conjunctions" (my basic language reference), I couldn't find the due differentiation tree between "French coordinating conjunctions" and "French subordinating conjunctions" (all were put in one place, on same level).

I would have been more than happy to contribute to Wiktionary by adding some such differentiation but could not work out how to do it. Please, help. PhL (philippe.lebourg at st.com)

Support. Makes much more sense. Jeffqyzt 00:19, 19 September 2006 (UTC)

Thanks, but this topic has not really gotten enough attention, so I'm going to pull an old lawyer's trick: Does anyone object to the fact that I will make the above proposed change if no one says otherwise within the next 24 hours? bd2412 T 02:53, 20 September 2006 (UTC)

NO!! ... um, yes, of course the category should contain the irregular plurals themselves .... ;-) cherubim and cherubims are good candidates. Robert Ullmann 05:59, 20 September 2006 (UTC)

Well, the singulars are all recategorized - now for the plurals... bd2412 T 02:12, 21 September 2006 (UTC)

Category:English nouns

First section of discussion

Considerable confusion has today resurfaced regarding this category. The historic objections (as far as I remember) to it were:

Large categories cannot be navigated
Additional categories in the wikitext are distracting

Both of those complaints have since been addressed.

Category navigation has better linking available. And direct navigation is not the only use of a category. Categories are used for building indexes, random entry tools, as well as building things for sister projects.

Category linking text no longer appears in the wikitext, by merit of being hidden properly in the inflection templates. While this does raise the issue of "hidden magic is happening" type of confusion, it is offset by the use of ever-simpler templates. Given enough leeway to proceed, the template simplification/automation can only make things easier for contributors, especially lost newcomers.

If there is something I've forgotten from the previous few years' conversations, or something that wasn't complained about previously, please list it here. --Connel MacKenzie 17:13, 22 August 2006 (UTC)

Connel has responded to this issue in several places, but I will limit my responses for now to this place alone. The question of whether Category:English nouns should be used has indeed been a matter of debate for some time, and is likely to remain debatable for some time yet. It is no secret that I find it utterly useless, and have held that view for a long time; that being said I have never engaged in its wholesale deletion. I have in most cases, out of respect for the variety of views in the community, only engaged in removing it when I replaced it with a more informative option. Burying this categorization in a complicated template that has nothing to do with categories can only be viewed as a way of making sure that that category remains the same no matter what anyone else thinks. As Scs said above: "In particular, there seem to be several efforts to use them to impose additional bits of structure, and enable additional bits of automated processing, beyond what MediaWiki natively supports."

The events that led to this imposition went quite quickly. The proposal to put the category in the inflection templates came in the middle of the "unnecessary adjective senses?" thread above; it did not even appear in the title of the thread. The thread was begun on August 5, the proposal was made on August 6, and by August 9 a bot which made massive changes was fully operational and soon managed to change many thousands of articles. There was also some discussion on Grease Pit but that is no place for policy decisions because non-technical people do not spend time there. All approval processes for the bot were ignored, and an inexperienced bureaucrat authorized the bot without so much as an attempt to discuss it with his colleagues. (Dvortygirl and I were both at Wikimania.) There was no emergency. Connel has been through the bot approval process before, and he will vouch that it can be long and tedious, but I see no complaints from him about the process that was used. I have made the bot inoperative, but I'm afraid that it has already done its damage. I will only reinstate it temporarily if someone can use it to undo the damage.

I can and have shown some tolerance for the inflection templates, but it needs to be emphasized that these are entirely optional, and any wholesale attempt to impose them on all articles, or to remove them from all articles should be viewed with extreme concern. If you want to change one in either direction, go ahead, but that should probably only be done at the same time that you are making other more substantial edits to the article. Similar comments can be made for the {{m}} templates to represent m. for masculine gender, and its related templates. It is hard to imagine a more pointless use of a template.

We need to remember that Wiktionary was not devised to be a playground for techies, even though the rest of us appreciate their service in the vast majority of circumstances. The core interest of Wiktionarians is words, not technical tricks. They like to see and edit real and understandable material, not puzzle over templates whose meaning is far from clear. In exchange for this they can accept the burden of having to type a little more than would be needed with a template. There is some value to standards and uniformity, but there is also value to a wiki markup whose total basics can be explained in very few lines. That simplicity has no doubt been one of the important factors in the success of wikis in general. Eclecticology 08:25, 23 August 2006 (UTC)

I for one agree with Ec in finding English Nouns a pointless category, nor do I think any visitors make use of it. However, it doesn't offend me especially and I don't object to it if there are really valid technical uses for it (but are there?). As for inflection templates, I like them more than Ec seems to, but again I agree that casual users should not feel they are compulsory. Widsith 12:12, 23 August 2006 (UTC)

Whether there are valid technical uses for it has never been established. For me the biggest problem is that the two have been merged effectively preventing the category from being removed or changed on any individual page. Eclecticology 10:32, 25 August 2006 (UTC)

I'm sorry but I must make an attempt at humor (always difficult via text messages :) Wiktionary is replete with pointless information and pointless categories! How many people really care about the etymology of the word nunchaku? I have no idea, but let's not get overly pious about useful vs non-useful. Having said that, I agree that it is valid to debate about the inclusion of absolutely ridiculous information. However, I don't think the nouns category quite reaches that level ;)

A-cai 13:37, 23 August 2006 (UTC)

Scs's first response

I agree with just about everything Ec said above. Here's the key place where I think you're wrong, Ec: in imagining that these templates cause any significant problems for the non-techy users who don't care about them.

Yes, the templates are techy. Yes, the automatic categorization tags that might be lurking within them are techier still. Yes, Category:English nouns is useless if you're not a robot. But -- so what? No one's forcing people to use the templates; no one's forcing people to step through each page of Category:English nouns until their eyes glaze over.

I'm a pretty pathetic tech-head myself, and I don't even bother to use {en-noun} (or whatever it's called) when I create a new entry. (I simply don't care.) But I know that Rod or Hippietrail will be along to add it shortly.

I suppose it could be argued that those templates are a hindrance to editors working on existing pages, but in this case, I suspect there's a pretty good correlation between editors who don't care much about (and would be happy to ignore) inflection templates, and editors who don't care much about (and would be happy to ignore) inflection lines at all. (Ergo, they can happily ignore them either way).

It would be good if we could somehow get some input from actual, ordinary, in-the-field editors on this. Do they really care one way or the other? I don't know; all I know is that ivory-tower speculation is unlikely to yield a valid answer. (I'm speculating that most editors don't care; you may well speculate that they do care; I won't argue with you, but I wish we knew for sure.) —scs 14:30, 23 August 2006 (UTC)

I wouldn't speculate either way. I chose not to make this an argument about templates in general, or about whether inflection templates should be used at all. I don't and won't use the templates, but when I choose not to use one I expect that to be respected without someone coming behind me to replace what I have done. Eclecticology 10:32, 25 August 2006 (UTC)

"If you don't want your writing to be edited mercilessly and redistributed at will, then don't submit it here." — Vildricianus 10:18, 3 September 2006 (UTC)

(reply to EC, if it isn't clear) I don't understand why you hate templates so much. As Scs said above: "In particular, there seem to be several efforts to use them to impose additional bits of structure, and enable additional bits of automated processing, beyond what MediaWiki natively supports." (as you quoted). This is absolutely essential for Wiktionary. We are building a very highly structured data base on software that simply does not, without "additional bits of structure", provide what is needed.

Some templates are indeed essential, but when equivalent material can be as easily included with basic wiki markup that template is not "needed". With nouns, whether plain markup or the template is used the result is exactly the same. Eclecticology 10:32, 25 August 2006 (UTC)

Templates allow new users (and old) to get the structure and style exactly consistent, without twiddling every ' and constantly hand-checking that each entry is in exactly the desired categories. They also, critically, allow us to make changes in structure and style without massive bot runs that can be hard to revert (easy to revert the template), and allow per-user presentation options. Of course we don't demand people use them, but we certainly must encourage it.

As Cunctator said on the WikiEN mailing list, "some degree of consistency is good but too much is the hobgoblin of little minds." Getting structure and style so exactly right is of minor importance. And what is "exactly the desired categories?" If one chooses to subdivide a category amending the template will be of no help whatsoever. If there is real consensus about the bot run who would want to revert it?

This is an aside as it doesn't really address this thread directly. I'm sure you find at least some bots useful, so you'll understand that a lot of very minor details have been standardized that could be confused with hobgoblin though these standardizations were only for the sake of the bot's mechanical output. If anyone runs across these issues, it is a second standard for bots that is not intended to be imposed on contributors. Spacing for instance is negligible and random coming from contributors, but unfortunately consistent and exact from any bot. A grayer area is the standardization of headings, which as you know is vital for languages, key for parts of speech, and standard though more open as the ranking decreases. DAVilla 23:22, 29 August 2006 (UTC)

It would be unrealistic for me to say that no bots should be allowed. When misused, bots have a great potential for damage, or if not damaging at least can be a way for someone to impose a particular vision or idea where the community is ambivalent about the idea. Once imposed, such a vision can be very difficult to undo without another bot. In the present circumstances it is very easy to remove the category from the template, but that does not put it back as a simple category in the articles that had it before. We can't easily carry on as if the bot had never been activated. When the current tempest is cleared up I hope to go into more detail about bot approvals, but the process should make it clear that a bot is being requested, and that request should not be mixed in with a decision about the underlying task. If the community does not accept that a process is good if done manually, there should be no question of doing it with a bot. At the same time it should make it easier and clearer to deal with completely non-controversial tasks like the spacing that you mention. Eclecticology 00:42, 30 August 2006 (UTC)

Keep this in mind: the MediaWiki software is not sufficient for this task in the long term; we will have to go to some kind of relational DB-based system (whether WiktionaryZ or something else). When that happens, templated entries will convert automatically. Almost every entry that doesn't use inflection templates, etc, will have to be converted or checked by hand. By using templates now we are making that work easier. It may be it is the only thing that will make it possible. Robert Ullmann 14:38, 23 August 2006 (UTC)

The understanding is that WiktionaryZ will not replace any existing Wiktionary unless the people there want it. WiktionaryZ is still far from being able to do what you want. I'm working for this project, not WiktionaryZ. How can you be sure that they will want the inflection templates? They haven't even decided how they will handle parts of speech. Based on the conversations that I have had with GerardM there is a lot of room for accommodation. There are ways that the MediaWiki software is not appropriate to what we do here, particularly in handling the translations, but in many ways it's doing a much better job than what some techies would have us believe. Eclecticology 10:32, 25 August 2006 (UTC)

To bring this to the original point: you insist that the category doesn't belong in the POS template? Correct? Consider this: with the Category in the POS template (en-noun), and with the template used consistently, we can:

add the category English Nouns
change the name of the category to (say) Nouns
or delete the category

Each takes a one line change, easily revertable (as you know) to Template:en-noun and a wait of a day or so for the background sweep to update all categorizations (this happens automatically). Without the templates, each would be a massive bot run. See why templates are useful? Robert Ullmann 15:15, 23 August 2006 (UTC)
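
A toy model of the argument, for anyone who finds it easier to see in code. This is not how Template:en-noun is actually written; `render_en_noun`, its parameters, and the naive "-s" plural are stand-ins invented for the sketch. The point is only that every entry picks up its category from one shared place.

```python
from typing import Dict, Iterable, Optional


def render_en_noun(headword: str, category: Optional[str]) -> str:
    """Hypothetical stand-in for an inflection-line template."""
    line = f"'''{headword}''' (plural '''{headword}s''')"
    if category is not None:
        # The category link lives here, not in each entry's wikitext.
        line += f"\n[[Category:{category}]]"
    return line


def render_all(entries: Iterable[str], category: Optional[str]) -> Dict[str, str]:
    """Render every entry through the one shared template."""
    return {word: render_en_noun(word, category) for word in entries}


entries = ["barrel", "nut", "word"]

added = render_all(entries, "English nouns")  # add the category
renamed = render_all(entries, "Nouns")        # rename it
removed = render_all(entries, None)           # or delete it

print(added["barrel"])
print(removed["barrel"])
```

Each of the three outcomes differs only in the single `category` value passed in; nothing in the list of entries changes.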

Again I'm not talking about templates in general but about this template. The changes that you mention are all or nothing. If, for example, I wanted to subdivide the nouns into concrete and abstract nouns, how would you propose doing that? Eclecticology 10:32, 25 August 2006 (UTC)

Japanese verbs are divided into Type 1, 2 and 3, and Japanese adjectives into い or な declined forms, as well as others. Works just fine with or without the templates. When you use the templates, you specify the type or declension. Robert Ullmann 13:00, 25 August 2006 (UTC)

Good, then either way should be usable, and you don't have categories buried in the template. Eclecticology 09:04, 28 August 2006 (UTC)

The templates and categories are useful (e.g. , , using Google), even if EC believes otherwise. They give readers a consistent product whose style we can easily change. All of EC's edits in the main namespace today were reverts of my additions of the headword/pos/inflection templates. That behavior is very counter-productive.

You conveniently fail to note that in that batch of words there were at least some where I had added a plain inflection line where nothing at all existed before, and you had previously reverted that. Please make sure you tell the whole tale if you're going to tell any.

Wiktionary belongs to this community, not to EC. His "tolerance" comment is inappropriately authoritative. He is in no position to declare the primary focus of all Wiktionarians as he attempts by saying, "The core interest of Wiktionarians is words, not technical tricks". He is in no position to ban community-endorsed edits made to improve Wiktionary, as he attempts to do by reverting me and saying, "If you want to change one in either direction, go ahead, but that should probably only be done at the same time that you are making other more substantial edits to the article." He is in no position to stagnate our automated cleanup efforts, as he attempted by blocking my bot and decreeing, "I will only reinstate it temporarily if someone can use it to undo the damage."

Better authoritative than authoritarian as you seem to want to be. If a dictionary is not about words, what is it about? What was the "ban"? I understand that your POV is that any approach to edits that differs from yours is not done to improve Wiktionary. My comment about changing the inflection line in either direction was directed to avoid wholesale changes and promote mutual respect. You're confusing automated cleanup with automated dictatorship. The bot was not properly authorized. You made no effort to seek community consensus about using it. Please look at the discussions regarding Connel's earlier bots that were approved.

EC speaks of imposition, but his reverts over the past two days are the only attempts of imposition here. Rod (A. Smith) 16:12, 23 August 2006 (UTC)

I don't see how a handful of reverts can be more of an imposition than a systematic burying your preferences in a template where it will be nearly impossible to change something at the individual article level. Eclecticology 10:32, 25 August 2006 (UTC)

DAVilla's response

On the subject of large categories, there has been a recent change in attitudes due to their potential utility. This is a new development, and despite wanting to add to that conversation I won't do it here.

There's a big difference between a real utility and the potential one that you imagine. Eclecticology 10:32, 25 August 2006 (UTC)

I suppose the plural you is meant, because I've always fought against the creation of large categories. DAVilla 22:33, 29 August 2006 (UTC)

When I am responding to multiple comments in a complicated thread, it is safer to treat "you" as generic. ;-) Eclecticology 22:42, 29 August 2006 (UTC)

On the subject of templates, in particular the speed with which their use was implemented, I think the reason it developed so quickly is precisely because there were no objections to it. Frankly you're the only person in my memory who has raised any flags (granting I've been here only shortly), and others, if they're not entirely supportive, are arguing the terms, e.g. "they're better, just don't require me to use them". Yes, there were probably too many techies around, and I'm sorry if you were away, but I don't blame Connel or anyone for not playing devil's advocate with himself, asking what possible objections there might hypothetically be to every little issue beyond the real if few objections raised here, not to mention that many dictionary communities for other languages have already accepted and implemented the use of templates wholesale.

Again we aren't talking about templates in general. There were no objections because no time was given for objections. This is hardly a little issue when at least 10,000 articles were affected. Each Wiktionary sets its own policies. Eclecticology 10:32, 25 August 2006 (UTC)

On the overall issue of burying categories in templates it was noted that there had been question in the past, but that things had resolved now and the conclusion was that it would be easier for a newcomer to use a single template than having to remember the code for styling (which templates have in fact diversified), the names of the specific categories, etc. which were more commonly just omitted. This change came before August, so by the time you saw things moving in high gear there weren't really even any questions to raise. This is meant to address the speed at which things are moving, and not so much to support the conclusions themselves. Certainly the questions can be put to review, but if you really think the bot you stopped "has already done its damage" then I'd hate to see your reaction when, after reviewing all of these issues again, the bot is reactivated under consensus.

Remembering the category is not difficult when there is only one at issue. It's much easier than remembering all the template varieties. Where in the past was there a consensus for burying the categories in such a template, especially when it's a controversial category? Eclecticology 10:32, 25 August 2006 (UTC)

My understanding is that a lot of the templates have been rolled into each other with fairly consistent naming like {{en-noun}} and {{en-verb}}, and where naming isn't consistent a redirect is a completely transparent patch. If this discussion is narrowly dealing with Category:English nouns then, extending arguments above, the templates are more useful in changing the use of that category if, for instance, we decide not to include all nouns in that category (and then, perhaps, change our minds, and twice again). I don't think anyone is arguing that subcategorization shouldn't be allowed, regardless of the template. Of course that takes human intervention, which is expensive, and so the question we should be asking is the best method of subcategorization. DAVilla 22:33, 29 August 2006 (UTC)

I think that the discussion has become more focused on the narrower issue. It is dealing with including Category:English nouns (with parallel arguments for adjectives and verbs) in the inflection templates. Our categorization and sub-categorization scheme is nowhere near to being developed enough as to reasonably allow categories that are so consistent that the use of bots or templates would be a benefit. I believe that categories should be scalable enough that reasonable subdivisions can be applied. They should also be collapsible in that a higher level category can optionally be made to show its immediate members only or its submembers to whatever depth the user requests. A category that cannot be easily scaled is probably poorly chosen. Eclecticology 00:59, 30 August 2006 (UTC)

Unfortunately the software for categories is not developed to the point you envision, and I can see a snag in trying to do so. In a system like yours a basic simplifying assumption is that the graph structure of categories is a tree with no cycles; that is, no category can contain itself. While this is a sound policy and a fairly easy restriction for our mental construction, the software is not written this way, nor would it be feasible to impose this restriction in the wiki syntax since no error message could be given on such attempt, nor the edit rejected. It would be better to leave out this simplifying assumption, in which case collapsing is possible but probably not in the same folder-style GUI you imagine.

The simplest extension I can think of is this: The basic option we currently have of "entries in this category" would be superseded. For me, "entries in this category but not in any subcategory at any depth" is the most essential option for browsing, and "in this category and all subcategories to any depth" for a random page utility if not both. Of course there could be endlessly many more requested features.

What would be possible to do at present is to modify all templates that include organic chemistry as a category, for instance, to also include chemistry and the sciences, as the most inclusive option. The most exclusive option might also be possible if we restrict ourselves to using templates for categorization, but it would be incredibly rigid, to your disliking, and incredibly tricky, to everyone else's. The third idealized option is the somewhere in the middle laissez-faire: standardized categories where standards allow, plus hand-crafted sub-categorization. This wastes less of our time if in the long run the software we're imagining comes to fruition.

Of course there are many ways to achieve the last option, and here I'm really pushing my own opinions. As many of us see it, standardized categories are easiest to do by automation, because of process, because of consistency, because of the reasons above. Given that the templates would already be there, sub-categorization is also easy to do making use of these templates with an extra parameter. Another option would be to leave the category information in the file, although buried in the template as well, only as a reminder for the purpose of potential sub-categorizations. DAVilla 15:16, 30 August 2006 (UTC)
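A minimal sketch of the two browsing options described above, written in Python purely for illustration. The category names, the entries, and the helper functions are all hypothetical; nothing here reflects how the wiki software actually stores or queries categories. The traversal keeps a set of visited categories, so it still terminates if a category ends up (directly or indirectly) containing itself.

```python
# Hypothetical data: category -> direct subcategories, and category -> direct entries.
SUBCATS = {
    "Sciences": ["Chemistry"],
    "Chemistry": ["Organic chemistry"],
    "Organic chemistry": ["Chemistry"],  # deliberate cycle, to show it is tolerated
}
ENTRIES = {
    "Sciences": {"science"},
    "Chemistry": {"molecule", "benzene", "alkane"},
    "Organic chemistry": {"benzene", "alkane"},
}


def descendants(cat, subcats):
    """Return every category reachable below cat, excluding cat itself."""
    seen = {cat}
    stack = [cat]
    while stack:
        current = stack.pop()
        for child in subcats.get(current, []):
            if child not in seen:  # the visited set makes cycles harmless
                seen.add(child)
                stack.append(child)
    seen.discard(cat)
    return seen


def entries_deep(cat, subcats, entries):
    """Entries in this category and all subcategories to any depth."""
    result = set(entries.get(cat, set()))
    for sub in descendants(cat, subcats):
        result |= entries.get(sub, set())
    return result


def entries_shallow_only(cat, subcats, entries):
    """Entries in this category but not in any subcategory at any depth."""
    in_subcats = set()
    for sub in descendants(cat, subcats):
        in_subcats |= entries.get(sub, set())
    return set(entries.get(cat, set())) - in_subcats
```

With this toy data, an entry listed in both "Chemistry" and "Organic chemistry" drops out of the shallow view of "Chemistry" but still appears in the deep view, which is the behaviour the two options call for.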

I'm sorry if I don't follow everything that you say, but this is a fundamental problem for this project between those who attach priority to content and those who attach it to structure. This is not a matter for blame, but it does mean that people are often talking at cross-purposes. If I look at the recent template work by someone like User:Fabartus I find his efforts totally incomprehensible, although it does seem to have something to do with correlating categories and templates between projects. As much as I don't like the inflection templates, at least I can see what they are trying to accomplish. If I don't understand what a template is trying to accomplish, it becomes very difficult to evaluate its implications on a social level. Sorry too if this became a little rantish.

Yes I do see categories in a treelike structure, and I do see each category in the main namespace as being traceable back to Category:*Topics. I'm still ambivalent about whether a category should contain itself, and I haven't at all considered graphical interfaces. Nor did I consider the use of error messages, by which I presume to include something like, "The category you have chosen does not exist."

Whatever you think I said, ignore it, practically all of it. It's not really your fault either. I actually have a difficult time getting ideas across even to other tech-savvy people because I jump right into them rather than tediously describing the boring backdrop, which I think should be obvious. For you I think I'll back up an additional step. Here's what I was trying to say:

You suggested that categories might some day have extra utility. I was analyzing what kind of utility they could and could not have in the long term so as to help determine the best policy in the short term. My conclusion was that the software would be more difficult to write than desired, (especially for a lazy programmer like myself,) but still feasible. Therefore categories do have a lot of potential use. So, how do our actions in the short term affect that future use? (By the way, a clean tree-like structure of categories I would take as a given since it's actually difficult to construct counter-examples in our minds, but ask me if you want one.)

Anyway, as it pertains to these megacategories, there's no problem with listing a word in both the category and in subcategories. What will happen down the road is that the subcategorized terms will be automatically struck out from a view of the higher-level category itself. Thus if a particular area such as Chemistry is very well sub-categorized into Organic etc., then despite having several hundred thousand terms in the actual Chemistry category itself, fewer than 200, potentially, will actually appear as being "in this category but not in any subcategory at any depth".

By the way, the best reason to populate the various Parts-of-speech categories would be precisely what you'd want, to catch all of the words that fall through the cracks, that are not yet sub-categorized or under unique circumstances (there are always exceptions) could never be sub-categorized. The reason you're so averse to the category buried in the template is that you're trying to sub-categorize now when the tools to help you do that are not yet available. But really the number of categorizations added by the bot must be several orders of magnitude greater than the number of words that have part-of-speech subcategorizations, so I don't see how this would impede your work for quite a while. DAVilla 17:38, 31 August 2006 (UTC)
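The parenthetical above about a clean tree-like structure can be made concrete with a small counter-example. The sketch below, in Python and using made-up category names, shows a category graph in which a category indirectly contains itself, and a depth-first search that reports such a loop. It is only an illustration of the idea, not a description of anything the wiki software does.

```python
def find_cycle(subcats):
    """Return one cycle as a list of category names, or None if there is none.

    subcats maps a category name to the list of its direct subcategories.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(cat):
        colour[cat] = GREY
        path.append(cat)
        for child in subcats.get(cat, []):
            state = colour.get(child, WHITE)
            if state == GREY:
                # child is already on the current path: a category contains itself
                return path[path.index(child):] + [child]
            if state == WHITE:
                found = visit(child)
                if found:
                    return found
        path.pop()
        colour[cat] = BLACK
        return None

    for cat in list(subcats):
        if colour.get(cat, WHITE) == WHITE:
            found = visit(cat)
            if found:
                return found
    return None


# Hypothetical examples: a clean tree, and a graph with a loop.
TREE = {"Sciences": ["Chemistry"], "Chemistry": ["Organic chemistry"]}
LOOP = {
    "Chemistry": ["Organic chemistry"],
    "Organic chemistry": ["Carbon"],
    "Carbon": ["Chemistry"],
}
```

A check like this could only be run after the fact, over the whole graph; as noted earlier in the thread, the wiki syntax itself has no way to reject the edit that closes the loop.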

Some kind of standardized categories could be acceptable, but are we anywhere near to establishing those standards? Thus far there have been a lot of ad hoc categories where I don't think that very many people have thought out how their category of the moment fits into the structure. A thesaurus is a kind of structure. I also refer to my copy of the Library of Congress Subject Headings. (nearly 7,000 pages with 3 columns each) The latter suggests 14 subdivisions for "Organic chemistry" which is one of 39 for "Chemistry" alone. I doubt that we would ever use them all. ... and concrete nouns are the easiest to categorise.

I don't completely discount the possibility of automation, but one huge and highly debatable category is not the place where we should be looking for consensus at this stage. Nor is it even decided that categories will give us the best way of organizing the data. We have indexes and appendixes; we have Wikisaurus which deserves more attention than what it gets when someone discovers 5,000 synonyms for "penis". What will be the relation between categories and the Wikisaurus?

In general I like to use my seniority on this project to look for common ground that brings things together, and to try to build a global view of Wiktionary. I make no apologies for insisting from the beginning that the project include all words in all languages. That alone has opened some incredible and unique challenges which, if overcome, will lead to unparalleled results. We can't let technical solutions run ahead of the problems that they purport to fix. Eclecticology 22:24, 30 August 2006 (UTC)

I can't find fault with this assessment. It looks like we're all looking at the same problems from different angles, which can only be constructive. I don't think the megacategories are trying to tackle the issues you're raising though, or if so barely half a step. It might be easier for us to reach the same conclusions considering the motivation to be a do-what-we-can-now-to-have-the-most-impact sort of attitude. The difference is that you consider the potential utility of unproven modifications shaky and worry it will all have to be undone, while the rest of us have supreme confidence that it is leading in the right direction. DAVilla 17:38, 31 August 2006 (UTC)

I'd like to bring some of these issues to vote. Certainly not all of them have been discussed, but a number have. I take objection to the use of "only" in your statement that "burying this categorization...can only be viewed as a way of making sure that that category remains the same." That is not the only opinion of the issue. The point is not to make the category inalterable, it's to make the category name consistent, to minimize the potential for a mess that requires so much effort as to evade all but the most massively organized attempts. Even if the category seems unrelated, provided any page with that template should necessarily belong in the category then it's okay by me to put it there. This ensures that the categories are populated with fewer omissions, and in actuality makes it many times easier to alter the name if necessary.

Maybe easier to change overall, but clearly more difficult to subdivide the category or to use a different category for certain items. Eclecticology 10:32, 25 August 2006 (UTC)

The use of a different category would indicate that "any page with that template should necessarily belong in the category" does not hold, in which case I would oppose categorization by template. As to subdividing categories, that needs discussion. The clean techie solution is to pass a parameter to the template, or to another template such as {{en-pos|noun|color|verb|phrasal|irregular}}. The dirty techie solution is to append a term to the template name. The third option is more viable when there isn't a consistent structure, where uniformity is not necessary and therefore not desirable. DAVilla 22:33, 29 August 2006 (UTC)

I am therefore also opposed to the replacement of a template on any page if there are no corrections made. I don't know how to use all the templates that are provided and I don't think anyone should be expected to. If there's another plural form then any way a contributor finds to add it should be sufficient. But when and where they work properly templates should be preferred, and so I would oppose any effort to undo them just for the sake of undoing them. DAVilla 15:59, 23 August 2006 (UTC)

Okay then accept too that the same should apply when people have not used templates to start with.

I absolutely agree that making and modifying entries should be as easy as possible for new users and for users who are more interested in the substance. Even for regular contributors, who are expected to follow some basic guidelines, we don't want to impose too many regulations. Even the templates themselves should be fairly straightforward so that they can be modified at later dates. Even the trickiest templates, however, are not necessarily difficult to use, and I would be very interested in discussing conventions for templates like {{italbrac}} which is not clearly linked to synonyms by its name. DAVilla 22:33, 29 August 2006 (UTC)

Connel MacKenzie's response

For questions of process I wish to say a few things:

Regular contributors have no excuse for ignoring WT:GP, especially when they are very well aware that it exists. Yes, the more important issues are brought to WT:BP (if it is clear the topic is "policy"-ish and not "techie"-ish.)
Regular contributors should not be required to wade through a lot of technobabble that they can't understand. Eclecticology 10:32, 25 August 2006 (UTC)
You clearly understand it and have no reason for disobeying the community consensus decisions reached there. --Connel MacKenzie 18:27, 27 August 2006 (UTC)
I may very well have understood this one, but that doesn't mean that I will understand every techie argument on that page. Eclecticology 09:04, 28 August 2006 (UTC)
The fundamental complaints Ec has raised are all "techie"-ish, and belong on WT:GP, not WT:BP.
Whether Categories should be included in noun-templates is not a techie issue, but how to do it is. Eclecticology 10:32, 25 August 2006 (UTC)
Your complaint was about the "how." Furthermore, all the extended uses of categories are techie issues, therefore, according to you, should only be discussed there. --Connel MacKenzie 18:27, 27 August 2006 (UTC)
If they are merely techie issues they should not affect what non-techies do in any way. Eclecticology 09:04, 28 August 2006 (UTC)
Claiming that discussions were inadequate is rather silly. The concept of using the category has been tossed around for over a year.
And still mostly unresolved; so what if it's tossed around for another year? Eclecticology 10:32, 25 August 2006 (UTC)
Completely untrue. The English Wiktionary community has resolved the issue, to everyone's satisfaction (except for one person who chose not to participate in any of the numerous discussions, until now, after-the-fact.) --Connel MacKenzie 18:27, 27 August 2006 (UTC)
Some who have commented don't necessarily see this category as a good thing; they just didn't feel like arguing about it. Even conceding that there have been numerous discussions on this does not mean that any of those discussions are conclusive. The more there are the less certain the alleged decision. And why should any decision be so final? Maybe we need to review the decision making process so that it becomes perfectly clear to everybody when a decision has really been made. A casual understanding among those who happen to be around to comment on BP does not seem enough to commit a whole community. Eclecticology 09:04, 28 August 2006 (UTC)
Jumping to a random entry within a category is a feature of navigation that has been requested numerous times, but remains disabled currently.
References for those requests? Eclecticology 10:32, 25 August 2006 (UTC)
Wikimedia IRC channels are not allowed to be publicly logged. --Connel MacKenzie 18:27, 27 August 2006 (UTC)
Ahhh! That proves my point. There is no evidence. Eclecticology 09:04, 28 August 2006 (UTC)
100% false. --Connel MacKenzie 17:16, 31 August 2006 (UTC)
Categories have other uses than navigation. Still. One way to use the categories would be to generate an appendix (but then, the other main Bureaucrat Paul has suggested the opposite several times; that appendices should be migrated to categories!) My opinion is that both are useful for very different reasons.
Evidently more discussion of this is required. Eclecticology 10:32, 25 August 2006 (UTC)
Because of one after-the-fact objection (that ignores things like dynamicPageList and other extensions?) A category is not an appendix; they serve very different functions. --Connel MacKenzie 18:27, 27 August 2006 (UTC)

Templates:
1. Previous "template wars" were primarily over appearance. With the initial "preferences" page the proof of concept has been demonstrated. Lower-level MediaWiki changes are being discussed (mainly on irc://irc.freenode.net/wiktionary) and seem quite likely.
2. Templates need to be encouraged strongly. Consistent, parsable entries are impossible in the free-text scheme; we've seen that demonstrated for years, here, now.
  - This reflects a change of opinion, for me. While I strongly supported having text inflections in the past, the new features of the templates (making their appearance cater to user preferences) outweigh my previous concerns.
3. The natural place to correctly categorize entries is in templates, to make (as Jimbo said at WikiMania) the process easier for newcomers.
  I don't recall his saying anything about putting categories in templates, and I would not draw that inference from his desire to make things easier for newcomers. Nevertheless, he has previously expressed concerns about instruction creep. Eclecticology 10:32, 25 August 2006 (UTC)
  Your primary objection so far has been the inclusion of the category (in the appropriate template!) Which inference do you dispute? That Wikimedia edits should be easier for newcomers? --Connel MacKenzie 18:27, 27 August 2006 (UTC)
  Your rhetorical question distorts the situation. Making things easier for newcomers to edit is not the same as making it easier for them to comply with your vision of the way things should look. Eclecticology 09:04, 28 August 2006 (UTC)
4. In combination with further preload-automation, the templates will become only easier than they currently are.
You've missed the conversations on 'bot policy reform, it seems. I do not think the bot approval was a mistake by a new bureaucrat. Rather, it was a reflection of the change of practice with regard to the pointlessly onerous bot policy.
The bot policy should be onerous. Eclecticology 10:32, 25 August 2006 (UTC)
The opposite is true. --Connel MacKenzie 18:27, 27 August 2006 (UTC)
Your out-of-process removal of the 'bot flag needs to be reverted, if for symbolic reasons only. As you said, the "damage" has already been done.
Months have passed since 'bot reform was first discussed. Nothing was rushed into. To not participate in the discussions, then complain about the results, seems quite strange. FWIW, 'bot reform has been tossed about (and needed) longer than Category:English nouns.

--Connel MacKenzie 19:55, 23 August 2006 (UTC)

On the last point, I will review the links and comment later. Eclecticology 10:32, 25 August 2006 (UTC)

Please do. The point is not to be adversarial; having you start wheel-warring because you don't "like" a certain category (that the rest of the community has agreed is not only good, but necessary,) is very counter-productive. You have a lot of information retained about categories that can really help how CJKV categories are ultimately arranged. But I don't see any reasonable objection to Category:English nouns. --Connel MacKenzie 18:27, 27 August 2006 (UTC)

The first reference had to do with approval of a different bot. I saw it mentioned in the bot approval log, and chose not to argue about it. The other raised the question of whether a bot should be restricted to a single task; I have no problem with that though I think the low participation rate makes the result inconclusive. Sometimes your claims of community support are on shaky grounds, and it is often not very clear as to just what the community agreed to. Furthermore, please don't misrepresent my position as an argument that is simply against the English nouns category; it is to exclude the category from the inflection template. Eclecticology 09:04, 28 August 2006 (UTC)

Scs's second response

Wiktionary (like any Wiki project) is many things to many people. Some just want to write prose for other people to read, while some want to create a structured database that can be used or extended in more "interesting" ways. The challenge, as ever, is to arrange that the various parties can work together in harmony (or simply ignore each other), despite their divergent motivations and goals.

I think it's important to recognize (though I'm admittedly quite biased here) that the goals of imposing more structure and enabling automatic processing are important. Not everyone is interested in that structure, but plenty are. More importantly, I think it's safe to suggest that the future of Wiktionary depends on it. The current Wikimedia software is woefully inadequate for what we're ultimately trying to do here. Someday, Wiktionary will want to migrate to something that's truly more structured (that is to say, inherently more structured, not "structured" by manually-maintained ad-hoc consensus). If that eventual migration is to be anything like successful (or possible), without requiring massive amounts of manual labor then or abandonment of massive amounts of today's content, we've got to do what we can today to maintain some uniformity and structure.

The question, then, is how to maintain the balance between those who are keenly interested in this structure already versus those who are not so interested, without imposing a lot of, as Ec puts it, "technobabble" on the latter. To that point, my attitude is similar to Wikipedia's infamous "ignore all rules" policy: if the templates are too complicated to remember, or if they change every week, I just won't use them. But where Ec and I part is that I do not mind if someone comes along and "fixes" my inflection lines later. Because, remember, they're not "my" inflection lines; they're the project's. I was not making some value judgement when I chose not to use the template; I was just being lazy.

The next question (for the current inflection-line and noun-category question, anyway) is whether there's any loss when a non-templated inflection line is replaced by a fancy, technical template. Did the editor who initially declined to use the template do so because

1. He was a lazy bastard like Scs
2. He wanted a different look and feel to the inflection line than the template would give
3. He wanted to ensure that the noun in question did not appear in Category:English nouns?

Obviously (1) is no problem, but if there are non-inflection-template-users who are trying to assert (2) or (3), we've got a problem.

Personally, I'm not too interested in look and feel either way. If the consistency we were worried about here were merely visible, it might well be (as Emerson put it and someone quoted Cunctator as paraphrasing) somewhat foolish and perhaps a hobgoblin of little minds. But the consistency we're most worried about (or at least I think we ought to be most worried about) is not merely visible.

Cunc did not credit Emerson. Thanks for pointing this out. Eclecticology 09:04, 28 August 2006 (UTC)

The automatic uses of category-based processing that people have in mind work only if the categories are properly comprehensive. When the people doing that processing notice that a noun is "missing", they're going to need to add it to the category, presumably by changing to (or reverting to) the category-containing inflection template. So, on this point, we do need wide consensus: if the categories are to work for these purposes at all, everyone has to agree that this is a good idea. We would need to discourage people from changing inflection templates back to free-form inflection lines, or otherwise removing words from their part-of-speech categories.

(Sorry this was so long. I'll stop now.) —scs 14:12, 25 August 2006 (UTC)

You are right in distinguishing between those of us who see content as important, and those who see structure as important.

We also differ in terms of migrating this project to something else. Where do you see us migrating? For the things that matter to me, an historical and verifiable record of language that also discusses the subtleties of linguistic change, I do find the software generally adequate for my purposes, except as it relates to the long tables of translations, where I don't participate much anyway. It could very well end up that WiktionaryZ could absorb that part. Shouldn't we know where we are migrating to before we start planning for it?

"Ignore all rules" is one of the more progressive policies on Wikipedia, but it cuts both ways, as is "be bold". "Fixing" can work both ways.

Of the three enumerated reasons for resisting the templates, the first is clearly trivial, the second is not applicable because identical results can be achieved with or without templates, but I do assert the third. The process is completely inflexible. It does not allow for more refined categories which could themselves be subsets of the English noun category, and might themselves be picked up in an "include subsets" extension. There is already at least one entry which should be marked as a "noun phrase", but is already categorized as a noun.

Where do we distinguish between nouns and noun phrases?

ice cream, guinea pig, legal tender, milk of magnesia, high spirits, prisoner of war, batchelor's son, palm tree justice, apples and oranges, law of diminishing returns, one of his majesty's bad bargains

DAVilla 22:43, 30 August 2006 (UTC)

I don't know what kind of automatic processing anyone has in mind, and without knowing that it is impossible to know whether there are any alternative ways of accomplishing the same thing. Even if there is some need for processing based on this category, why should it be a part of the inflection, when having it separate would do just as well and be more flexible? I cannot agree that it is a good thing to have them together.

Certainly diligence in approving bots has its place. Nobody here questions that we should evaluate and test bots before turning them loose in the database, but that is already part of the process. To make the process deliberately "onerous" or "tedious" and to leave them explicitly in administrative limbo for a wikiternity while we ponder (and then ignore) their appropriateness is worse than to reject unwanted bots outright. If the process works as it should, then a bot that is truly inappropriate to the needs of the community will be stopped at the door in a timely fashion, and the volunteer programmer who proposed it may simply shelve it and proceed to another project. It can be revived later if needs change. The appropriate ones should be approved in an equally timely fashion, to improve the consistency of articles and spare us all the tedium of making sweeping, repetitive changes manually.

The programmers and the "inexperienced" bureaucrats who approved these bots are not newcomers to Wiktionary, nor to programming. They are not some interlopers out to destroy the project. Rather, they are trying to improve matters, with the general support of the community. I believe we may trust them to keep an eye on one another and to run and maintain bots in a reasonable and responsible manner. Let us write the policy with reasonable time limits and safeguards, and then use it. —Dvortygirl 05:43, 26 August 2006 (UTC)

I certainly support writing the policy in a way that does this. The other policy for which there is an evident need is one that states when a policy is adopted. Casual agreements in the Beer parlour among those who happen to be around should not mean that the policy in question has been adopted. A discussion begun here a mere week ago will already be difficult to spot when other unrelated discussions begun later have become lengthy. Adopting policies by default is also not the way to go; a minimum level of support within a reasonable time frame should be essential. Eclecticology 09:04, 28 August 2006 (UTC)

Widsith's response

Some observations:

Ec should not be criticised for objecting to decisions ‘after the fact’. There is no fact; previous decisions are not set in stone; the consensus opinion is open to continual debate and revision, depending on the introduction of new users or indeed changes of opinion.
The output of templates may look the same as Wiki markup for Ec and others, but it looks very different for (eg) me, because I have set my .css to display inflection templates as boxes. This is the beauty of them. If templates don't look any different, there shouldn't be any objection to their use – except where it affects the next point.
Clearly (from this discussion), templates are much more widely accepted than the use of Category:English nouns, and we should perhaps separate the two discussions.

Widsith 18:53, 28 August 2006 (UTC)

Your first point is very good. Although there are some things we clearly need to discuss, anything can be brought up for re-discussion. Your second point is great and good to state clearly, although I'd have to say that the customizations have more an effect of quelling disputes than their effect on nearly all users who are not regular contributors, at least until the mirrors take advantage.

Your third point is also excellent. For one, we need to decide if any noun belongs in Category:English nouns even if it's in a subcategory. We also need to decide what tools, if any, will help us better categorize nouns. And probably a few other things. DAVilla 23:05, 29 August 2006 (UTC)

It's always easier to find common ground when someone is not trying to defend a past action. On the second point there seems to be some understanding that there will be no wholesale attempt to replace all the templates, but for a wide variety of reasons some will need to be replaced at the level of the individual article. The outstanding question has been narrowed to one of how to deal with a certain class of categories that are now in templates. Eclecticology 01:16, 30 August 2006 (UTC)

scs's third response

Wow. This has certainly turned into a discussion!

We've danced around (if not trampled into the ground) at least three separate questions, and I'm not sure which ones people still find open and/or interesting:

Should Category:English nouns exist?
Should category tags be buried inside of templates, in general?
Should inclusion into Category:English nouns be buried inside of {{en-noun}}, in particular?

To that third point, Ec suggested that "Burying this categorization in a complicated template that has nothing to do with categories can only be viewed as a way of making sure that that category remains the same no matter what anyone else thinks", and this was quite an illuminating comment, because I had no idea it could be viewed that way. (So I think the word "only" needs to be stricken from that sentence.)

Honestly, I don't believe there was any POV-pushing agenda here; the idea of tying noun categorization to that template was a result of several separate, individually-rational-seeming steps:

It's potentially useful to have a list of every English noun, and
A category is a good way under the current Mediawiki software to generate such a list, but
It's a nuisance to manually categorize every noun; people will tend to forget, and then our list won't be complete. But also
It's equally a nuisance to format inflection lines consistently, so
It'd be good to have a template for that. Finally,
Since the entry for every noun will obviously want to use the template to generate its inflection line conveniently and consistently,
We can just put the categorization tag inside the template invocation, and
Hey, presto, we automatically get a much more definitively comprehensive list of English nouns, with much less tedium and duplication of effort to maintain it.
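The chain of steps above can be sketched in Python. This is only an illustration: `en_noun` is a hypothetical stand-in for the template, and its parameter, its naive default plural, and the exact wikitext it returns are assumptions, not the real template's behaviour.

```python
from typing import Optional

CATEGORY_TAG = "[[Category:English nouns]]"


def en_noun(headword: str, plural: Optional[str] = None) -> str:
    """Return an inflection line with the category tag attached (hypothetical)."""
    if plural is None:
        # Naive assumption: add "s". Irregular plurals must be passed in.
        plural = headword + "s"
    line = f"'''{headword}''' (''plural'' '''{plural}''')"
    # The categorization rides along with the formatting, so an editor
    # who uses the template cannot forget to add it.
    return line + CATEGORY_TAG


print(en_noun("dog"))
print(en_noun("ox", plural="oxen"))
```

The point of the sketch is the coupling: one call yields both the consistent formatting and the list membership, which is exactly the convenience and the contention discussed here.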

Now, it's true, there are a few assumptions lurking in there: (1) that a list of all the nouns is useful and worth all this; (2) that "the entry for every noun will obviously want to use the template to generate its inflection line conveniently and consistently"; (3) that putting the categorization tag inside the template invocation is a "we can just do it" thing. To the gearhead, all three of these are just blindingly obvious; the possibility that anyone could feel otherwise just doesn't even register. But this discussion reminds us that there are more things in heaven and earth than are dreamt of in the gearhead's philosophy.

Up above, Ec asked, "Shouldn't we know where we are migrating to before we start planning for it?" My answer is, actually, not necessarily. This is, I strongly believe, a case in which even if we have no idea where we specifically might end up some day (and it's true, we don't have any idea), there are yet several points on which we can make some very accurate educated guesses.

In particular, I think everyone who is interested in structure would agree that the thing we migrate to, whatever it is, will have more definitive ways of specifying the vital properties of an entry. Regardless of how it's implemented, it will almost certainly have the flavor of

word: dirt

language: English

p-o-s: noun

...

word: schmutzig

language: German

p-o-s: adjective

...

Now, even under a more structured scheme, definitions and other "prose" sections would certainly remain much more free-form, which is why I didn't try to illustrate them in my example. But the point here is that vital properties like "language" and "part of speech" are specified in a structured, formal way, as opposed to the current Wiktionary scheme, in which both of these properties are indicated only by the presence of various level-2 or level-3 headers, which any editor is free to leave out, or rearrange, or spell inconsistently, or whatever.

In other words, right now we tag language and part-of-speech in an ad-hoc way, enforced only by consensus. Under a more-structured scheme, the specification of these properties would be inherent, and it wouldn't even need to be "enforced", because it simply wouldn't be possible to not have the language tag, or to spell it wrong.
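A rough Python sketch of what "inherent" might mean, under stated assumptions: the `Entry`, `Language` and `PartOfSpeech` names are invented for illustration, the two-member lists are placeholders, and nothing here describes an actual or planned Wiktionary schema.

```python
from dataclasses import dataclass
from enum import Enum


class Language(Enum):
    ENGLISH = "English"
    GERMAN = "German"


class PartOfSpeech(Enum):
    NOUN = "noun"
    ADJECTIVE = "adjective"


@dataclass(frozen=True)
class Entry:
    word: str
    language: Language
    pos: PartOfSpeech


def parse_entry(word: str, language: str, pos: str) -> Entry:
    """Build an Entry; an unknown or misspelled value raises ValueError."""
    # Enum lookup by value is what rejects misspellings such as "Englsh".
    return Entry(word, Language(language), PartOfSpeech(pos))


dirt = parse_entry("dirt", "English", "noun")
schmutzig = parse_entry("schmutzig", "German", "adjective")
```

With this shape, leaving out the language is a missing argument and misspelling it is a `ValueError`, so there is nothing left for consensus to enforce.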

(At this point, I realize, all sorts of secondary questions and objections spring up: how does the hypothetical database structure tie multiple senses in multiple languages to one "word"? What if you've got a word in an obscure language that isn't on the official list yet? What if a word is in more than one language, or is a symbol that's "interlingual?" Those are questions we'd have to answer in actually setting up the hypothetical much-more-structured scheme, but I think they do have answers. Let's not get sidetracked by them now.)

Anyway, the whole point of this discussion is that we'd like to do what we can, today, to arrange that the eventual migration to some more-structured scheme can be at least somewhat automated. So, for those word properties that the eventual scheme is likely to treat as inherent -- such as language and part-of-speech -- we'd like to start simulating that "inherentness" today; we'd like to find ways to remove the ad-hocness and free-formness from those attributes and the ways we specify them. And we'd like to do this in some way that combines both convenience and accuracy. If our mechanisms aren't convenient people won't use them, but if they're not accurate they won't work and the eventual conversion won't be automatable and so it's all a waste of time.

So the next question is, how "convenient" can these mechanisms be? Up above, Ec said, "simplicity has no doubt been one of the important factors in the success of wikis in general." This is very, very true, and not to be overlooked or brushed aside. But we're at an awkward spot here because we are definitely not, on a couple of points, as simple as Wikipedia is. We've got a particular, highly stylized entry layout which is very nearly mandatory. You can't just come over from one of the other wikis and bang out a "dictionary definition" using any old level-2 and level-3 headers you like. Well, you can, but one of the regulars will be along shortly to clean up your offering and structure it according to WT:ELE.

So what about those templates? Well, yes, they take a certain amount of additional effort to learn about, and to remember to use. No, they're not as easy as just banging out a free-form inflection line. (But of course easier still is not banging out an inflection line at all!) So any decision we reach on template usage is necessarily going to be a compromise.

I believe that our attitude towards inflection-line templates (and other uniformity-assuring templates) should be very similar to our attitude towards the proper Wiktionary usage of level-2 and level-3 headings. You're free to ignore them if you haven't learned about them yet or can't be bothered to remember; you're free to bang out any free-form prose you want to in a new entry you're composing. But, anyone else is just as free to come along later and rejigger the entry to use them. When this happens, it is not some POV war which the original writer should be incensed by, it is rather an improvement which the original writer should be thankful for. It's a win-win situation, really: you get to be lazy and not use the "official" templates or layout structure (and I don't even mean this pejoratively; laziness can be a virtue), but we get the additional structure which the larger project needs, and we all get to share the newly-added information.

Finally, to a couple of other points:

Up above, Ec said, "Once imposed, such a vision can be very difficult to undo without another bot. In the present circumstances it is very easy to remove the category from the template, but that does not put it back as a simple category in the articles that had it before." But this is really no objection, because there's no need to assume the nonexistence of that other bot. If we decided that Category:English nouns were to be kept but that manual category tags were the way to go, we could and would use a bot to re-add the tag to every entry that had used the template.

Explicit in my suggested additions to WT:BOT is a statement to the effect that bot operators are expected to roll things back (via a new bot if necessary) if later consensus decides that the bot-imposed changes were unwanted. (If we can't be reassured by that contingency, we really can't afford to use bots at all, for anything.)
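As a sketch of what such a rollback pass could look like, here is a small Python function that works on the wikitext of a single entry. The function name, the regular expression and the choice to append the tag at the end of the page are all assumptions for illustration; a real bot would also have to fetch and save pages, which is left out.

```python
import re

# Matches {{en-noun}} or {{en-noun|...}}, but not e.g. {{en-noun-foo}}.
TEMPLATE_RE = re.compile(r"\{\{\s*en-noun\s*(\||\}\})")
CATEGORY_TAG = "[[Category:English nouns]]"


def restore_category(wikitext: str) -> str:
    """Add the explicit category tag to an entry that uses the template."""
    if TEMPLATE_RE.search(wikitext) and CATEGORY_TAG not in wikitext:
        return wikitext.rstrip("\n") + "\n" + CATEGORY_TAG + "\n"
    # No template, or tag already present: leave the entry untouched.
    return wikitext


page = "==English==\n===Noun===\n{{en-noun}}\n"
print(restore_category(page))
```

Because the function does nothing when the tag is already there, running it twice over the same entry is harmless.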

Finally, one last observation about templates. Up above, Ec spoke of the {{m}} and {{f}} templates and said "It is hard to imagine a more pointless use of a template." I have to disagree: for me, anyway, this is a perfect use of a template, in fact it's hard to imagine a better one! I don't want to remember whether we like to use m and f versus masc and fem; I don't want to remember whether they're in italics or bold; I don't want to remember whether a period goes after them. By using a template, I don't have to remember any of these things -- and that laziness alone is enough for me to want to use the template. As a bonus, {{m}} is exactly as many keystrokes as ''m'', and fewer than ''m.''. As another bonus, having used the template, we can change our minds about those formatting details later. As another bonus, we can rig it up so that different viewers see the gender tag differently depending on their preference. As another bonus, we can use the presence of the template as a definitive tag for (say) generating lists of masculine and feminine nouns, or driving the eventual transition to a more-structured database in which this property, too, is inherently and definitively specified. Even if I weren't a gearhead, I think I'd still have to like these templates.

--scs 16:06, 1 September 2006 (UTC)

Vild's response

Since it was I who proposed the concerned category be included in the template, I need to comment here. Reading the above statements, I think I can summarize that there are two opinions regarding Category:English nouns.

Eclecticology states that its classification should be more refined where possible, much like the structure employed by the topical categories. That means that an entry should not be included in English nouns if it can be included in a more specific category.
The proponents of English nouns state that, regardless of any more specific classification, all English nouns should be part of Category:English nouns.

I think this clearly states the main point of contention, and someone mentioned a vote to establish which assertion has the majority of supporters. I'm not pro-votes, though, and I'll simply state why I personally adhere to option 2:

Categories are not aligned in a tree-like structure, but broader categorization can coexist with a narrower lineup. The reasons why a broad category should be kept in place even though the classification is at the same time more specific, have both technical and non-technical roots. The former is more important, since the latter only comes down to getting a list of English nouns. But simple technical tricks and use of the available tools (see m:DynamicPageList) provide a lot of opportunities for the concerned category. They can generate specific lists, filter Recent changes, allow automatic operations, and so forth. For example, give me all English nouns that are also verbs, or chemistry terms, or that are derived from Russian, etc. Some may question the usefulness of this, but then, some may question the usefulness of a free dictionary as well. A disadvantage is, apart from the cluttering of an otherwise mostly ideologically conceived categorization scheme, that a given page may take many categories. That will already be the case on multi-language pages, though, and is inevitable in many other cases as well. This should be solvable or customizable at the software level.
I also assert that Template:en-noun is to this day the best solution for the layout and treatment of English noun inflections. I find it hard to think of disadvantages here, the only one being that one needs to spend about two minutes looking for how to use it. Is it even two minutes? With some additional clarification and explanation of the documentation, it may be a lot less. Template:en-noun is, whether you like it or not, both a structural and aesthetical improvement over the previous stuff that cannot be denied. And still everyone is free to ignore it, so where's the problem?
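The kind of query mentioned in the first point ("all English nouns that are also verbs") amounts to a category intersection. The Python sketch below shows only the idea; it is not DynamicPageList syntax, the `in_all` helper is hypothetical, and the category memberships are made-up sample data.

```python
from typing import Dict, Set


def in_all(categories: Dict[str, Set[str]], *names: str) -> Set[str]:
    """Return the pages that belong to every one of the named categories."""
    sets = [categories.get(name, set()) for name in names]
    return set.intersection(*sets) if sets else set()


# Illustrative sample data only, not real category contents.
sample = {
    "English nouns": {"impact", "dirt", "beer"},
    "English verbs": {"impact", "run"},
}
print(in_all(sample, "English nouns", "English verbs"))
```

Such a query only works if the broad category is complete: a noun that was moved out of "English nouns" into a narrower subcategory would silently drop out of the intersection.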

— Vildricianus 13:11, 3 September 2006 (UTC)

The English Index - a proposal

There have been discussions in the past about deleting Index:English. Certainly it has lots of rubbish in it, and it is far from complete. We also have many other lists of words that need to be added (see intro to Wiktionary:Requested articles:English for example).

I propose that its contents be replaced with a list of all those English words that we actually have, making it a true index. I'm sure that one of our computer-literate people could generate it automatically from all those entries that have a ==English== section, possibly limiting it to single, or hyphenated words (so as not to include proverbs and the like). It could be updated from time to time, as needed.
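A minimal Python sketch of that generation step, assuming the page titles and their wikitext are already available as a dictionary (for instance from a database dump; the loading step is not shown). The `build_index` name and both regular expressions are assumptions made for illustration.

```python
import re
from typing import Dict, List

# A level-2 header exactly of the form ==English== on its own line.
ENGLISH_HEADER = re.compile(r"^==\s*English\s*==\s*$", re.MULTILINE)
# Letters only, optionally joined by single hyphens: no spaces or punctuation.
SINGLE_WORD = re.compile(r"^[^\W\d_]+(?:-[^\W\d_]+)*$")


def build_index(pages: Dict[str, str]) -> List[str]:
    """Return sorted titles of single or hyphenated words with an English section."""
    return sorted(
        title
        for title, text in pages.items()
        if ENGLISH_HEADER.search(text) and SINGLE_WORD.match(title)
    )


pages = {
    "dirt": "==English==\n===Noun===\n",
    "schmutzig": "==German==\n===Adjective===\n",
    "sister city": "==English==\n===Noun===\n",
    "well-known": "==English==\n===Adjective===\n",
}
print(build_index(pages))
```

The sketch depends entirely on the ==English== header being present and spelled consistently, so entries with a missing or variant header would be skipped.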

What do you think? SemperBlotto 10:46, 30 August 2006 (UTC)

What is this index for, or should it be for, anyway?

If it's where a reader might go to "look up a word", it's clearly nonsense and an utterly unnecessary relic, an inappropriate holdover from paper-based dictionaries. In an electronic dictionary, the right way to look up a word is to use the search box. If we want to make it easier on readers who aren't sure how to spell the word they're looking for, one way (rather than spending time maintaining a redundant index) is to work at improving the search function's ability to do fuzzy matches.

If we want to make it possible to scan a list of nearby words (which some readers may well want to do anyway, especially if they have no idea how to spell the word they're looking for), we could do that by... oh, lookit that, it's there already: the "search from here" link on the search results page (which, to be sure, we could also stand to spend some time improving).

If the index is supposed to be a downloadable list for users who need such a list, it clearly needs to be in a different format (i.e. as one big downloadable list, not with all those "user friendly" subpages and intermediate headings).

But if there is a desire to have such a list, then yes, clearly it cries out to be generated automatically. Trying to maintain it manually is a preposterous duplication of effort, a big waste of time, and guaranteed to be perpetually out-of-date.

Finally, however, if a key aspect of the English index is that it's limited to English words (as opposed to all the foreign words which the English Wiktionary somewhat paradoxically contains), then maintaining it automatically is at least a little bit problematic, because of the ad-hoc ways in which we currently tag languages. (See threads elsewhere, e.g. the huge one on #Category:English nouns up above.) (But yes, I do know why the foreign-words-in-the-English-Wiktionary paradox exists.) —scs 11:24, 30 August 2006 (UTC)

There's no doubt that the English index is a little klunky, and it will be perpetually out of date. Automated revision will update it for the words that we have, but will do nothing for the words we don't have, and which should appear in red in the index.

Requested pages in one form or another tend to be a favorite of people who come with great ideas for work that other people should be doing. This exhausts them, and they disappear. A better idea might be to merge all of these "requested pages" lists into the English, and where needed subdivide the index pages.

At worst, the index pages are harmless since they do not interfere with the work of others. Eclecticology 20:39, 30 August 2006 (UTC)

This may sound like heresy, but I still find paper dictionaries charming. The delight of paging through words and finding a word I'd forgotten or never knew is, well, nice.

(Oh, hey, no argument there at all -- I feel exactly the same way! —scs 14:09, 1 September 2006 (UTC))

That curiosity is what I was hoping to duplicate with the Gutenberg page rankings. Some have expressed that the "rank" things tend to do the opposite - in that they cause confusion because they aren't related at all.

Index:English does seem to be the "best" place to consolidate the various request lists (as Eclecticology suggests) in addition to the terms we have already. Unfortunately, it is a larger task than perhaps people realize. The Gutenberg concordance thing (alone) has always been too large to be meaningful. Each list that would be automatically added to the index would need its own criteria.

For example, looking at the current English Requested Entries, that list seems to be done...the only entries remaining there are typos or "protologisms" that are not likely to meet our criteria for inclusion.
The Gutenberg cutoff could be either 100 or 1000 hits (in the entire corpus of Project Gutenberg texts) depending on how archaic/picky we want to be. (In response to Ec's barb: yes, I do use my generated lists of top 1,000 to enter terms here - others sometimes help, if they feel like it.)
The list of three major dictionaries sorted together remains something that I do not wish to touch, as I feel it is copyright-tainted/unnecessary exposure. And my concerns about that list's legal status have never been countered.
The Project Gutenberg copy of Webster's comes with a copyright notice. The other version of it (without the odious copyright restrictions) is only the first one hundred pages. I don't know the status of the (even more outdated) Century Dictionary.

These are each pretty significant problems.

So I think SemperBlotto's original request remains the most reasonable. Having the Index rebuilt on a regular basis will provide the foundation for the enhancements that User:Scs suggests above. Especially when the entries themselves need to be inspected to determine language. --Connel MacKenzie 00:35, 1 September 2006 (UTC)

I've been doing some of my own work at extracting word lists from various corpuses (corpii?),

You're looking for "corpora". — Vildricianus 09:51, 3 September 2006 (UTC)

and the impression I'm getting is that strict hitcount limits don't work very well. If you set the cutoff high enough to filter out things like one-time personal names some author has invented, you end up missing plenty of interesting, truly-worthy-of-inclusion, real words which just don't get used that often. For example, I see from Concordance:Holmes_G that gasogene -- clearly an interesting word! -- is only used there once.

So I'm afraid there's no substitute for a lot of hand filtering and subjective decisions.

On the question of copyrightability of word lists, that's an interesting question. If I take a published, copyrighted dictionary and strip off all the "interesting" content -- definitions, pronunciations, etymologies, everything -- is the resulting list of headwords still copyrighted, and is it a violation if I use it in building my own free dictionary? It's tempting to argue (and plenty of people would) that such a list isn't copyrightable, but as we've just seen, since a fair amount of work can go into selecting a set of words that's "interesting" enough to go into a dictionary of a certain size, I think there's a certain amount of intellectual capital left in such a list.

(But what if I take that list, mechanically diff it against the complete list of words I've already got in my free dictionary, and manually scrutinize the delta for candidates I might want to add? I think I'm on much stronger, pretty much unassailable ground there.)
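The mechanical diff described here is a plain set difference. A small Python sketch, with a hypothetical `candidates` helper and illustrative sample words; the manual scrutiny of the result is the part no code replaces.

```python
from typing import Iterable, List


def candidates(external_words: Iterable[str], existing_titles: Iterable[str]) -> List[str]:
    """Return words from the external list that have no entry yet, sorted."""
    # Comparison is case-sensitive, since entry titles are case-sensitive.
    have = set(existing_titles)
    return sorted(set(external_words) - have)


# Sample data for illustration only.
print(candidates(["gasogene", "dirt", "beer"], ["dirt", "beer"]))
```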

—scs 14:06, 1 September 2006 (UTC)

I don't like the idea, especially in the long term. A list of existing words is what Category:English language is for. That has all the mechanisms in place to keep it up-to-date, although it can't be browsed as an index as of yet. And I would extend this argument for any Index: page that purportedly shouldn't have any red links. Long-term, I don't see the Index: space as being very useful, or at least not in this capacity. Short-term, do what you like. I'm certainly for deleting or overriding the current content, as per Connel. DAVilla 03:22, 1 September 2006 (UTC)

Category:English language has never been for listing all English words. In a sense Category:*Topics does that with an implicit "en" sub-namespace. Long term no index should have any red links, but when that happens we might as well all pack up and do something else because Wiktionary will have been completed. Indexes and categories are where the top-down and bottom-up perspectives on sorting data interface. If you're thinking of Wikipedia instead of Wiktionary read "lists" instead of "indexes". The interface is not perfect until the two give the same result. Thus far this only happens on close-ended lists like "Days of the week" (at least as we commonly understand that term in English; who knows what would happen if someone invented a metric calendar).

If you're responding to me then your logic is a bit backwards. I never said that indexes shouldn't have red links; that was SemperBlotto's suggestion for this index. I said I don't like the idea precisely for the reasons you gave: a page with no red links isn't an index (unless the project is complete in the ultra-long-term, if that's even possible).

My vote for what to do with this page isn't very strong. It seems like what's there now isn't very useful, so I would like to see it either deleted or changed under some proposal, as per anyone, even SemperBlotto if nothing else can be agreed to.

By the way, is there any word that shouldn't be somewhere in some subcategory of Category:English language? At minimum I would think they all at least have a part of speech. Remember I'm talking long-term. Maybe I should have said that "a list of existing words is what Category:English language will be for." DAVilla 14:28, 1 September 2006 (UTC)

While I've always been happy to shoot barbs at Connel I wasn't aware that I had done so in this thread. I've consistently supported the Gutenberg concordance, and feel that an expansion of that has enormous possibilities. I also am well aware of the tremendous amount of work that would be needed to have that doing what I would want it to do using statistical techniques. At some point I believe that we should have an article on every word in that Gutenberg corpus, but there is no priority to such depth. The three dictionary merger may be difficult, but not because of copyright. The presence of a copyright notice is not what makes something copyright. If that right exists it does so with or without the notice. If you read the copyright notice attached to the Gutenberg Webster you will see that it is clearly stated that it is in the public domain; their copyright applies only to the "small print". Copyright in dictionaries, even modern ones, is trickier than for other writings because most individual entries may not be copyrightable. This is partly because much of what is contained repeats earlier editions which have since gone into the public domain, and partly because of the merger principle, which states that if the information can only be expressed in one way that expression is not copyrightable. See http://www.law.pitt.edu/madison/copyright/supplement/lexmark_v_static_control.htm Eclecticology 08:46, 1 September 2006 (UTC)

Having given this a little more thought, I think we'd be served well by having three (or more) separate English indexes. Index:English, Index:English (including forms and idioms) and Index:English (with redlinks). More suitable names could be chosen, but I think you get my drift; use Wikipedia style disambiguation to separate the lists. The first would not have "plural of" entries, hyphenated entries, entries with punctuation, idioms nor redlinks...just proper entries one would expect to find in a basic dictionary. --Connel MacKenzie 19:09, 5 September 2006 (UTC)

My thanks to Kipmaster for implementing this change. Well done. SemperBlotto 16:28, 17 September 2006 (UTC)

Nouns used as adjectives: policy / format.

English is quite wanton about using nouns as adjectives, for example beer parlor. In the spirit of multilingualism, there should be a standard way to place the adjectival usage as a place to anchor translations of the adjectival usage. If there is no adjectival usage or it is otherwise non-standard to mark this (e.g. zero takes both plural and singular nouns). Rmo13 01:54, 31 August 2006 (UTC)

Not everything that modifies a noun is necessarily an adjective. --Ptcamn 02:26, 31 August 2006 (UTC)

That nouns may be used attributively is a morphological fact of English nouns. It does not make them adjectives. It makes them attributive nouns. Theoretically any noun may be used attributively, but in practice some are much more common as such. This may warrant a usage note on the noun. --sanna 07:26, 31 August 2006 (UTC)

This is a simplification. What part of speech a word belongs to is largely an artificial tradition, having been invented by grammarians analysing languages a long time ago. Morphology plays no part in the matter and deals with how individual words are built from parts and how they are inflected. If you take how a word is used to be primary and what category it belongs to as secondary, calling the use adjectival is accurate. If you take the category as primary only then does it seem to make sense to say some particular word is never an adjective. Using this kind of thinking just doesn't work at all for many languages including Chinese and Polynesian languages where individual words just don't have any meaningful part of speech category until they are within the context of a sentence. For English the case also is fuzzier than you expect. — Hippietrail 07:44, 31 August 2006 (UTC)

Actually, many linguists believe that parts of speech are in fact a very accurate representation of the way the human brain processes language (Steven Pinker for one). But that is beside the point. I don't think we need to list every noun's attributive use as ‘place to anchor translations’, because this is a function not of vocabulary but of grammar, and as such does not belong here but on Wikibooks or something. E.g. if you have a compound noun in English XY, a French-speaker knows that it will generally be rendered as Y de X. These are grammatical details and not to do with translations of words. Widsith 08:18, 31 August 2006 (UTC)

Except that it is very usual that an English attributive noun is translated as an adjective in other languages. Dealing with clear indications of how to translate terms in all contexts is a major function of Wiktionary and it does need to be addressed. — Hippietrail 08:33, 31 August 2006 (UTC)

See also the (inconclusive) "unnecessary adjective senses" thread above.

But in general I think we have this situation pretty well in hand. Since the adjectival uses of nouns are usually pretty specific and idiosyncratic (is this true?), and since we're pretty liberal in having whole separate entries for compound words and set phrases such as beer parlour and sister city, we can always list the appropriate translations and other information there. And we're also pretty good at linking from base words such as beer and sister to derived terms involving them such as beer parlour and sister city (although as usual we can never quite decide what to call the section containing those links). —scs 13:36, 31 August 2006 (UTC)

All of this tends to support my view that the Category:English nouns should not be a part of the inflection template. Maybe that category shouldn't even exist at all. The part of speech is often determined by the usage; it is not inherent in its inflections. In theory any root form can be any of the major parts of speech (noun, verb, adjective, and possibly adverb) by simply varying the inflections. If a word is ordinarily viewed as being a noun, and there is no record in the corpus of its being used as a verb or adjective, I can still use it as an adjective or verb and a reader can understand it. Grammatical purists have objected to using impact as a verb, but it works; the hearer understands perfectly well when that happens. He also understands what it means when I use it adjectivally in "impact statement". As languages mature they simplify and become more syntactic. This is most evident in Chinese, and only somewhat less so in English. Chinese has been stable for a very long time.

Someone added the inflections "more Parisian" and "most Parisian" to the adjective "Parisian". This is structurally correct, but semantically questionable. I don't think I have ever seen "Parisian" used as a verb, but if I said, "The interior designer Parisianed the apartment," a certain impression would be conveyed. I can take this further and devise a sentence that gives meaning to "more Parisianed". Maybe we should look into the possibility of not using parts of speech at all! Eclecticology 07:20, 1 September 2006 (UTC)

You're joking about that last bit, right? I'm one who loves all manner of verbing nouns and nounificating verbs and engaging in nounificatory behavior with adjectives, but I think those hoary old part-of-speech tags still have some benefit (if only so that we can talk about the liberties we're taking with them...) —scs

Why should I be joking about it? Hippietrail hit upon a very interesting point about linguistics, that goes beyond what is merely convenient or conventional. Our concept of parts of speech is rooted in the thinking of Renaissance grammarians who saw Greco-Latin structures as the ideal basis for all language structures, and so proceeded to impose those structures on other languages. It would be impractical to suddenly say we're going to stop using parts of speech, but we should at least be open to that eventual possibility. Eclecticology 16:53, 1 September 2006 (UTC)

On this specific question, I haven't heard it in English, but trop Parisien became common in the 1990s as a description of surly, snooty service, which the French (and many others of us) associate with Paris, rather than France generally. --Enginear 10:05, 2 September 2006 (UTC)

This is a very interesting question. Would it be more appropriate to use language headers in the linguistic senses rather than the ones we know in English or developed in the history of another language? While probably superior in a theoretical framework, unfortunately I don't think this approach is practical for either contributors or dictionary users as being needlessly duplicative and simply overwhelming, aside from parting from traditional dictionary style which is only possible with massive momentum. However, we shouldn't be dismissive of some day incorporating this sort of classification.

Another question is how low we've set the bar. Right now there's a tremendous concentration on drawing the boundary between words that are accepted or not, rather than on senses like the use of amazing as a noun or some of the examples I've recently (re-)raised in the Tea room. Eventually it may be questions along the lines of EC's examples that are more commonly put to RfV, and I have a suspicion that many of them will pass quite easily. DAVilla 18:07, 1 September 2006 (UTC)

The question is also relevant to Latin (which is supposed to be the root of all this wrangling). In theory, any Latin adjective may be used as a substantive (as a noun), though the word retains at least some of the inflection of an adjective. Likewise, Latin verbs have participle forms that are used as adjectives and gerundive forms that are used as nouns. For English, I would tend to prefer the headers of Adjective and Noun for these participle and gerund cases (with an additional Verb/participle form header), but in Latin this becomes more of a headache. In Latin, I think I would prefer to list adjectival verb forms with Participle as the part of speech, since the inflective grammar and meaning lie in the verb. I haven't decided yet how substantive senses of Adjectives in Latin ought to be listed, since calling them Nouns makes misleading assertions about the inflection, but the substantive sense may have subtle distinctions from the adjective sense that need to be explained. Arrgh! --EncycloPetey 20:22, 4 September 2006 (UTC)