@Wyang do you think you can put Module:km-translit on your to-do list? :) User:Stephen G. Brown might also help. --Anatoli (обсудить/вклад) 14:45, 17 June 2014 (UTC)
@Stephen G. Brown, Atitarev, Octahedron80, Hippietrail, Nisetpdajsankha, វ័ណថារិទ្ធ, Judexvivorum
Hi all. As some of you may have noticed, I am currently working on the infrastructure for the automatic romanisation and pronunciation of Khmer on Wiktionary, as well as a revamp of the page Wiktionary:Khmer romanization. The aim of this is to achieve automatic Khmer romanisation on Wiktionary pages with no or very minimal need for manual input, similar to what we currently have for Thai. The module used as the backend for this will be Module:km-pron, to be developed similarly to the now-mature Module:th-pron.
The Khmer script is not very phonetic, so predicting the pronunciation purely from orthography is not feasible. The renowned Chuon Nath Dictionary and other dictionaries make use of phonetic respellings (though seemingly inconsistently) to indicate the pronunciation of irregular words, and this principle will also be used as the basis for this project. I think it is possible to accurately derive IPA pronunciations from respellings (provided the respellings are defined clearly), again in a fashion akin to the existing Thai infrastructure. One difference is that there are Thai dictionaries that systematically give a phonetic respelling for any word whose orthography is not already phonetic, and such dictionaries appear to be lacking for Khmer (or maybe there are Khmer dictionaries like this as well?). Some examples of phonetically respelt forms of Khmer words, and an attempt to syllabify based on respellings, can be found on Module:km-pron/testcases (in test_syllabify).
We also need to decide on our approach to Khmer romanisation, if we can achieve automatic pronunciation on Khmer entries. At present the Khmer romanisations are quite heterogeneous: this romanisation guide appears to recommend the UN system, Stephen has been using an IPA-derived romanisation, and some entries use other systems or have no romanisation. Technically, it seems none of the commonly used schemes (UN, Geographical, BGN/PCGN, ALA) can be fully automated, as they all rely on both the orthography and the pronunciation of the word simultaneously, to varying extents. Several possibilities exist for romanisation, IMO:
Transcription, open to feedback of course. The template {{km-IPA}} will be added to all Khmer entries, and the romanisation will be extracted from the template call on the entry, exactly like Thai {{th-pron}}. Such a transcription system could again be applied in various manners, for example a Thai-like structure, which attempts to romanise any Khmer word by extracting its phonetic respelling and transcription from its entry and, if successful, uses that transcription; if unsuccessful: nil, and display no romanisation; or
If we think this approach for pronunciation and romanisation is worth adopting, the pertinent tasks for the time being would include arranging and codifying the respelling–pronunciation correspondences in the form of tables on Wiktionary:Khmer romanization, and after that is complete, implementing those rules in Module:km-pron to achieve the automatic conversion.
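To make the Thai-like lookup structure described above a bit more concrete, here is a rough Python sketch of the control flow only. It is not the actual Lua in Module:km-pron or Module:km-translit. The names `extract_respelling`, `romanise`, `entries`, `transcribe` and `fallback` are all hypothetical, the regular expression assumes a simplified {{km-IPA}} call syntax, and treating a call with no argument as "the headword is its own respelling" is an assumption too.

```python
import re
from typing import Callable, Mapping, Optional

# Simplified pattern for a {{km-IPA}} call with optional arguments (assumed syntax).
KM_IPA_CALL = re.compile(r"\{\{km-IPA((?:\|[^{}]*)?)\}\}")


def extract_respelling(term: str, wikitext: str) -> Optional[str]:
    """Return the first respelling given in the entry's {{km-IPA}} call.

    Returns None if the entry has no such call. If the call has no
    positional argument, the term itself is used (an assumption).
    """
    match = KM_IPA_CALL.search(wikitext)
    if match is None:
        return None
    args = [a.strip() for a in match.group(1).split("|")[1:]]
    positional = [a for a in args if a and "=" not in a]
    return positional[0] if positional else term


def romanise(
    term: str,
    entries: Mapping[str, str],
    transcribe: Callable[[str], Optional[str]],
    fallback: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Try the entry-based transcription first, then the fallback, else None."""
    wikitext = entries.get(term)
    if wikitext is not None:
        respelling = extract_respelling(term, wikitext)
        if respelling is not None:
            result = transcribe(respelling)
            if result is not None:
                return result
    # Unsuccessful: either show no romanisation (None) or use a fallback
    # such as an orthography-based transliteration, if one is supplied.
    return fallback(term) if fallback is not None else None
```

Here `entries` stands in for fetching the wikitext of the target page, `transcribe` for the respelling-to-transcription rules, and `fallback` for any automatic transliteration one might choose to keep.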
I apologise for this relatively long post. Any comments, suggestions, technical help, criticisms, moral support, etc. would be welcome. Thank you!
Wyang (talk) 10:33, 10 February 2018 (UTC)
{{km-IPA}} and module Module:km-pron, which were written based on the IPA and transcriptions on that page, have passed preliminary testing (Module:km-pron/testcases) and should be mostly ready for use. Examples of how the pronunciation template would look on Khmer entries can be found on the template page.
{{l}}, {{m}}, {{cog}}, etc.) and have headword-line templates display romanisations by extracting the transcription from the Khmer entries, like Thai. This will remove most, if not all, need for manual transcription/transliteration and greatly simplify our work. The terms that fail automatic romanisation can still use Module:km-translit as a fallback.
I just wanted to put in my two cents. I think this is great. It seems to work very well for Thai. (It would be great to see more work done for Lao by the way, which in theory has a nice logical orthography, but in practice turns out to have a ton of very difficult edge cases!)
It's over two years since I was in Cambodia learning Khmer and I've never found anyone to practice with since. I do have a self-study textbook, a small dictionary with some grammar, and a large pocket dictionary. I also have a Lonely Planet Southeast Asia Phrasebook with a Khmer section.
I have to say I vastly prefer the romanization system User:Stephen G. Brown has developed over the years to the ones in any of my books and to the one we've been auto-generating here for some time. My next preference is a simple phonemic IPA-based system. Not a too-narrow one full of diacritics.
I think we could aim at something between what we do for Chinese and Thai, and what we do for English pronunciation. We can support multiple "official" transcriptions that are automatically generated from the most detailed one, or from one we design ourselves if none of the others are sufficiently detailed, or straight from the IPA. In the case of English pronunciation transcription, we devised our own system to complement IPA since many people have an irrational dislike for IPA and since all American dictionaries use somewhat similar non-IPA systems. If we were to go down that path we could use Stephen's system, or develop something using that as a starting point.
In short I think Stephen's input here would be key. I've always been impressed by his knowledge and work here. Combining his skills with you guys' technical automation/scripting skills we could come up with something that is the best anywhere on the internet.
I'm really looking forward to seeing how this goes. (And please put Lao on the list for attention soon.) — hippietrail (talk) 00:11, 13 February 2018 (UTC)
{{km-IPA}} if the word is spelt with an independent vowel letter. Good point regarding the subscript forms too; I removed the subscript forms of the composite consonant letters. I was thinking something like sf- in Western loanwords, where the composite consonant could be written as a subscript, but such words probably do not exist at all.
test_transcript). It is largely consistent with the system you have been using for Khmer entries, with some small differences. Transcriptions of the vowels are mainly aligned with their IPA pronunciation, and I tried to minimise the use of vowels with diacritics in the system, with the exception of the three short diphthongs ĕə, ŭə and ŏə, which can be contrastive with the long diphthongs, e.g. uə vs. ŭə. This is not an urgent issue, as the module backend Module:km-pron can be easily modified if we would like to alter the transcriptions of certain consonants or vowels.
{{m|km|បរិយោសាន}} (current output: បរិយោសាន (paʼreyaosaan)) will make the template extract the romanisation paʾreʾyaosaan from the បរិយោសាន entry directly, rather than sending it to Module:km-translit for the auto-transliteration (to produce bârĭyoŭsan). This way we can ensure that the romanisation produced by the template is always correct. This is similar to how links to Thai terms, such as {{m|th|ประธานาธิบดี}} (output: ประธานาธิบดี (bprà-taa-naa-típ-bɔɔ-dii)), get their correct romanisations from the target entry (bprà-taa-naa-tí-bɔɔ-dii), rather than an auto-transliteration module.
{{km-IPA}} is desirable, I will start to apply the template on Khmer entries and gradually remove the existing manual transcriptions to make it entirely automatic like Thai. Wyang (talk) 09:40, 13 February 2018 (UTC)
I was going to add testcases for two Khmer words that come up early when you start learning the language that are very difficult for English speakers to pronounce: the words for I and delicious.
The former, ខ្ញុំ, is already included, but the latter, ឆ្ងាញ់, is not yet.
I see the format is different from the equivalent page for Lao. The format isn't difficult but as I'm very rusty with Khmer and the details of how we transliterate it here, I'm not sure what to put in all of the fields. Would it be of benefit to use ឆ្ងាញ់ as a short tutorial example on how to add a testcase either right here, or on the testcase page? By the way it would be beneficial if the testcase page had a direct link to the testcase data page so people can add testcases if they have language knowledge but not so much module/Lua/scripting/hacking knowledge. — hippietrail (talk) 00:27, 13 February 2018 (UTC)
(Notifying Stephen G. Brown, Wyang, Octahedron80): Hi. I am thinking of replacing instances of ំ with ំ (ំ) in Wiktionary:Khmer_romanization#Diacritics - just the first column. Any objections? They get left-aligned in the table, though. --Anatoli T. (обсудить/вклад) 11:47, 17 March 2018 (UTC)