Module:he-translit/old

The following documentation is located at Module:he-translit/old/documentation. Categories were auto-generated by Module:module categorization.


	This module is still being disputed.
	Do not use this module. This module was designed to follow WT:HE TR, which does not have consensus among the Hebrew editors; moreover, it does not follow WT:HE TR strictly.

This module will transliterate Hebrew language text per WT:HE TR.

The module should preferably not be called directly from templates or other modules. To use it from a template, use {{xlit}}. Within a module, use Module:languages#Language:transliterate.

For testcases, see Module:he-translit/old/testcases.

Functions

tr(text, lang, sc): Transliterates a given piece of text written in the script specified by the code sc and in the language specified by the code lang. Returns nil when the transliteration fails.
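
Because the documented contract allows tr to return nil, a caller may want a fallback. The following is only a minimal sketch in plain Lua: stub_tr and transliterate_or_default are hypothetical names invented for illustration, and stub_tr is a stand-in that does not perform any real transliteration.

```lua
-- Hypothetical stand-in that follows the documented contract:
-- it returns a string on success and nil on failure.
local function stub_tr(text, lang, sc)
	if sc ~= "Hebr" or text == "" then
		return nil -- signal failure
	end
	return "<" .. text .. ">" -- placeholder output, not a real transliteration
end

-- Hypothetical helper: call a transliteration function and
-- substitute a default value when it reports failure with nil.
local function transliterate_or_default(tr, text, lang, sc, default)
	local result = tr(text, lang, sc)
	if result == nil then
		return default
	end
	return result
end

print(transliterate_or_default(stub_tr, "abc", "he", "Hebr", "?"))
print(transliterate_or_default(stub_tr, "abc", "he", "Latn", "?"))
```

Passing the transliteration function as an argument keeps the fallback logic independent of which module actually supplies it.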

Test cases

11 of 150 tests failed.

test_biblical:
	Text	Expected	Actual	Differs at	Comments
	בַּיִת	bayiṯ	bayiṯ
	בֵּית	bēṯ	bēṯ
	עַכּוֹ‎		ʿakkō‎	1
	בָּתִּים	bāttīm	bāttīm
	מַחֲנֶה	maḥăne	maḥăne
	בָּרָא	bārā	bārā
	רֶגֶל	reḡel	reḡel
	כֹּהֵן	kōhēn	kōhēn
	מֶלֶךְ	meleḵ	meleḵ
	מַמְלָכָה	mamlāḵā	mamlāḵā
	הַמַּמְלָכָה	hammamlāḵā	hammamlāḵā
	הַלְּלוּיָהּ	halləlūyāh	halləlūyāh
	הַלְלוּיָהּ	haləlūyāh	haləlūyāh
	יָדַע	yāḏaʿ	yāḏaʿ
	שָׁבוּעַ	šāḇūaʿ	šāḇūaʿ
	רוּחַ	rūaḥ	rūaḥ
	גָּבֹהַּ	gāḇōah	gāḇōah
	מָשִׁיחַ	māšīaḥ	māšīaḥ
	רֵיחַ	rēaḥ	rēaḥ
	שָׂדֶה	śāḏe	śāḏe
	שְׂדֵה	śəḏē	śəḏē
	בָּנַי	bānay	bānay
	בְּנֵי	bənē	bənē
	צָרְכִּי	ṣorkī	ṣorkī
	חָכְמָה	ḥāḵəmā	ḥāḵəmā		ambiguous case: could be ḥāḵəmā or ḥoḵmā, but I think ḥāḵəmā is the preferred default
	שִׁפְרָה	šip̄rā	šip̄rā
	שָׁכְבְּךָ	šoḵbəḵā	šoḵbəḵā
	הָפְכָּה	hop̄kā	hop̄kā		made-up word, but a particular potentially problematic Unicode situation
	קָטְבּוֹ	qoṭbō	qoṭbō		another particular potentially problematic Unicode situation
	נִשְׂרְפָה	niśrəp̄ā	niśrəp̄ā
	בָּנָיו	bānāw	bānāw
	בָּנֶיהָ	bānehā	bānehā
	מִצְוֹת	miṣwōṯ	miṣwōṯ
	זִוּוּג	ziwwūḡ	ziwwūḡ
	רֹאשׁ	rōš	rōš
	רֵאשִׁית	rēšīṯ	rēšīṯ
	רִאשׁוֹן	rīšōn	rīšōn
	מְלָאכָה	məlāḵā	məlāḵā
	מְלֶאכֶת	məleḵeṯ	məleḵeṯ
	חֵטְא	ḥēṭ	ḥēṭ
	בָּרָאתָ	bārāṯā	bārāṯā
	חַטֹּאות	ḥaṭṭōṯ	ḥaṭṭōṯ
	יְראוּ	yərū	yərū
	וַיֶּאְסֹר	wayyeʾsōr	wayyeʾsōr
	הָחְלַט	hoḥlaṭ	hoḥlaṭ
	וַיֵּבְךְּ	wayyēḇk	wayyēḇk
	אַרְאֶךָּ	ʾarʾekkā	ʾarʾekkā
	וַיַּשְׁקְ	wayyašq	wayyašq
	אַתְּ	ʾatt	ʾatt
	וּוָווֹ	ūwāwō	ūwāwō
	וָו	wāw	wāw
	תָּו	tāw	tāw
	קַו	qaw	qaw
	לָאו	lāw	lāw
	חַי	ḥay	ḥay
	חָי	ḥāy	ḥāy		pausal
	פִּיו	pīw	pīw
	כִּסְלֵו	kislēw	kislēw
	גּוֹי	gōy	gōy
	גֹּי	gōy	gōy
	גֹּיִים	gōyīm	gōyīm
	רָאוּי	rāʾūy	rāʾūy
	קִיא	qī	qī
	יָבִיאוּ	yāḇīʾū	yāḇīū	5
	יְבִיאוּן	yəḇīʾūn	yəḇīūn	5
	מֵאוּן	mēʾūn	mēʾūn
	מֵיאוּן	mēʾūn	mēyūn	3
	בּוֹאוּ	bōʾū	bōʾū
	בֹּאוּ	bōʾū	bōʾū
	בּוּאוּ	būʾū	būʾū		made-up word, but may help identify the issue
	אָבִיאָה	ʾāḇīʾā	ʾāḇīʾā
	מֵאָה	mēʾā	mēʾā
	גֵּיאָהּ	gēʾāh	gēʾāh
	אָבוֹאָה	ʾāḇōʾā	ʾāḇōʾā
	אָבֹאָה	ʾāḇōʾā	ʾāḇōʾā
	נְשׂוּאָה	nəśūʾā	nəśūʾā
	קִיאוֹ	qīʾō	qīō	3
	גֵּאוֹ	gēʾō	gēʾō
	גֵּיאוֹ	gēʾō	gēʾō
	בּוֹאוֹ	bōʾō	bōʾō
	בֹּאוֹ	bōʾō	bōʾō
	מִלּוּאוֹ	millūʾō	millūʾō
	מִי	mī	mī
	אִיִּים	ʾiyyīm	ʾiyyīm
	אִיּוֹב	ʾiyyōḇ	ʾiyyōḇ
	אִיּוּן	ʾiyyūn	ʾiyyūn
	אַיִן	ʾayin	ʾayin
	בּוֹא	bō	bō
	יְפֵהפֶה	yəp̄ēp̄e	yəp̄ēp̄e
	אֹהֶל	ʾōhel	ʾōhel
	הָאֹהֱלָה	hāʾōhĕlā	hāʾōhĕlā
	אָהֳלוֹ	ʾohŏlō	ʾāhŏlō	2
	אָהָלְךָ	ʾoholəḵā	ʾāhāləḵā	2
	יִשָּׂשכָר	yiśśāḵār	yiśśāḵār		Still undecided if this actually needs to be handled
	הוֹשִׁיעָה נָּא	hōšīʿā nnā	hōšīʿā nnā
	עַד בֹּאֲךָ	ʿaḏ bōʾăḵā	ʿaḏ bōʾăḵā
	וַיַּשְׁקְ אֶת הַצֹּאן	wayyašq ʾeṯ haṣṣōn	wayyašq ʾeṯ haṣṣōn
	בְּנֵי בְרָק	bənē ḇərāq	bənē ḇərāq
	בְרָק	ḇərāq	ḇərāq
	אִישׁ יְהוּדִי הָיָה בְּשׁוּשַׁן הַבִּירָה וּשְׁמוֹ מָרְדֳּכַי בֶּן יָאִיר בֶּן־שִׁמְעִי בֶּן־קִישׁ אִישׁ יְמִינִי׃	ʾīš yəhūḏī hāyā bəšūšan habbīrā ūšəmō mordŏḵay ben yāʾīr ben-šimʿī ben-qīš ʾīš yəmīnī.	ʾīš yəhūḏī hāyā bəšūšan habbīrā ūšəmō mordŏḵay ben yāʾīr ben-šimʿī ben-qīš ʾīš yəmīnī.
	אִ֣ישׁ יְהוּדִ֔י הָיָ֖ה בְּשׁוּשַׁ֣ן הַבִּירָ֑ה וּשְׁמ֣וֹ מָרְדֳּכַ֗י בֶּ֣ן יָאִ֧יר בֶּן־שִׁמְעִ֛י בֶּן־קִ֖ישׁ אִ֥ישׁ יְמִינִֽי׃	ʾīš yəhūḏī hāyā bəšūšan habbīrā ūšəmō mordŏḵay ben yāʾīr ben-šimʿī ben-qīš ʾīš yəmīnī.	ʾi֣yš yəhūḏi֔y hāyā֖h bəšūša֣n habbīrā֑h ūšəm֣ō mordŏḵa֗y be֣n yāʾi֧yr ben-šimʿi֛y ben-qi֖yš ʾi֥yš yəmīniֽy.	2	fully accented verse; stress should not be indicated in the final syllable
	וַיְהִי הַמַּבּוּל אַרְבָּעִים יוֹם עַל־הָאָרֶץ וַיִּרְבּוּ הַמַּיִם וַיִּשְׂאוּ אֶת־הַתֵּבָה וַתָּרָם מֵעַל הָאָרֶץ׃	wayəhī hammabbūl ʾarbāʿīm yōm ʿal-hāʾā́reṣ wayyirbū hammáyim wayyiśəʾū ʾeṯ-hattēḇā wattā́rom mēʿal hāʾāreṣ.	wayhī hammabbūl ʾarbāʿīm yōm ʿal-hāʾāreṣ wayyirbū hammayim wayyiśʾū ʾeṯ-hattēḇā wattārām mēʿal hāʾāreṣ.	4	a reminder of why this is hard
	וַיְהִ֧י הַמַּבּ֛וּל אַרְבָּעִ֥ים י֖וֹם עַל־הָאָ֑רֶץ וַיִּרְבּ֣וּ הַמַּ֗יִם וַיִּשְׂאוּ֙ אֶת־הַתֵּבָ֔ה וַתָּ֖רָם מֵעַ֥ל הָאָֽרֶץ׃	wayəhī hammabbūl ʾarbāʿīm yōm ʿal-hāʾā́reṣ wayyirbū hammáyim wayyiśəʾū ʾeṯ-hattēḇā wattā́rom mēʿal hāʾāreṣ.	wayhi֧y hammabb֛ūl ʾarbāʿi֥ym y֖ōm ʿal-hāʾā֑reṣ wayyirb֣ū hamma֗yim wayyiśʾū֙ ʾeṯ-hattēḇā֔h wattā֖rām mēʿa֥l hāʾāֽreṣ.	4	fully accented verse version of the above
implicit ktiv/qre that would be nice to have
	הִוא	hī	hī
	יְרוּשָׁלִַם	yərūšālayim	yərūšālayim
	יְרוּשָׁלִָם	yərūšālāyim	yərūšālāyim		pausal form
	יְרוּשָׁלֲמָה	yərūšālaymā	yərūšālaymā
	יְרוּשָׁלֳמָה	yərūšālāymā	yərūšālāymā
ktiv male tests
	חַיָּיב	ḥayyāḇ	ḥayyāḇ
	חַוָּוה	ḥawwā	ḥawwā
	הֱוֵוה	hĕwē	hĕwē
	הַיְינוּ	haynū	haynū
	הִתְכַּוְּונוּ	hiṯkawwənū	hiṯkawwənū
	גַּוְונָא	gawnā	gawnā
	מְייוּחָד	məyūḥāḏ	məyūḥāḏ		there is no way to tell that it really should be məyuḥāḏ, but anyway this test is for the double yod
	כְּדַאי	kəḏay	kəḏay
	כּוּלָּם	kullām	kullām		shuruk does not necessarily imply a long vowel
	קִידּוּשׁ	qiddūš	qiddūš		chiriq male does not necessarily imply a long vowel

test_translit_hebrew:
Text	Expected	Actual	Differs at	Comments
מַקְלֵעַ	maklea'	maklea'
אַבְּסוּרְד	'ab'sur'd	'ab'sur'd		not sure about what should be expected here
בִּיּוֹמֶטְרִיָּה	biyometriya	biyometriya
קַפְרִיסִין	kafrisin	kafrisin
חֹרֶף	khoref	khoref
טוּרְקִיז	turkiz	tur'kiz	4
טַחַב	takhav	takhav
יִוָּלֵד	yivaled	yivaled
יָקִינְתּוֹן	yakinton	yakinton
כֻּתְנָה	kutna	kutna
נַגָּרִיָּה	nagariya	nagariya
נַעֲלֶה	na'ale	na'ale
מִצְווֹת	mitsvot	mitsvot
מָקוֹם	makom	makom
פֶּרוּאָנִי	peru'ani	peru'ani
צִדְפָּה	tsidpa	tsidpa
תׇּכְנָה	tokhna	tokhna
רְאוּ	r'u	r'u
גּ׳וּק	juk	juk
ג׳וּק	juk	juk
גִּ׳ירָאפָה	jirafa	jirafa
גִ׳ירָאפָה	jirafa	jirafa
זַ׳רְגוֹן	zhargon	zhargon
קַפּוּצִ׳ינוֹ	kapuchino	kapuchino
סְקוֹץ׳	s'koch	s'koch
סְתוֹם תַּ׳פֶּה	s'tom ta′pe	s'tom ta′pe
אִמָּא׳לֶה	'ima′le	'ima′le
חָזָ״ל	khaza″l	khaza″l
נַחַ״ל	nakha″l	nakha″l
רה״מ	rh″m	rh″m
ב״ה	b″h	b″h
ת״א	t″'	t″'
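
The "Differs at" column in the tables above appears to give the 1-based position of the first character at which Expected and Actual diverge (for example, turkiz against tur'kiz differs at 4). How the testcases module actually computes this is not shown here, so the following is only a sketch under that assumption. utf8_chars and first_difference are hypothetical names, and the code is plain Lua that counts UTF-8 characters rather than bytes.

```lua
-- Split a UTF-8 string into a list of characters (one substring per code point).
local function utf8_chars(s)
	local chars = {}
	-- lead byte (ASCII or multibyte starter) followed by any continuation bytes
	for ch in s:gmatch("[\1-\127\194-\244][\128-\191]*") do
		chars[#chars + 1] = ch
	end
	return chars
end

-- Return the 1-based index of the first differing character,
-- or nil when the two strings are identical.
local function first_difference(expected, actual)
	local e, a = utf8_chars(expected), utf8_chars(actual)
	for i = 1, math.max(#e, #a) do
		if e[i] ~= a[i] then
			return i
		end
	end
	return nil
end

print(first_difference("turkiz", "tur'kiz"))
print(first_difference("", "abc"))
```

Counting characters instead of bytes matters for transliterated output, since letters such as ā occupy two bytes in UTF-8 and a byte-based index would be shifted.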

local export = {}
local U = require("Module:string/char")
local gsub = mw.ustring.gsub

--[[
-- Uncomment this to redefine gsub so that it prints to the Lua log
-- the names of the code points in the replacements it's making.
local function print_code_point_names(text)
	if not text then return "" end
	local names = require "Module:array"()
	for cp in mw.ustring.gcodepoint(text) do
		names:insert(require "Module:Unicode data".lookup_name(cp))
	end
	return names:concat ", "
end

local actual_gsub = mw.ustring.gsub
local gsub = function(...)
	local old, pattern, repl = ...
	local new, count = actual_gsub(...)
	if old ~= new then
		mw.log(table.concat({
			print_code_point_names(old),
			print_code_point_names(new),
			pattern,
			tostring(repl)
		}, "\n") .. "\n")
	end
	return new, count
end
--]]

local sheva = U(0x05B0)
local hataf_segol = U(0x05B1)
local hataf_patah = U(0x05B2)
local hataf_qamats = U(0x05B3)
local hiriq = U(0x05B4)
local tsere = U(0x05B5)
local segol = U(0x05B6)
local patah = U(0x05B7)
local qamats = U(0x05B8)
local qamats_qatan = U(0x05C7)
local holam = U(0x05B9)
local holam_haser_for_waw = U(0x05BA)
local qubuts = U(0x05BB)
local dagesh_mappiq = U(0x05BC)
local shin_dot = U(0x05C1)
local sin_dot = U(0x05C2)

local macron_above = U(0x0304)
local macron_below = U(0x0331)
local macron = "[" .. macron_above .. macron_below .. "]"

local alef = "א"
local he = "ה"
local waw = "ו"
local yod = "י"
local vowel_letters = alef .. he .. waw .. yod
local vowel_letter = "[" .. vowel_letters .. "]"

-- '0' represents silent sheva
local vowel_points = (
	sheva .. hataf_segol .. hataf_patah .. hataf_qamats .. hiriq .. tsere ..
	segol .. patah .. qamats .. qamats_qatan .. holam .. qubuts .. '0' ..
	holam_haser_for_waw
)
local vowel_point = "[" .. vowel_points .. "]"
local short_vowels = segol .. patah .. hiriq .. qubuts .. qamats_qatan
local short_vowel = "[" .. short_vowels .. "]"

local shuruq = waw .. dagesh_mappiq
local holam_male = waw .. holam

-- use dummy characters that do not match as punctuation
-- the dummy letter stands in for final silent alef or he, or for the hiatus before a furtive patah,
-- or comes before a pre-transliterated waw to aid in matching
local dummy_letter = U(0x0627) -- ARABIC LETTER ALEF
local dummy_geresh = U(0x064E) -- ARABIC FATHA
local dummy_gershayim = U(0x064B) -- ARABIC FATHATAN
local real_geresh = '׳'
local real_gershayim = '״'
local letter_modifier = "??"
local letters = "אבגדהוזחטיכךלמםנןסעפףצץקרשת"
local letter = "" .. letter_modifier
local letter_not_waw = "" .. letter_modifier
local gutturals = "אהחע"
local guttural = "[" .. gutturals .. "]"

local vowel_letter_or_geresh = ""

-- note, the geresh and gershayim are included in this, which is why dummies are used in their place
local word_break_chars = "%s%p"
local word_break = "[" .. word_break_chars .. "]"
local word_start = "%f[^" .. word_break_chars .. "]" -- matches the boundary but not the actual word break characters
local word_end = "%f[" .. word_break_chars .. "]" -- matches the boundary but not the actual word break characters

local tr_vowels = "aeiouāēīōūəăĕŏ0"

local biblical_to_modern = {
	 = '\'',
	 = 'v',
	 = 'g',
	 = 'd',
	 = 'v',
	 = 'zh',
	 = 'kh',
	 = 't',
	 = 'kh',
	 = '\'',
	 = 'f',
	 = 'ts',
	 = 'ch',
	 = 'k',
	 = 'sh',
	 = 's',
	 = 't',

	 = '\'',
	 = 'e',
	 = 'a',
	 = 'o',
	 = 'i',
	 = 'e',
	 = 'a',
	 = 'o',
	 = 'u',
}

-- helper function to remove vowel letters but keep gereshes
local function gereshes(str)
	return gsub(str, vowel_letter, '')
end

local biblical = {
	{
		-- replace geresh and gershayim with their dummy equivalents so that they won't match as word boundaries
		 = dummy_geresh,
		 = dummy_gershayim,
	},

	{
		-- The default order is: consonant, vowel point, dagesh or mappiq, shin or sin dot.
		-- The desired order is: consonant, shin or sin dot, dagesh or mappiq, vowel point.
		-- Also, move geresh and gershayim closer to the letter for easier handling (will be moved back later if not actually a modifier)
		)(" .. vowel_point .. "*)(" .. dagesh_mappiq .. "*)(*)(*)"] = "%1%4%5%3%2",
	},

	{
		-- special case: change qamats in כל to qamats qatan
		-- the problem is that כל might be preceded by prefixed clitics, which may be chained indefinitely,
		-- while other unrelated words might happen to end in כל with a qamats gadol; therefore, match either
		-- the entire word or only when preceded by a precisely recognized prefix
		 = "%1" .. qamats_qatan .. "%2",
		" .. dagesh_mappiq .. "?" .. patah .. "כ" .. dagesh_mappiq .. ")" .. qamats .. "(ל)" .. word_end] = "%1" .. qamats_qatan .. "%2",
		 = "%1" .. qamats_qatan .. "%2",
		כ" .. dagesh_mappiq .. ")" .. qamats .. "(ל)" .. word_end] = "%1" .. qamats_qatan .. "%2", -- patah is very archaic
		" .. dagesh_mappiq .. "?" .. sheva .. "כ)" .. qamats .. "(ל)" .. word_end] = "%1" .. qamats_qatan .. "%2",
	},

	{
		-- remove final alef and he, but only when preceded by a vowel
		" .. word_end] = "%1" .. dummy_letter,
		" .. word_end] = "%1" .. dummy_letter,
	},

	{
		-- these are the cases, other than the above, where a final letter should be ignored
		" .. word_end] = "ī",
		)" .. vowel_letter_or_geresh .. "-" .. word_end] = "%1",
		)" .. vowel_letter_or_geresh .. "-" .. word_end] = "%1",
	},

	{
		 = "0%1" .. sheva, -- two shevas in a row
		 = "%10", -- after a short vowel, assume(!) a silent sheva
		 = "%10", -- gutturals cannot have a vocal sheva

		 = "%1" .. dummy_letter .. "ww", -- when waw + dagesh is not a shuruq
		 = "%1" .. dummy_letter .. "ww%2", -- when waw + dagesh is not a shuruq
		 = "%1" .. dummy_letter .. "w" .. holam, -- when waw + holam is not a holam male

		)" .. dagesh_mappiq] = "%1", -- handle mappiq (very rarely occurs on an alef)
	},

	{
		 = shuruq .. "ww", -- another potential case when waw + dagesh is not a shuruq
		 = shuruq .. "w" .. holam, -- another potential case when waw + holam is not a holam male

		-- tentatively lengthen hiriqs with vowel letters
		 = function(vlg, l) return "ī" .. gereshes(vlg) .. l end,

		-- rearrange furtive patach (mappiq should already have been removed, but handle it just in case)
		 = dummy_letter .. "a%1",
	},

	{
		-- remove vowel letters
		 = function(l, vlg) return l .. gereshes(vlg) .. shuruq end,
		 = function(vlg, l) return shuruq .. gereshes(vlg) .. l end,
		)"] = function(vlg, l) return shuruq .. gereshes(vlg) .. l end,
		 = function(vp, vlg, l) return vp .. gereshes(vlg) .. l end,
		)"] = function(vp, vlg, l) return vp .. gereshes(vlg) .. l end,
	},

	{
		-- handle two-character combinations first
		 = 'j',
		 = 'ž',
		' .. dummy_geresh] = 'č',
		 = 'š',
		 = 'ś',
	},

	{
		 = 'ʾ',
		 = 'b' .. macron_below,
		 = 'g' .. macron_above,
		 = 'd' .. macron_below,
		 = 'h',
		 = 'z',
		 = 'ḥ',
		 = 'ṭ',
		 = 'y',
		'] = 'k' .. macron_below,
		 = 'l',
		'] = 'm',
		'] = 'n',
		 = 's',
		 = 'ʿ',
		'] = 'p' .. macron_above,
		'] = 'ṣ',
		 = 'q',
		 = 'r',
		 = 't' .. macron_below,
	},

	{
		)' .. macron .. '?' .. dagesh_mappiq] = '%1', -- assume(!) dagesh qal at the beginning of a word
		()' .. macron .. '?' .. dagesh_mappiq] = '0%1', -- dagesh qal after sheva, and assume(!) silent sheva
		 = '%1' .. sheva .. '%1', -- vocal sheva between identical consonants
		 = 'ū',
	},

	{
		-- restore geresh and gershayim order
		)(" .. dagesh_mappiq .. "*)(" .. vowel_point .. "*)"] = "%2%3%1",
	},

	{
		-- handle ירושלם
		 = "ayi", -- in this case, the vowels are reversed by Unicode normalization rules
		 = "ayi", -- just in case they're in the correct order
		 = "āyi", -- pausal form of above
		 = "āyi", -- as above
		-- handle ירושלמה
		" .. patah] = "ay", -- in this case, the vowels are reversed by Unicode normalization rules
		"] = "ay", -- just in case they're in the correct order
		" .. qamats] = "āy", -- pausal form of above
		"] = "āy", -- as above
	},

	{
		 = 'ə',
		 = 'ĕ',
		 = 'ă',
		 = 'ŏ',
		 = 'i',
		 = 'ē',
		 = 'e',
		 = 'a',
		 = 'ā',
		 = 'o',
		 = 'u',
		 = '',
		 = '',
		 = 'ō',
		 = 'wō',
	},

	{
		 = '%1%1', -- gemination
	},

	{
		(k' .. macron_below .. ')'] = '%1%2', -- special case for יששכר
	},

	{
		 = 'o%1', -- assume(!) qamats qatan before silent sheva

		 = 'ō',
		 = 'w',
		 = 'š', -- assume(!) shin if no shin or sin dot
	},

	{
		-- handle bgdkpt letters in unvocalized words (such as acronyms)
		-" .. macron .. "-)" .. word_end] = function(w) return gsub(w, "()" .. macron, "%1") end
	},

	{
		"] = "",

		-- short vowels in non-final closed syllables (this rule should be expanded)
		 = "u%1%1",
		 = "i%1%1",
	},

	{
		 = "", -- final sheva is always silent

		 = '′',
		 = '″',
		 = '.', -- sof pasuq
		 = '-', -- maqaf
	},
}

function export.tr(text, lang, sc)
	-- default to modern for Hebrew, but not for other languages, such as Aramaic
	local modern = lang == "he"
	return export.biblical(text, modern)
end

function export.biblical(text, modern)
	-- decompose
	text = mw.ustring.toNFD(text)

	-- wrap with spaces to make initial and final replacements easier
	text = ' ' .. text .. ' '

	for _, replacements in ipairs(biblical) do
		for regex, replacement in pairs(replacements) do
			text = gsub(text, regex, replacement)
		end
	end

	-- unwrap spaces
	text = mw.ustring.match(text, "^ (.*) $")
	if text == nil then error("Something went wrong, wrapped spaces were deleted.") end

	-- must happen before recomposition
	if modern then
		text = gsub(text, "()%1", "%1")
		text = gsub(text, "" .. macron .. "?", function(x) return biblical_to_modern[x] or x end)
		text = gsub(text, "''", "'")
	end

	-- recompose
	text = mw.ustring.toNFC(text)

	return text
end

return export
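
The replacement loop in export.biblical walks the stages of the biblical table in order with ipairs, while the rules inside a single stage are visited with pairs, whose iteration order Lua does not define. Rules that depend on each other therefore have to sit in different stages. The toy sketch below illustrates this with plain string.gsub and ASCII rules. apply_stages and both rule tables are hypothetical and are not taken from the module.

```lua
-- Apply a list of rule tables in order; within one table the order is unspecified.
local function apply_stages(text, stages)
	for _, rules in ipairs(stages) do
		for pattern, replacement in pairs(rules) do
			text = text:gsub(pattern, replacement)
		end
	end
	return text
end

-- Hypothetical toy rules: digraphs must be consumed before single letters.
local digraphs = { ["sh"] = "S", ["kh"] = "X" }
local singles = { ["s"] = "z", ["k"] = "q", ["h"] = "" }

print(apply_stages("shakh", { digraphs, singles })) -- digraph stage runs first
print(apply_stages("shakh", { singles, digraphs })) -- single letters destroy the digraphs
```

With the digraph stage first the input becomes SaX. With the stages swapped the single-letter rules rewrite s, k and h before the digraph rules can see them, and the result is zaq. Either way, the outcome does not depend on the order in which pairs visits the rules of one stage, because the rules within each toy stage do not overlap.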
