Appendix:Unicode normalization

See also Unicode normalization considerations on the MediaWiki website.

Wikimedia, along with many other web services, stores Unicode strings in Normalization Form C (NFC), also called Normalization Form Canonical Composition. This means that several different Unicode strings are often mapped to the same canonical form. When you enter a Unicode string and save the page, it is automatically converted to the normalized form, so non-normalized strings cannot be saved in a Wiktionary page.
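As a minimal sketch of what such a save-time conversion does, the following uses Python's standard unicodedata module. The helper names to_nfc and is_nfc are illustrative only and are not part of MediaWiki.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return the NFC form of text, as a save step might."""
    return unicodedata.normalize("NFC", text)

def is_nfc(text: str) -> bool:
    """True if text is already in NFC (normalizing changes nothing)."""
    return to_nfc(text) == text

decomposed = "C\u0327"   # C followed by COMBINING CEDILLA
precomposed = "\u00C7"   # LATIN CAPITAL LETTER C WITH CEDILLA

# Two different code point sequences, one canonical form.
print(decomposed == precomposed)                   # the raw strings differ
print(to_nfc(decomposed) == to_nfc(precomposed))   # their NFC forms agree
```

Comparing strings only after normalizing both sides is the usual way to treat canonically equivalent input as equal.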

Equivalence

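Type of canonical equivalence | Alternate representation | NFC
Combining sequence | C + ◌̧ (U+0043 U+0327) | Ç (U+00C7)
Ordering of combining marks | q + ◌̇ + ◌̣ (U+0071 U+0307 U+0323) | q + ◌̣ + ◌̇ (U+0071 U+0323 U+0307)
Hangul | ᄀ + ᅡ (U+1100 U+1161) | 가 (U+AC00)
Singleton | Å (U+212B) | Å (U+00C5)
Hebrew | ל + ◌ָ + ◌ֽ + ◌ִ (U+05DC U+05B8 U+05BD U+05B4) | ל + ◌ִ + ◌ָ + ◌ֽ (U+05DC U+05B4 U+05B8 U+05BD)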
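The rows of the table can be checked mechanically. The sketch below assumes the code points given in the comments, which are one reading of the table's examples. The names ROWS and check_rows are illustrative.

```python
import unicodedata

# (alternate representation, expected NFC) for each kind of equivalence.
ROWS = {
    "combining sequence": ("C\u0327", "\u00C7"),                   # C + cedilla
    "ordering of combining marks": ("q\u0307\u0323", "q\u0323\u0307"),  # dot above, dot below
    "Hangul": ("\u1100\u1161", "\uAC00"),                          # choseong + jungseong
    "singleton": ("\u212B", "\u00C5"),                             # ANGSTROM SIGN
    "Hebrew": ("\u05DC\u05B8\u05BD\u05B4", "\u05DC\u05B4\u05B8\u05BD"),
}

def check_rows(rows):
    """Map each row name to whether NFC(alternate) equals the expected form."""
    return {
        name: unicodedata.normalize("NFC", alternate) == expected
        for name, (alternate, expected) in rows.items()
    }

for name, ok in check_rows(ROWS).items():
    print(f"{name}: {'matches' if ok else 'differs'}")
```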

Issues

Most of the time NFC makes processing text easier, but some oddities, both semantic and non-semantic, do appear. There are four cases in which a single character is not its own NFC form.

  1. Sometimes an alternative single character is the canonical composed form.
    Example: U+212B ( Å - ANGSTROM SIGN) is converted to U+00C5 ( Å - LATIN CAPITAL LETTER A WITH RING ABOVE)
  2. For some scripts, precomposed characters are not preferred.
    Example: U+0958 ( क़ - DEVANAGARI LETTER QA) is converted to the decomposed क़, which is U+0915 ( क - DEVANAGARI LETTER KA) + U+093C ( ़ - DEVANAGARI SIGN NUKTA).
  3. Where a decomposition exists in pre-Unicode 3.0 for a precomposed character added afterwards, the decomposition is preferred.
    Example: U+2ADC ( ⫝̸ - FORKING) is converted to ⫝̸, which is U+2ADD ( ⫝ - NONFORKING) + U+0338 ( ̸ - COMBINING LONG SOLIDUS OVERLAY).
  4. A decomposition is preferred to precomposed characters where the decomposition begins with a non-starter.
    Example: U+0344 ( ̈́ - COMBINING GREEK DIALYTIKA TONOS) is converted to U+0308 ( ̈ - COMBINING DIAERESIS (DIALYTIKA)) + U+0301 ( ́ - COMBINING ACUTE ACCENT (OXIA, TONOS)).
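The four examples above can be reproduced with Python's unicodedata module, as a rough check rather than a description of how MediaWiki itself performs the conversion. The helper nfc_code_points is a name chosen here for illustration.

```python
import unicodedata

def nfc_code_points(text: str) -> list:
    """Return the code points of NFC(text) in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]

# One single-character input per case in the list above.
CASES = {
    "1. singleton": "\u212B",                    # ANGSTROM SIGN
    "2. script-specific exclusion": "\u0958",    # DEVANAGARI LETTER QA
    "3. later precomposed character": "\u2ADC",  # FORKING
    "4. non-starter decomposition": "\u0344",    # COMBINING GREEK DIALYTIKA TONOS
}

for label, ch in CASES.items():
    print(label, f"U+{ord(ch):04X}", "->", " ".join(nfc_code_points(ch)))
```

In each case the single input character is not returned unchanged, which is what distinguishes these from ordinary precomposed letters such as Ç.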

In a number of common cases, Unicode's canonical ordering of two diacritics is counterintuitive, or interoperates poorly with certain existing software. In other, less common cases, the problem is that the diacritics should not have a canonical ordering at all, because the two orderings are not actually equivalent (that is, the two diacritics should have the same value for the Canonical_Combining_Class (ccc) property, but instead they have different ones). For example, Hebrew לִַ written as lamed + patah (U+05B7, ccc 17) + hiriq (U+05B4, ccc 14), which reads "lai", is normalized to lamed + hiriq + patah, which reads "lia".
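The reordering is driven entirely by the ccc values. The sketch below is a simplified version of the canonical ordering step, assuming input that is already decomposed: each run of non-starters (ccc other than 0) is stably sorted by ccc. The function canonical_reorder is written here for illustration and is not a library API. The code points for "lai" (lamed, patah, hiriq) are the usual encoding of that syllable.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Stably sort each run of combining marks by Canonical_Combining_Class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:   # a starter ends the current run
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

lai = "\u05DC\u05B7\u05B4"   # lamed + patah (ccc 17) + hiriq (ccc 14)
lia = canonical_reorder(lai)

print([unicodedata.combining(ch) for ch in lai])   # ccc values as typed
print([f"U+{ord(ch):04X}" for ch in lia])          # order after reordering
```

Because Python's sorted is stable, marks with equal ccc keep their typed order, which is the behavior the text argues these two Hebrew points should have had.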

As the conversion is automatic, pages cannot exist for a non-NFC form. Attempting to link explicitly to a non-NFC form such as U+212B will display the non-NFC form, but clicking the link takes the user to the NFC page Å.
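The effect on page titles can be modeled with a toy store that normalizes every title before using it as a key. PageStore is a hypothetical class invented for this sketch and does not reflect MediaWiki's actual storage code.

```python
import unicodedata

class PageStore:
    """Hypothetical title -> text store that keys pages by NFC title."""

    def __init__(self):
        self._pages = {}

    @staticmethod
    def _key(title: str) -> str:
        return unicodedata.normalize("NFC", title)

    def save(self, title: str, text: str) -> None:
        self._pages[self._key(title)] = text

    def load(self, title: str):
        # A lookup by a non-NFC title lands on the NFC page.
        return self._pages.get(self._key(title))

    def titles(self) -> list:
        return sorted(self._pages)

store = PageStore()
store.save("\u212B", "entry text")        # saved under ANGSTROM SIGN
print([f"U+{ord(t):04X}" for t in store.titles()])
print(store.load("\u00C5") == store.load("\u212B"))
```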

Display

One can display non-NFC characters on a page using {{HTML char}} (for example, {{HTML char|212B}} will show Å). To note canonical equivalence between two single characters, use {{normalization}} in the caption field of the appropriate {{character info}} template on the NFC character (see Å for an example). To note that the NFC of a precomposed character is a decomposition, use {{decomposed}} in the caption field of the appropriate {{character info}} template on the NFC decomposition (see क़ for an example).

Notes

Wikimedia does not enforce compatibility equivalence (the NFKC and NFKD forms), which merges even more forms together (such as N and ).
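The difference between canonical and compatibility normalization can be seen by running both on the same character. The sample characters below are chosen here for illustration and are not taken from the original page. The helper compare_forms is likewise an illustrative name.

```python
import unicodedata

def compare_forms(text: str) -> tuple:
    """Return (NFC, NFKC) forms of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )

# Characters with compatibility (not canonical) decompositions.
SAMPLES = {
    "FULLWIDTH LATIN CAPITAL LETTER N": "\uFF2E",
    "DOUBLE-STRUCK CAPITAL N": "\u2115",
    "LATIN SMALL LIGATURE FI": "\uFB01",
}

for name, ch in SAMPLES.items():
    nfc, nfkc = compare_forms(ch)
    print(f"{name}: NFC keeps it: {nfc == ch}, NFKC gives {nfkc!r}")
```

NFC leaves all three untouched, so each can have its own page, whereas NFKC would fold them into plain ASCII letters.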

See also