Thura's Journal: Unicode Character Boundary Segmentation for Burmese

Character boundary segmentation for Burmese does not yet work properly on any of the platforms that support Myanmar Unicode (Windows, OS X, ICU, Pango). Although the default character segmentation specification, described in Unicode Standard Annex #29, works well for most Burmese text, there are still some cases where it needs to be tailored to work properly for Burmese.

Normally, a Burmese (user-perceived) character is a combination of a consonant and marks (medials, vowels, tones). However, this simple rule breaks down when final (devowelized) consonants are involved. A final consonant is a consonant whose inherent vowel has been killed. Final consonants appear in Burmese in three ways:

Followed by U+103A (MYANMAR SIGN ASAT) (က်)
Stacked together with succeeding consonants (က္က)
As Kinzi (င်္)
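
To make the three forms concrete, here is a minimal sketch that lists the code points behind each one. It assumes the standard Unicode 5.1+ encoding (ASAT is U+103A, the stacking VIRAMA is U+1039, kinzi is NGA + ASAT + VIRAMA); the `forms` table and the `describe` helper are names made up for this illustration.

```python
import unicodedata

# Assumed standard encodings of the three final-consonant forms.
forms = {
    "consonant + ASAT": "\u1000\u103A",          # KA, ASAT
    "stacked consonants": "\u1000\u1039\u1000",  # KA, VIRAMA, KA
    "kinzi": "\u1004\u103A\u1039",               # NGA, ASAT, VIRAMA
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in forms.items():
    print(label)
    for line in describe(text):
        print("   ", line)
```

Note that ASAT appears in both the first form and kinzi, so a segmentation rule has to look at what follows ASAT, not just at ASAT itself.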

Two separate rules for segmentation of those final consonants need to be considered.

1. When followed by ASAT — (က်, မ်, င်, ည် …)

On all platforms currently supporting Myanmar Unicode (ICU, Windows 8, Pango), a final consonant followed by ASAT is segmented as a separate character. For example, မောင် is segmented into two separate characters: မော and င်. Although the two parts look visually separate, မောင် is considered one character in Burmese. Suppose we want to display “ဒိန်ချဉ်ဆိုင်” vertically, as on a signboard. It will be displayed as:

ဒိ

န်

ချ

ဉ်

ဆို

င်,

where it should be displayed as:

ဒိန်

ချဉ်

ဆိုင်
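
The rule for ASAT-final consonants can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how ICU, Windows or Pango work internally: `simple_clusters` is a deliberately simplified stand-in for the default rules (one base character plus the combining marks after it, not a full UAX #29 implementation), and the consonant range U+1000..U+1021 and all function names are my own choices.

```python
import unicodedata

ASAT = "\u103A"       # MYANMAR SIGN ASAT
DOT_BELOW = "\u1037"  # MYANMAR SIGN DOT BELOW
VISARGA = "\u1038"    # MYANMAR SIGN VISARGA


def simple_clusters(text):
    """Simplified default: one base character plus the combining marks after it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def is_consonant(ch):
    # Assumption: U+1000..U+1021 covers the Burmese consonants of interest.
    return "\u1000" <= ch <= "\u1021"


def is_final(cluster):
    """True for consonant + ASAT, optionally with tone marks, and nothing else."""
    marks = set(cluster[1:])
    return (
        is_consonant(cluster[0])
        and ASAT in marks
        and marks <= {ASAT, DOT_BELOW, VISARGA}
    )


def segment_with_finals(text):
    """Attach each ASAT-final consonant to the cluster before it."""
    out = []
    for cluster in simple_clusters(text):
        if out and is_consonant(out[-1][0]) and is_final(cluster):
            out[-1] += cluster
        else:
            out.append(cluster)
    return out


# The signboard text from above, written with escapes so the code points are explicit.
signboard = (
    "\u1012\u102D\u1014\u103A"        # first syllable
    "\u1001\u103B\u1009\u103A"        # second syllable
    "\u1006\u102D\u102F\u1004\u103A"  # third syllable
)
print("\n".join(segment_with_finals(signboard)))
```

With the simplified default, the signboard text splits into six pieces; with the extra rule it splits into the three characters a Burmese reader expects. Kinzi is left alone here because its cluster also contains VIRAMA.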

2. Kinzi and Stacked Consonants — (အင်္ဂါ, အက္ခရာ…)

The default Unicode algorithm segments each consonant + mark combination as one character. However, because two consonants are conjoined in kinzi and stacked consonants, this “one consonant, one character” rule makes text selection and cursor movement confusing. Currently, Pango and ICU segment those conjoined consonants as separate characters.

အင်္ဂါ = အ + င်္ + ဂါ
အက္ခရာ = အ + က ္ + ခ + ရာ

Windows 8 segments those conjoined consonants as one character, which is the proper approach in my opinion.

အင်္ဂါ = အ + င်္ဂါ
အက္ခရာ = အ + က္ခ + ရာ
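
The conjoined behaviour can be approximated with one extra rule on top of a simplified cluster pass: if a cluster ends with VIRAMA (U+1039), glue the next cluster onto it. The sketch below is my own illustration, not the Windows 8 implementation; `simple_clusters` is a simplified stand-in for the default rules (base plus following combining marks), and the function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA, the invisible stacking sign


def simple_clusters(text):
    """Simplified default: one base character plus the combining marks after it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def segment_with_stacks(text):
    """Keep kinzi and stacked consonants together with the consonant they sit on."""
    out = []
    for cluster in simple_clusters(text):
        if out and out[-1].endswith(VIRAMA):
            # The previous cluster ended in VIRAMA, so this consonant is conjoined.
            out[-1] += cluster
        else:
            out.append(cluster)
    return out


kinzi_word = "\u1021\u1004\u103A\u1039\u1002\u102B"    # the kinzi example above
stacked_word = "\u1021\u1000\u1039\u1001\u101B\u102C"  # the stacked example above

for word in (kinzi_word, stacked_word):
    print(" + ".join(segment_with_stacks(word)))
```

Because kinzi also ends in VIRAMA, the same rule covers both cases without a separate check for NGA + ASAT.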

There is another issue regarding U+104E (၎). Unlike other Myanmar symbols, ၎ (U+104E) is not an independent symbol: it is always written together with Nga, Asat, and Visarga. So ၎င်း should be counted as one character rather than two (၎ + င်း).
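
As a rough sketch, the U+104E case can be handled as a special pair: when the symbol is immediately followed by the exact sequence Nga + Asat + Visarga, keep the two pieces together. The cluster pass is again a simplified stand-in for the default rules, and the constant and function names are assumptions made for this example.

```python
import unicodedata

AFOREMENTIONED = "\u104E"                # MYANMAR SYMBOL AFOREMENTIONED
NGA_ASAT_VISARGA = "\u1004\u103A\u1038"  # NGA, ASAT, VISARGA


def segment_with_104e(text):
    """Simplified clusters, with U+104E glued to a following Nga + Asat + Visarga."""
    # Pass 1: one base character plus the combining marks after it.
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)

    # Pass 2: merge the symbol with the exact sequence that follows it.
    merged = []
    for cluster in clusters:
        if merged and merged[-1] == AFOREMENTIONED and cluster == NGA_ASAT_VISARGA:
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


print(len(segment_with_104e(AFOREMENTIONED + NGA_ASAT_VISARGA)))
```

Matching the exact sequence keeps the rule narrow: U+104E followed by anything else is still segmented as before.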

1 comment:

Ngwe Tun (Solveware Solution)Mar 6, 2012, 8:58:00 PM
The current implementation in ICU works well for user-perceived character segmentation of Unicode text, including Myanmar. So you may need to specify which boundary you mean: character, syllable, word, or cluster.
Just check here. http://userguide.icu-project.org/boundaryanalysis


Mar 6, 2012
