Character boundary segmentation for
Burmese is still not working properly on the platforms that support
Myanmar Unicode (Windows, OS X, ICU, Pango). The default
character segmentation specification, described in Unicode Standard
Annex #29, works well for most Burmese text, but there are
still some cases where it needs to be tailored to work properly for
Burmese.
Normally, Burmese (user-perceived)
characters are a combination of a consonant + marks (medials, vowels,
tones). However, this simple rule doesn't work when final
(devowelized) consonants are involved. Final consonants are
consonants with their inherent vowel killed. There are three ways
final consonants appear in Burmese:
- Followed by U+103A (MYANMAR SIGN ASAT) (က်)
- Stacked together with succeeding consonants (က္က)
- As Kinzi (င်္)
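As a rough illustration, the three forms can be told apart by looking at the code points that follow a consonant. This is only a sketch of mine, not any platform's implementation: the function name `final_consonant_forms` is made up, it only treats U+1000 to U+1021 as consonants, and it only looks at the marks directly after the consonant (so a dot below written before the Asat is not handled).

```python
ASAT = "\u103A"    # MYANMAR SIGN ASAT
VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA (the invisible stacker)
NGA = "\u1004"     # MYANMAR LETTER NGA


def is_consonant(ch):
    # Assumption: only the basic Burmese consonants KA..A are considered.
    return "\u1000" <= ch <= "\u1021"


def final_consonant_forms(text):
    """Return (index, form) for every final consonant found in text."""
    forms = []
    i = 0
    while i < len(text):
        ch = text[i]
        following = text[i + 1:i + 3]
        if not is_consonant(ch):
            i += 1
        elif ch == NGA and following == ASAT + VIRAMA:
            forms.append((i, "kinzi"))    # NGA + ASAT + VIRAMA
            i += 3
        elif following.startswith(VIRAMA):
            forms.append((i, "stacked"))  # consonant + VIRAMA + consonant
            i += 2
        elif following.startswith(ASAT):
            forms.append((i, "asat"))     # consonant + ASAT
            i += 2
        else:
            i += 1
    return forms
```

The Kinzi check has to come before the plain Asat check, because a Kinzi also starts with Nga followed by Asat.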
Two separate rules for segmentation of
those final consonants need to be considered.
1. When followed by ASAT (က်, မ်, င်, ည် …)
They are segmented as separate
characters on all platforms currently supporting Myanmar Unicode
(ICU, Windows 8, Pango). For example, မောင်
is segmented into two separate characters: မော
and င်.
Although they seem visually separate, မောင်
is considered one character in Burmese. Suppose
we want to display “ဒိန်ချဉ်ဆိုင်”
vertically, as on a signboard. With the current segmentation it is
stacked as ဒိ, န်, ချ, ဉ်, ဆို, င် (one piece per line),
where it should be stacked as ဒိန်, ချဉ်, ဆိုင်.
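To make rule 1 concrete, here is a minimal sketch of the kind of tailoring I have in mind. It is not ICU, Pango or Windows code. `DEFAULT_CLUSTER` is only a rough stand-in for the untailored behavior described above (one character plus the Myanmar dependent signs that follow it), not a full UAX #29 implementation, and the names `FINAL_WITH_ASAT` and `segment_asat_tailored` are made up for this example.

```python
import re

# Rough stand-in for the untailored behavior: any character followed by
# Myanmar dependent signs (U+102B..U+103E). Not a UAX #29 implementation.
DEFAULT_CLUSTER = re.compile(r"[\s\S][\u102B-\u103E]*")

# A final consonant written with ASAT (U+103A), optionally carrying a
# dot below (U+1037) or visarga (U+1038). No vowels, medials or virama.
FINAL_WITH_ASAT = re.compile(r"[\u1000-\u1021]\u1037?\u103A[\u1037\u1038]*")


def starts_with_myanmar_letter(cluster):
    # Assumption: U+1000..U+102A covers the letters we want to attach to.
    return "\u1000" <= cluster[0] <= "\u102A"


def segment_asat_tailored(text):
    """Attach each consonant + ASAT piece to the cluster before it."""
    clusters = []
    for match in DEFAULT_CLUSTER.finditer(text):
        piece = match.group()
        if (clusters
                and FINAL_WITH_ASAT.fullmatch(piece)
                and starts_with_myanmar_letter(clusters[-1])):
            clusters[-1] += piece  # final consonant joins the previous cluster
        else:
            clusters.append(piece)
    return clusters
```

For the signboard text above, `DEFAULT_CLUSTER.findall(...)` gives six pieces, while `segment_asat_tailored(...)` gives three. Kinzi and stacked consonants are deliberately left alone here, since they fall under the second rule.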
2. Kinzi and Stacked Consonants (အင်္ဂါ, အက္ခရာ …)
The default Unicode algorithm
segments each consonant + mark combination as a character. However,
since two consonants are conjoined in kinzi and stacked consonants,
the “one consonant, one character” rule makes text selection and
cursor movement confusing. Currently, Pango and ICU segment those conjoined consonants as two separate characters.
အင်္ဂါ = အ + င်္ + ဂါ
အက္ခရာ = အ + က ္ + ခ + ရာ
Windows 8 segments those conjoined characters as
one character – which is the proper approach in my opinion.
အင်္ဂါ = အ + င်္ဂါ
အက္ခရာ = အ + က္ခ + ရာ
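The Windows 8 style grouping can be sketched with a single regular expression. This is my own approximation of the behavior, not Microsoft's actual rule set: the names `CLUSTER` and `segment_conjoined` are hypothetical, only U+1000 to U+1021 are treated as consonants, and the Asat rule from section 1 is not applied here.

```python
import re

CONSONANT = r"[\u1000-\u1021]"

CLUSTER = re.compile(
    r"(?:\u1004\u103A\u1039)?"             # optional Kinzi: NGA + ASAT + VIRAMA
    + CONSONANT
    + r"(?:\u1039" + CONSONANT + r")*"     # stacked consonants joined by VIRAMA
    + r"[\u102B-\u1038\u103A-\u103E]*"     # dependent signs, excluding VIRAMA
    + r"|[\s\S]"                           # anything else is its own unit
)


def segment_conjoined(text):
    """Keep Kinzi and stacked consonants inside one cluster."""
    return [match.group() for match in CLUSTER.finditer(text)]
```

With this pattern, အင်္ဂါ comes out as two pieces and အက္ခရာ as three, matching the Windows 8 breakdown shown above.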
There is another issue regarding U+104E (၎). Myanmar symbol ၎ (U+104E), unlike other symbols, is not an independent symbol. It is always written together with Nga, Asat, and Visarga. So ၎င်း should be counted as one character, rather than two characters (၎ + င်း).
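One simple way to express this is to match the whole four-code-point sequence before falling back to ordinary clusters. Again this is only a sketch: `segment_with_104e` is a made-up name, and the fallback pattern is a rough approximation of untailored behavior rather than real UAX #29 rules.

```python
import re

# U+104E followed by NGA, ASAT, VISARGA, kept together as one unit.
AFOREMENTIONED_UNIT = "\u104E\u1004\u103A\u1038"

# Try the full sequence first; otherwise take one character plus any
# Myanmar dependent signs (U+102B..U+103E) that follow it.
UNIT = re.compile(AFOREMENTIONED_UNIT + r"|[\s\S][\u102B-\u103E]*")


def segment_with_104e(text):
    return UNIT.findall(text)
```

If U+104E is not followed by the full Nga, Asat, Visarga sequence, it simply stays a unit of its own.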
The current implementation in ICU works well for user-perceived character segmentation of Unicode text, including Myanmar. So you may need to specify Character, Syllable, Word and Cluster.
Just check here: http://userguide.icu-project.org/boundaryanalysis