Apr 6, 2012

Python Module for Myanmar Language Processing

I am currently writing a crawler in Python to collect Burmese phrases from the Web. Since non-Unicode encodings such as wininnwa and zawgyi are still commonly used on Myanmar websites, I needed to convert that text into Unicode.

So I ported Thanlwinsoft's Myanmar converter, which is written in JavaScript, to Python. Since converting between those encodings is a common task in Myanmar language processing, I am releasing it here.

I have also written a Python script that can be used to convert those encodings from the command line.

Please note that I haven't tested it thoroughly, and this is an alpha release. So, don't use it without backing up first.

Install


For Debian/Ubuntu users, you can just grab the deb package and install it. For other platforms, download the archive and run this command in the extracted folder.

python setup.py install

Usage


Using myanmar python module

   >>> import myanmar.converter as converter
   >>> converter.get_available_encodings ()
   ['zawgyi', 'wwin_burmese', 'wininnwa', 'unicode']   
   >>> print converter.convert (u'ydkifoGefy½dk*&rfrif;', 'wininnwa', 'unicode')
   ပိုင်​သွန်​ပ​ရို​ဂ​ရမ်​မင်း

Using myanmar-converter script

trhura @ ~ $ myanmar-converter -h
Usage: myanmar-converter [OPTIONS...] [FILE...]

Convert between Myanmar legacy encodings and unicode.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -l, --list            List supported encodings.
  -f ENCODING, --from=ENCODING
                        Convert characters from ENCODING
  -t ENCODING, --to=ENCODING
                        Convert characters to ENCODING
  -o FILE, --output=FILE
                        Write output to FILE

trhura @ ~ $ echo 'wmrDe,frStokH;ðyykH' |  \
 myanmar-converter -f 'wwin_burmese' -t 'unicode'
တာ​မီ​နယ်​မှ​အသုံးပြု​ပုံ

Mar 6, 2012

Unicode Character Boundary Segmentation for Burmese


        Character boundary segmentation for Burmese is still not fully correct on the platforms that support Myanmar Unicode (Windows, OS X, ICU, Pango). Although the default character segmentation specification, described in Unicode Standard Annex #29, works well for most Burmese text, there are still some cases where it needs to be tailored to work properly for Burmese.

         Normally, a Burmese (user-perceived) character is a combination of a consonant and marks (medials, vowels, tones). However, this simple rule doesn't work when final (devowelized) consonants are involved. Final consonants are consonants with their inherent vowel killed. There are three ways final consonants appear in Burmese:
  1. Followed by U+103A (ASAT SIGN) (က်)
  2. Stacked together with succeeding consonants (က္က)
  3. As Kinzi (င်္)
         Two separate rules for segmentation of those final consonants need to be considered.

1. When followed by ASAT — (က်, မ်, င်, ည် …)

         They are segmented as separate characters on all platforms currently supporting Myanmar Unicode (ICU, Windows 8, Pango). For example, မောင် is segmented into two separate characters: မော and င်. Although they look visually separate, မောင် is considered one character in Burmese. Suppose we want to display “ဒိန်ချဉ်ဆိုင်” vertically, as on a signboard. It would be displayed as:
ဒိ
န်
ချ
ဉ်
ဆို
င်,
whereas it should be displayed as:
ဒိန်
ချဉ်
ဆိုင်

2. Kinzi and Stacked Consonants — (အင်္ဂါ, အက္ခရာ…)
 
        The default Unicode algorithm segments each consonant + mark combination as a character. However, since two consonants are conjoined in kinzis and stacked consonants, the “one consonant, one character” rule makes text selection and cursor movement confusing. Currently, Pango and ICU segment those conjoined consonants as two separate characters.
အင်္ဂါ = အ + င်္ + ဂါ
အက္ခရာ = အ + က္ + ခ + ရာ
        Windows 8 segments those conjoined characters as one character – which is the proper approach in my opinion.
အင်္ဂါ = အ + င်္ဂါ
အက္ခရာ = အ + က္ခ + ရာ
        There is another issue regarding U+104E (၎). MYANMAR SYMBOL AFOREMENTIONED (U+104E), unlike other symbols, is not an independent symbol. It is always written together with Nga, Asat and Visarga. So ၎င်း should be counted as one character, rather than two characters (၎ + င်း).
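
The tailoring described in this post can be sketched in Python roughly as follows. This is only a minimal sketch, not the algorithm used by any of the platforms mentioned: the function names (segment, is_base, is_mark, is_final) are made up for illustration, the character ranges cover only the core Burmese part of the Myanmar block, and the u'' prefixes are there so that it runs on both Python 2 and Python 3.3+.

```python
ASAT = u"\u103a"       # kills the inherent vowel of a consonant
VIRAMA = u"\u1039"     # stacks the following consonant (also part of kinzi)
DOT_BELOW = u"\u1037"  # tone mark that may be stored before asat


def is_base(ch):
    """Consonants, independent vowels, great sa and U+104E (assumed set)."""
    return u"\u1000" <= ch <= u"\u102a" or ch in u"\u103f\u104e"


def is_mark(ch):
    """Dependent vowels, tone marks, virama, asat and medials."""
    return u"\u102b" <= ch <= u"\u103e"


def is_final(cluster):
    """True for a consonant killed by asat, but not for kinzi."""
    marks = cluster[1:]
    if marks.startswith(DOT_BELOW):
        marks = marks[1:]
    return marks.startswith(ASAT) and not marks.startswith(ASAT + VIRAMA)


def segment(text):
    """Split text into user-perceived Burmese characters."""
    # Pass 1: base + marks, keeping stacked consonants and kinzi
    # together with the consonant that follows the virama.
    clusters = []
    for ch in text:
        last = clusters[-1] if clusters else u""
        in_myanmar = bool(last) and is_base(last[0])
        if in_myanmar and is_mark(ch):
            clusters[-1] += ch
        elif in_myanmar and is_base(ch) and last.endswith(VIRAMA):
            clusters[-1] += ch
        else:
            clusters.append(ch)

    # Pass 2: attach asat-final consonants to the preceding cluster.
    merged = []
    for cluster in clusters:
        if (merged and is_base(merged[-1][0])
                and is_base(cluster[0]) and is_final(cluster)):
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


# The signboard example from rule 1, written with escapes.
signboard = (u"\u1012\u102d\u1014\u103a"
             u"\u1001\u103b\u1009\u103a"
             u"\u1006\u102d\u102f\u1004\u103a")
vertical = u"\n".join(segment(signboard))  # one cluster per line
```

With these rules the signboard text comes out as three lines, the kinzi and stacked examples keep their conjoined consonants together, and U+104E stays attached to the Nga, Asat and Visarga that follow it.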

Jan 13, 2012

tic-tac-toe AI with two-ply minmax in scheme

Today, while I was reorganizing my old programming source code files, I stumbled upon this old gem: an implementation of a tic-tac-toe AI using two-ply minimax in Scheme, which was my project for an AI course at university.

Although that was only a year and a half ago, I have already forgotten how it works. Besides, at that time I wrote it using PLT Scheme, which has since become Racket. So I decided to find out whether the program still works under Racket and went through the unbearable pang of downloading Racket, a familiar emotion for Internet users inside Myanmar. After waiting for a few hours, I was ready to animate my old friend.



Yeah, it still works, and as far as I have tested, the AI is unbeatable. Since this kind of project is common in university AI courses, I decided to post it online to help some poor soul trying to meet an assignment deadline, as I once was.

Please note that the evaluation function is so complex and poorly written that even I had a hard time understanding it. So, you might want to rework that one.

To try out this program, just install Racket and run it.
sudo apt-get install racket
racket ttc.scm
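
For anyone who wants to rework the evaluation function, here is a minimal Python sketch of the two-ply idea. It is not a translation of the Scheme program: the board representation, the function names and the simple evaluation heuristic (counting the lines still open for each player) are assumptions made for illustration, and this version is not claimed to be unbeatable.

```python
# The eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def empty_cells(board):
    return [i for i, cell in enumerate(board) if cell == " "]


def play(board, index, mark):
    """Return a copy of the board with mark placed at index."""
    return board[:index] + [mark] + board[index + 1:]


def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def evaluate(board, me, opponent):
    """Hypothetical heuristic: reward lines that only I occupy and
    penalize lines that only the opponent occupies."""
    score = 0
    for line in LINES:
        cells = [board[i] for i in line]
        mine, theirs = cells.count(me), cells.count(opponent)
        if mine and not theirs:
            score += 10 ** (mine - 1)    # 1, 10 or 100
        elif theirs and not mine:
            score -= 10 ** (theirs - 1)
    return score


def best_move(board, me, opponent):
    """Two-ply search: my move (max), then the opponent's reply (min)."""
    best, best_score = None, None
    for move in empty_cells(board):
        after_me = play(board, move, me)
        if winner(after_me) == me:
            return move                  # take an immediate win
        replies = empty_cells(after_me)
        if replies:
            # Assume the opponent picks the reply that is worst for me.
            score = min(evaluate(play(after_me, r, opponent), me, opponent)
                        for r in replies)
        else:
            score = evaluate(after_me, me, opponent)
        if best_score is None or score > best_score:
            best, best_score = move, score
    return best
```

A board is a list of nine one-character strings, so `best_move(list("OO  X    "), "X", "O")` asks where X should play when O threatens the top row. Swapping in a different `evaluate` is the only change needed to try another heuristic.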

Jan 11, 2012

nautilus-renamer 3.0 released

Highlights
The main change is that nautilus-renamer now uses PyGObject instead of the static PyGTK bindings. In other words, it now uses GTK 3. The other big change is that the radio buttons in previous versions have been replaced with buttons, so that patterns, substitutions and case can be applied simultaneously. A Nautilus extension has also been added by Amr Osman. Debian packages are also available now.

Download

Downloads are available here. If you want to install it as a Nautilus script for a single user, download the source package, extract it and type "make localinstall" in a terminal. To install it as a Nautilus extension, download the deb package. After installation, "Mass Rename" will appear in the Nautilus context menu if you have selected more than 2 files.

Screenshots