Apr 6, 2012

Python Module for Myanmar Language Processing

I am currently writing a crawler in Python to collect Burmese phrases from the Web. Since non-Unicode encodings such as wininnwa and zawgyi are still commonly used on Myanmar websites, I needed to convert that text to Unicode.

So I ported Thanlwinsoft's Myanmar converter, which is written in JavaScript, to Python. Since converting these encodings is a common task in Myanmar language processing, I am releasing it here.

I have also written a Python script that converts between these encodings from the command line.

Please note that I haven't tested it thoroughly, and this is an alpha release. Back up your files before using it.

Install


Debian/Ubuntu users can just grab the deb package and install it. On other platforms, download the archive and run this command in the extracted folder:

python setup.py install

Usage


Using myanmar python module

   >>> import myanmar.converter as converter
   >>> converter.get_available_encodings()
   ['zawgyi', 'wwin_burmese', 'wininnwa', 'unicode']
   >>> print converter.convert(u'ydkifoGefy½dk*&rfrif;', 'wininnwa', 'unicode')
   ပိုင်​သွန်​ပ​ရို​ဂ​ရမ်​မင်း
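
Since this is an alpha release, it is probably safest to wrap file conversions in a small helper that makes a backup first. The following is only a rough sketch, not part of the module: `convert_file` and the `.bak` suffix are hypothetical names of mine, and it assumes the input file is stored as UTF-8 text. The `convert` argument is expected to be any callable with the same (text, from, to) signature as `converter.convert` shown above.

```python
import io
import shutil


def convert_file(path, convert, source="zawgyi", target="unicode"):
    """Convert the text file at `path` in place and return the backup path.

    `convert` is a callable taking (text, from_encoding, to_encoding),
    for example myanmar.converter.convert. The helper name, the ".bak"
    suffix and the UTF-8 assumption are illustrative, not part of the module.
    """
    backup = path + ".bak"
    shutil.copy2(path, backup)  # keep an untouched copy before converting

    with io.open(path, encoding="utf-8") as handle:
        text = handle.read()

    converted = convert(text, source, target)

    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(converted)
    return backup


# Possible usage, assuming the module is installed:
#   import myanmar.converter as converter
#   convert_file("page.txt", converter.convert, "zawgyi", "unicode")
```

If the converted text looks wrong, the original is still available in the `.bak` file next to it.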

Using myanmar-converter script

trhura @ ~ $ myanmar-converter -h
Usage: myanmar-converter [OPTIONS...] [FILE...]

Convert between Myanmar legacy encodings and unicode.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -l, --list            List supported encodings.
  -f ENCODING, --from=ENCODING
                        Convert characters from ENCODING
  -t ENCODING, --to=ENCODING
                        Convert characters to ENCODING
  -o FILE, --output=FILE
                        Write output to FILE

trhura @ ~ $ echo 'wmrDe,frStokH;ðyykH' |  \
 myanmar-converter -f 'wwin_burmese' -t 'unicode'
တာ​မီ​နယ်​မှ​အသုံးပြု​ပုံ
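
The same pipe can also be driven from Python, which may be handy if you don't want to import the module directly. This is a rough sketch only: `build_command` and `convert_with_script` are hypothetical helper names, the flags are taken from the help text above, and I am assuming the script is on your PATH and reads and writes UTF-8.

```python
import subprocess


def build_command(source, target, output=None):
    """Build the myanmar-converter argument list from the -f/-t/-o flags."""
    command = ["myanmar-converter", "-f", source, "-t", target]
    if output is not None:
        command += ["-o", output]  # write to a file instead of stdout
    return command


def convert_with_script(text, source, target):
    """Pipe `text` through the script and return what it prints.

    Assumes myanmar-converter is installed and uses UTF-8 for input and output.
    """
    result = subprocess.run(
        build_command(source, target),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout.decode("utf-8")


# Possible usage, mirroring the shell example above:
#   print(convert_with_script(u"wmrDe,frStokH;ðyykH", "wwin_burmese", "unicode"))
```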