We’ve been working on getting code into ICU so that Khmer Unicode text can be broken between words automatically, and the newest release of ICU now includes a Khmer word breaker.  Access is difficult unless you are a programmer, though, so we have made a small program that uses ICU and lets you run the Khmer word breaker on Linux (a Windows version will come soon).  We’ve only tested it on Ubuntu 11.x and there is still room for improvement, so please try it and let us know how it works for you and whether you run into any problems.

The word breaker is currently dictionary-based, so it works best on documents with correct spelling: a misspelled word will not be found in the dictionary and may be broken in the wrong place.  In the future we hope to add programming that deals better with “unknown” words.
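
To make the idea of dictionary-based breaking concrete, here is a minimal sketch in Python. It is not the code that ICU or our program uses; it is a simplified greedy longest-match approach, and the `segment` function and the four-word dictionary are made up for illustration. At each position it takes the longest dictionary word it can find, and any run of characters that matches nothing is kept together as one “unknown” piece.

```python
def segment(text, dictionary):
    """Split text into words using greedy longest-match against a dictionary.

    Characters that do not start any dictionary word are collected into a
    single "unknown" token rather than being dropped.
    """
    # Longest dictionary entry, in code points; bounds how far we look ahead.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    unknown = []  # characters seen since the last dictionary match
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # No dictionary word starts here: treat the character as unknown.
            unknown.append(text[i])
            i += 1
        else:
            if unknown:
                tokens.append("".join(unknown))
                unknown = []
            tokens.append(match)
            i += len(match)
    if unknown:
        tokens.append("".join(unknown))
    return tokens


# Hypothetical four-word dictionary ("I", "love", "language", "Khmer").
WORDS = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
DICTIONARY = set(WORDS)

# Khmer is written without spaces between words, so join them directly.
text = "".join(WORDS)
print(" | ".join(segment(text, DICTIONARY)))
```

A sketch like this also shows why spelling matters: a word that is misspelled, or simply missing from the dictionary, falls through to the “unknown” branch and cannot be broken reliably.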

To use the program in Ubuntu, place the Unicode .txt file you want to break in the same directory as sbbic-khmer-breaker.out.  Then open the console in that directory and type:

./sbbic-khmer-breaker.out yourinputfile.txt youroutputfile.txt

replacing the two file names with the names of your own input and output files.

Again, if you have any issues, please don’t hesitate to ask in the comments.

Download “SBBIC Khmer Word Breaker”: SBBIC-Khmer-Word-Breaker-1.0.zip (7.10 MB)
