One of the main obstacles to using Khmer easily on computers is word-breaking. Because Khmer is written without spaces between words, we have to insert a zero-width space (U+200B) between words so that computers can split them correctly. There have been projects seeking to automate Khmer word-breaking, but for the most part they are too slow, too inaccurate, or both. Recently SBBIC was able to help submit a patch for ICU (International Components for Unicode, http://site.icu-project.org/) that automatically splits Khmer words based on a large word dictionary. You can view the patch here: http://bugs.icu-project.org/trac/ticket/8329
We are hopeful that this patch will pave the way for breaking up Khmer words easily and accurately, and make using Khmer Unicode that much better than legacy fonts!
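To give a feel for what "splitting words based on a dictionary" means, here is a minimal Python sketch of greedy longest-match segmentation. It is only an illustration of the general idea, not the algorithm used in the ICU patch. The names `TOY_DICTIONARY`, `SAMPLE`, `segment` and `insert_zwsp` are made up for this example, and the five-word dictionary is a toy stand-in for a real word list.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Hypothetical toy dictionary, for illustration only.
# A real word-breaking dictionary would hold many thousands of entries.
TOY_DICTIONARY = {"ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ", "ភាសាខ្មែរ"}
SAMPLE = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"


def segment(text, dictionary):
    """Split text by always taking the longest dictionary word that fits."""
    max_len = max((len(word) for word in dictionary), default=0)
    words = []
    unknown = ""  # run of characters that matched no dictionary entry
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # No word starts here: keep the character in the unknown run.
            unknown += text[i]
            i += 1
            continue
        if unknown:
            words.append(unknown)
            unknown = ""
        words.append(match)
        i += len(match)
    if unknown:
        words.append(unknown)
    return words


def insert_zwsp(text, dictionary):
    """Return text with a zero-width space between the segmented words."""
    return ZWSP.join(segment(text, dictionary))
```

Calling `insert_zwsp(SAMPLE, TOY_DICTIONARY)` returns the sample sentence with invisible U+200B characters between the words it found. Greedy matching is simple and fast, but it picks the wrong split whenever a long word hides a valid break, which is one reason a large, well-checked dictionary matters so much.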
We still need to collect correctly broken Khmer Unicode documents to add to our Khmer corpus (to help create a better word-breaking dictionary), as well as unbroken documents to test the current code with.
We are also going to collect rules that help decide where a word starts and ends in Khmer, to make the word-breaker more accurate. If you know of any such rules, please leave them in the comments here so that we can add them.
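As a rough sketch of how correctly broken documents can feed a dictionary, the Python snippet below counts how often each word appears in text that already has zero-width spaces between words. The function name `count_words` is invented for this example, and it assumes the documents have already been read into a list of lines. It is not the process SBBIC actually uses.

```python
from collections import Counter

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def count_words(lines):
    """Count words in lines that are already broken with ZWSP."""
    counts = Counter()
    for line in lines:
        # Ordinary whitespace also separates words, so split on it first.
        for chunk in line.split():
            for word in chunk.split(ZWSP):
                if word:  # skip empty pieces from doubled separators
                    counts[word] += 1
    return counts
```

Words that turn up many times across many documents are good candidates for the dictionary, while words seen only once deserve a second look because they may be typos or wrongly placed breaks.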
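To show the kind of rule we mean, here is one example sketched in Python: a word should not start with a combining mark (such as a dependent vowel or a sign), and a break should not fall right after the coeng sign (U+17D2), because the consonant that follows it is a subscript of the previous one. The function name `is_plausible_break` is made up for this illustration, and the rule is only a filter for impossible positions. It does not decide where the real word boundaries are.

```python
import unicodedata

COENG = "\u17d2"  # KHMER SIGN COENG, joins a subscript consonant


def is_plausible_break(text, i):
    """Return False if a word break at index i would split a cluster."""
    if i <= 0 or i >= len(text):
        return True  # start and end of the text are always boundaries
    # A combining mark (category Mn or Mc) cannot begin a word.
    if unicodedata.category(text[i]).startswith("M"):
        return False
    # The character after a coeng is a subscript, not a new word.
    if text[i - 1] == COENG:
        return False
    return True
```

A check like this could be run over the output of any word-breaker to catch breaks that land inside a consonant cluster.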
7 Comments
Thanks so much for your effort in contributing these very beneficial features to Khmer.
I’m really curious whether this new feature has been released yet. In the future, will it be possible to add this feature to Microsoft Office, or even to a whole new Khmer keyboard?
Yes, the Khmer word-breaking function has been released in ICU, but there is currently no graphical user interface that lets users run the word-breaker, and to my knowledge no programs have implemented it in their software yet. Some programming work still needs to be done, but it breaks words a lot better than anything else I have seen.
If you or someone you know has experience in C++ and would like to volunteer to help build something that makes the current functions in ICU easier for people to access, please let me know.
Otherwise we will do our best to keep everyone informed on the progress of the word-breaker.
I’m actually planning to build something in PHP or VB.NET based on your code, but I find it hard to discover all those… I think porting it to PHP and to a Windows application would make it easier for all users to try it and even improve it.
Hi Jeff, it would be wonderful to see this in a more user-friendly application. ICU is a bit of a beast when it comes to usability, but it is the standard for many programs, which is why we pursued using it. I’m not sure if you know C++, but if you do, that would be the easiest route because the code is already in C++. For someone who knows C++, it shouldn’t be too hard to create a small application that does word-breaking for files. Let me know if there is anything we can do to help you.
May I ask whether this program supports Windows 7 Home Premium 64-bit?
Yes, it can. The code is in C++.
There is no user interface yet, but in the future we hope to have something that will be easy for users of all platforms to use.