Khmer Word-Breaking Patch for ICU Coming Soon! - Society for Better Books in Cambodia

One of the main obstacles to using Khmer easily on computers is word-breaking. Because Khmer is written without spaces between words, we are forced to insert a zero-width space (U+200B) between words so that computers can split them correctly. There have been projects seeking to automate Khmer word-breaking, but for the most part they are too slow, too inaccurate, or both. Recently SBBIC was able to help submit a patch for ICU (International Components for Unicode, http://site.icu-project.org/) that automatically splits Khmer words based on a large word dictionary. You can view the patch here: http://bugs.icu-project.org/trac/ticket/8329
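To give a rough feel for what dictionary-based word-breaking means, here is a minimal Python sketch. It is not the algorithm used in the ICU patch. It uses plain greedy longest-match, one of the simplest dictionary techniques, and the four-word dictionary and the function name `segment` are made up for illustration only.

```python
ZWSP = "\u200b"  # zero-width space, the character currently typed by hand


def segment(text, dictionary):
    """Split text into words using greedy longest-match against a dictionary.

    Characters that match no dictionary word are grouped into a single
    unknown chunk instead of being split one by one.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    unknown = ""
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            unknown += text[i]
            i += 1
        else:
            if unknown:
                words.append(unknown)
                unknown = ""
            words.append(match)
            i += len(match)
    if unknown:
        words.append(unknown)
    return words


# Hypothetical toy dictionary: "I", "love", "language", "Khmer".
toy_words = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
toy_dictionary = set(toy_words)
sentence = "".join(toy_words)  # the same words written without any breaks

broken = ZWSP.join(segment(sentence, toy_dictionary))
```

Greedy matching goes wrong whenever a long dictionary word happens to swallow the start of the next word, which is one reason a large, accurate dictionary and extra rules matter so much.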

We are hopeful that this patch will pave the way for breaking Khmer words accurately and easily, and make using Khmer Unicode that much better than legacy fonts!

We still need to collect two kinds of Khmer Unicode documents: documents whose words have already been broken correctly, to add to our Khmer corpus and help create a better word-breaking dictionary, and documents that have not yet been broken, to test the current code with.
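As a rough illustration of how both kinds of documents could be used, the Python sketch below counts words in correctly broken text and scores a word-breaker's output against a hand-broken version. All function names are hypothetical, and it assumes the only separator is the zero-width space; real documents also contain ordinary spaces and punctuation that would need handling.

```python
from collections import Counter

ZWSP = "\u200b"  # zero-width space used as the word separator


def words_from_broken_text(text):
    """Return the words of a document whose words are separated by ZWSP."""
    return [w for w in text.split(ZWSP) if w]


def build_word_counts(documents):
    """Count how often each word occurs across correctly broken documents."""
    counts = Counter()
    for doc in documents:
        counts.update(words_from_broken_text(doc))
    return counts


def break_positions(words):
    """Return the character offsets at which a break falls between words."""
    positions = set()
    offset = 0
    for w in words[:-1]:
        offset += len(w)
        positions.add(offset)
    return positions


def precision_recall(gold_words, predicted_words):
    """Compare predicted breaks with hand-made (gold) breaks."""
    gold = break_positions(gold_words)
    pred = break_positions(predicted_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall
```

Precision here is the share of predicted breaks that are correct, and recall is the share of the hand-made breaks that the word-breaker found.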

We are also going to collect rules for deciding where a Khmer word starts and ends, to help make the word-breaker more accurate. If you know of any such rules, please leave them in the comments here so that we can add them.

Thanks!
