We are pleased to announce that the LibreOffice 3.6 pre-release (Download: LibO-Dev_3.6.0.0.beta2_Win_x86_install_multi.msi or newer) now incorporates the latest ICU version, which can automatically line-break Khmer Unicode (which we posted about previously here). This means you no longer have to manually add a zero-width space between words to get correct line breaks in your documents! The screenshots below show a sample document in LibreOffice 3.5 (which does not automatically line-break Khmer), a document with manually added zero-width spaces, and a document in LibreOffice Dev 3.6 with automatic Khmer line-breaking. As you can see, the results are looking good!
Automatic word-breaking does not yet work for spell checking, so to spell check in Khmer you will still need to manually add zero-width spaces between words. Even so, this is a great step forward for the Khmer language on computers! Hopefully, in the near future we will no longer need to manually add spaces between Khmer words in order to spell check.
Please try out the new LibreOffice pre-release and let us know how it works for you. If you run into any line-breaking issues (for example, a line that breaks in the wrong place), please tell us in the comments so we can debug them and improve the accuracy of the word-breaker in ICU. Special thanks to George for helping us make this a reality.
13 Comments
I am curious how this line breaking mechanism plays with names of places and/or people (or other words not in the corpus). Any ideas?
Yes, since it is a dictionary-based line-breaker, it will have trouble with words not in the dictionary. We have implemented some rules that help (like never breaking a word after the jung (coeng) sign ្ ), but more work still needs to be done. If you have any insight, it would be appreciated. You can see the code here: http://source.icu-project.org/repos/icu/icu/trunk/source/common/dictbe.cpp
We’ve been in contact with someone who has experience using a Hidden Markov Model with Khmer, but he has been quite busy and has not had the time to figure out a way to implement it with ICU.
I used some rules to break the words correctly for the concordance. A developer friend of mine made a program that, based on rules and a word list of mine, would break the text into individual words. So, the rules were very important. I can pass the rules on to you if you are interested.
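To make the dictionary-based idea above a little more concrete, here is a minimal Python sketch. It is not the ICU implementation in dictbe.cpp; it assumes a simple greedy longest-match strategy, and the function name `segment` and the tiny word list are made up for illustration. It only shows how a word list proposes break positions and how a rule such as "never break after the coeng sign" can filter them.

```python
COENG = "\u17d2"  # KHMER SIGN COENG: stacks the following consonant as a subscript


def segment(text, dictionary):
    """Split text into pieces using greedy longest-match against a word set.

    Illustrative only: ICU's real break engine is more sophisticated.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    breaks = set()
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                breaks.update((i, i + length))  # allow a break on both sides
                i += length
                break
        else:
            i += 1  # unknown character: no break is proposed here

    # Rule from the discussion: never break directly after the coeng sign.
    allowed = sorted(
        p for p in breaks if 0 < p < len(text) and text[p - 1] != COENG
    )

    pieces, start = [], 0
    for p in allowed:
        pieces.append(text[start:p])
        start = p
    pieces.append(text[start:])
    return pieces


# Hypothetical four-word dictionary ("I", "love", "language", "Khmer").
WORDS = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]

if __name__ == "__main__":
    print(" | ".join(segment("".join(WORDS), set(WORDS))))
```

Names of people and places that are missing from the word list end up as unbroken runs between known words, which is the weakness raised in the question above.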
Hi Adam,
It would be great to get the rules – hopefully it will help the ICU break iterator perform with higher accuracy.
Is there any tutorial on how to do it?
Hello Bunthearith,
Just download LibreOffice 3.6 and then when you write in Khmer it will automatically line break for you. If you have any trouble, let us know.
Can this program be used with Windows 8 64-bit?
Yes it can.
How about Windows 7 64-bit?
How can I download this software?
I can't see the pictures. How does it work? When I use it, I don't seem to get any auto breaking at all.
After I open a document, do I need to click a button to make the software insert zero-width spaces between words, or will it do that by itself once the document is open?
Hello Boran,
LibreOffice will automatically break Khmer words (you won’t be able to tell except at line breaks). You can also add your own zero-width spaces if you want to control it manually.
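For anyone preparing text outside LibreOffice, the manual approach can be sketched in a few lines of Python. This assumes you already have the words separated yourself; the helper names `join_with_zwsp` and `strip_zwsp` are hypothetical and not part of LibreOffice or ICU. The zero-width space is the Unicode character U+200B.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible, but marks an allowed break point


def join_with_zwsp(words):
    """Join already-separated words with zero-width spaces between them."""
    return ZWSP.join(words)


def strip_zwsp(text):
    """Remove all zero-width spaces, e.g. to rely on automatic breaking instead."""
    return text.replace(ZWSP, "")


if __name__ == "__main__":
    marked = join_with_zwsp(["ភាសា", "ខ្មែរ"])
    # The marked text looks identical on screen but is one character longer.
    print(len(marked) - len(strip_zwsp(marked)))
```

Text marked this way displays the same as unmarked text; the extra characters only matter at line ends and to tools, such as the spell checker, that rely on them to find word boundaries.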