Khmer OCR

Convert scanned Khmer documents into Khmer Unicode. There have been many attempts at creating a viable solution for converting scanned Khmer text into Khmer Unicode, but all have fallen short of actually being useful. This tool, however, uses machine learning, and with additional training data provided by volunteers it can “learn” to convert new fonts with very high accuracy. This makes the solution flexible and viable, because non-programmers can “teach” the software to correctly convert a scanned document into Khmer Unicode.

Khmer OCR Software by PAN

Optical character recognition (OCR) software for the Khmer language. It is trained only for Limon R1 (you can try it with other fonts, but the results might not be accurate). PAN Cambodia has since ceased developing this software, but you can use it as is. OCR is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text (source: Wikipedia). Download: 27.42 MB, downloaded 10920 times. UPDATE (6-26-2014): There are a few projects in the works concerning Khmer OCR: A 1 year project funded by…

Automatic Line-Breaking for Khmer Now Available!

We are pleased to announce that LibreOffice Pre-Release 3.6 (Download: LibO-Dev_3.6.0.0.beta2_Win_x86_install_multi.msi or newer) now incorporates the latest ICU version, which can automatically line-break Khmer Unicode (which we posted about previously here). This means you no longer have to manually add a zero-width space between words in order to get correct line breaks in your documents! The screenshots below show a sample document in LibreOffice 3.5 (which does not automatically line-break Khmer), a document with manual zero-width spaces added, and a document in LibreOffice Dev 3.6 with automatic Khmer line-breaking. As you can see, the results are looking good! LibreOffice…
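To illustrate why the manual zero-width spaces mattered, here is a rough Python sketch of a line wrapper that can only break where a zero-width space (U+200B) appears. It is not how LibreOffice or ICU work internally; the function name `wrap_at_zwsp` and the use of code-point counts as a stand-in for rendered width are assumptions made purely for illustration.

```python
ZWSP = "\u200b"  # zero-width space, the manual break marker discussed above


def wrap_at_zwsp(text, width):
    """Wrap text into lines, breaking only at zero-width spaces.

    `width` is counted in code points, which is only a crude proxy for
    the rendered width of Khmer text (an assumption of this sketch).
    """
    lines = []
    current = ""
    for word in text.split(ZWSP):
        # Start a new line when adding this word would exceed the width.
        if current and len(current) + len(word) > width:
            lines.append(current)
            current = word
        else:
            current += word
    if current:
        lines.append(current)
    return lines
```

With no zero-width spaces in the input, the whole text comes back as one overlong line, which is roughly the problem that automatic line-breaking removes.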

SBBIC Khmer Word Breaker Using ICU

We’ve been working on getting code into ICU to allow Khmer Unicode to automatically break between words and the newest release of ICU now includes a Khmer word breaker.  But access is difficult (unless you are a programmer).  So we have made a small program that uses ICU and will allow you to use the Khmer word breaker in Linux (Windows will come soon).  We’ve only tested this on Ubuntu 11.x so please test it and let us know if you have any problems. There is still room for improvement, so please let us know how it works for you.…

New Khmer Unicode Word Breaker in the Works

We’ve been testing a new Java application to use for Khmer word breaking. As you know, Khmer does not use spaces between words, and that causes some difficulties when using Khmer with a computer. We’ve tested a new Java application (click here to download the unmodified source or view link at the bottom to download the latest Khmer dictionary with a built version) against the two current solutions, and the results are promising (special thanks to Dave Jarvis, the author, for his willingness to let us use his application and even help us make it work with Khmer). Here’s…

KhmerOS Automatic Word Separation (ZWSP) Program

This program goes through a Khmer Unicode text in UTF-8 format and inserts ZWSP (zero-width space) characters between the words. It separates words using an internal dictionary (based on the Chuon Nat dictionary). It can handle UTF-8 files, even if they contain HTML/XML, and it can also deal with simple RTF files. Download: KhmerOS Automatic Word Separation (ZWSP) Program. NOTE: you need to have the Java Runtime Environment installed on your computer (which you can download here). It runs on any platform that has Java installed.
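The general idea of dictionary-based ZWSP insertion can be sketched in a few lines of Python. This is a minimal illustration only, assuming a greedy longest-match strategy; the actual algorithm used by the KhmerOS program is not described here and may differ. The names `insert_zwsp` and `SAMPLE_WORDS` are hypothetical, and the four-word dictionary is a toy stand-in for a real word list such as one based on the Chuon Nat dictionary.

```python
ZWSP = "\u200b"  # zero-width space

# Toy dictionary for illustration only (hypothetical, not the program's data).
SAMPLE_WORDS = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]


def insert_zwsp(text, dictionary):
    """Insert ZWSP between words using greedy longest-match lookup.

    Characters that match no dictionary word are grouped into a single
    unknown run rather than being split one by one.
    """
    max_len = max(map(len, dictionary), default=1)
    pieces = []
    unknown = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            unknown.append(text[i])
            i += 1
            continue
        if unknown:
            pieces.append("".join(unknown))
            unknown = []
        pieces.append(match)
        i += len(match)
    if unknown:
        pieces.append("".join(unknown))
    return ZWSP.join(pieces)
```

Calling `insert_zwsp("".join(SAMPLE_WORDS), set(SAMPLE_WORDS))` returns the same four words joined by zero-width spaces. Greedy matching is simple but can pick the wrong split when one word is a prefix of another, which is one reason real word breakers rely on larger dictionaries and more careful search.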
