
Tesseract OCR Released as Open Source by Google

Google recently released Tesseract as open source. Originally developed at HP Labs from 1985 to 1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. After the code had sat on the shelf gathering dust for many years, Google cleaned up some of its more outdated portions and released it for general use. Tesseract OCR is available for download at SourceForge.

A few things to know about Tesseract OCR: for now it supports only English, and it does not (yet) include a page layout analysis module, so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages. Still, despite its shortcomings, Tesseract is, as far as we know, far more accurate than any other open source OCR package. If you know of one that is more accurate, please do tell us!