In 2004, Google launched an ambitious project under the simple name Google Print, later renamed Google Book Search and now known simply as Google Books. The goal: to make the books of the entire world available on the internet. If you thought this was not a scientific task but merely one that required a lot of people to scan a lot of books and upload them, think again. Pause for a moment and consider that an image file (the typical output of a scanned document) cannot be searched for a particular keyword. Yet when you type a keyword into Google Books, it is found and shown in the scanned book at the place where it occurs. How do they do it?
The answer to that question is Optical Character Recognition, commonly abbreviated as OCR. OCR is the process by which an image is searched for typewritten, handwritten or printed text, which is then converted into a machine-readable, editable text format. The input documents are typically scanned files in a generic image format or in PDF format. OCR in the form available today traces its origins to the 1950s, when a cryptanalyst at the US Armed Forces Security Agency invented a machine that could convert printed documents into a machine-readable form for computer processing. Since then, the field has seen a number of innovations, in step with developments in information technology in general. Even today, OCR is a challenging research field with widespread commercial applications, such as book search and indexing, postal address recognition, and the conversion of government documents into electronic archives.
The primary approach to OCR is structural analysis and pattern matching: the shapes occurring in the image are compared statistically with the letters of the language, and the closest candidate for each shape is selected and output as machine-readable text. Early OCR systems were specific to a particular font, but OCR systems today can recognize characters in most of the fonts available for a language.
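To make the idea of picking the closest candidate concrete, here is a minimal sketch of template matching in Python. It is a toy illustration under stated assumptions, not how any particular OCR product works: the 5x3 glyph bitmaps are made up for the example, the function names (similarity, recognize, recognize_line) are hypothetical, and the "statistical" comparison is reduced to the fraction of pixels on which two bitmaps agree. Real systems first have to segment the page into characters and use far richer features.

```python
# Toy templates: 5 rows x 3 columns, '#' = ink, '.' = background.
# These shapes are invented for illustration, not taken from a real font.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which two equally sized bitmaps agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += (g == t)
    return same / total


def recognize(glyph):
    """Return (letter, score) for the template closest to the glyph."""
    scores = {letter: similarity(glyph, tpl) for letter, tpl in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def recognize_line(glyphs):
    """Recognize a sequence of already segmented glyphs as a string."""
    return "".join(recognize(glyph)[0] for glyph in glyphs)


# An "L" with one stray ink pixel, standing in for scanner noise.
NOISY_L = ["#..", "#..", "#.#", "#..", "###"]
```

Calling recognize(NOISY_L) still selects "L", because that template agrees with the noisy glyph on more pixels than any other. The same principle lets a matcher tolerate small scanning defects, and it also hints at why a matcher built around one font's templates struggles with another font.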
Popular OCR software today includes Ocrad, ABBYY FineReader, Brainware, and Tesseract, of which ABBYY FineReader and Tesseract offer multi-language support. Some of these, such as ABBYY FineReader and Brainware, are commercial products that must be purchased before use, while Ocrad and Tesseract are free software. Between them, these programs accept several image formats, such as JPEG, TIFF and GIF, as well as PDF, and output the result in a standard text document format.
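As a rough illustration of this image-in, text-out workflow, the sketch below calls the Tesseract command-line tool from Python. It assumes the tesseract executable is installed and on the PATH, and it relies on the tool's basic invocation of an input image, an output base name, and the -l language option. The helper names (build_tesseract_command, ocr_to_text) are hypothetical, chosen only for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_to_text(image_path, output_base, lang="eng"):
    """Run Tesseract on an image and return the recognized text.

    Assumes the tesseract executable is installed and on the PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    # By default Tesseract writes plain text to "<output base>.txt".
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With a scanned page saved under a placeholder name such as page.tiff, calling ocr_to_text("page.tiff", "page") would write page.txt and return its contents. The language code can be changed if the matching language data is installed.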
Even so, most OCR programs today are specific to one language (or a few related languages) for which they are tailor-made, and more often than not that language is English. Online solutions offer the latest multi-language OCR technology without requiring you to download licensed software onto your PC. Moreover, they can be used free of charge, and the output file is ready for download immediately, with no need to submit your email address and wait for the file to arrive in your inbox.