Optical character recognition (OCR) software is almost a magical thing: it gives you the power to "summon" characters, words, sentences, and phrases from your favorite book directly into your favorite text editor. Of course, in this magic act the almighty hardware has an important role too, but it is only the brawn, whereas the OCR software is the brains.

Firstly, good OCR software has to be fully UTF-8 capable, meaning that it can recognize diacritics and special characters from languages and scripts such as Greek, Cyrillic, Swedish, Czech, Polish, Romanian, etc.

Besides the "classical" export options to formats such as pdf, doc, rtf, xls, etc., modern OCR software should also have integrated database capabilities. With database interoperability, the software can integrate with document management and monitoring tools for personal or corporate use.

There are four phases in the transformation process from an image containing text to a rich text format file:

1. a. The scanning process, which involves using hardware equipment to transform the page from its physical form into a "raw" electronic form, usually a Tagged Image File Format (TIFF) file. The ideal pages have well-contoured letters in a large font size. They should also contain very little "salt and pepper noise", which is caused by dust or dirt on the scanning surface or even on the document being scanned. Best practice is to use the highest resolution possible (minimum 300 dots per inch, abbreviated dpi) when scanning the document/page.

b. Not all image files with text in them are obtained by scanning as described above. Sometimes the user wants to take a snapshot of the screen and process the text from the resulting snapshot. In this case, the best practice is usually to have a minimum resolution of 600 dpi; the image has to be monochrome and zoomed in if possible.

2. After the image file is obtained, the next step is to process it in order to improve its quality, thus ensuring a better detection rate in the next phase of the transformation. For this, obviously, an image editor is needed. Some of the features that should be present in the image editor are:
- various filters to deskew, despeckle, and remove the background noise;
- basic tools for image editing such as zoom, rotate left and right, section selection, etc.;
- the possibility to create batches of files in order to automate the process when a large number of image files have to be processed.

3. The most important step is when the magic happens: the extraction of the text from the image as editable text. At this step, the user should be able to choose between various options that improve the detection rate, such as autocorrection, or to simply convert the common TIFF file into another format and save it for further use.

4. After the editable text is obtained, it is time for it to be processed and formatted as the user wants. In this case, obviously, ideal OCR software should contain a text editor that can handle export to various file formats such as PDF, doc/docx, xls/xlsx, rtf, odt, xml, html, etc.
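
To make the despeckle filter from step 2 more concrete, here is a minimal sketch of a 3x3 median filter, one common (though not the only) way to remove "salt and pepper noise" before text extraction. It assumes a grayscale image stored as a list of rows of integer pixel values (0 = black, 255 = white). The function name median_filter_3x3 and the sample data are hypothetical and are not taken from any particular OCR product.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Return a new image in which every pixel is replaced by the median
    of its 3x3 neighbourhood. Border pixels use only the neighbours that
    actually exist. `image` is assumed to be a list of equal-length rows
    of integers."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the pixel itself and its existing neighbours.
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value present in the data,
            # so no new gray levels are invented.
            row.append(median_low(neighbours))
        result.append(row)
    return result


# Hypothetical sample: a white 5x5 "page" with one isolated dark speck.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter_3x3(page)
```

An isolated speck is outvoted by its neighbours and disappears, while larger connected shapes such as letter strokes are mostly preserved. In practice an image library would do this far faster; the loop above is only meant to show the idea.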