TopOCR's OCR Engines
TopOCR is equipped with two different OCR engines and lets you select the OCR engine that's best suited for your particular documents.
You can select the TopOCR engine, which offers reasonable accuracy on high quality images combined with very fast processing speed.
Or you can select Tesseract OCR, which provides much higher accuracy on images where the characters are distorted in some way, at the expense of much more processing time.
What do we mean by distortion? Characters can become distorted for a number of reasons.
The page can be skewed, or the curvature of a book's pages can warp the characters.
Sometimes the properties of the paper they're printed on can lead to distortion.
For example, newsprint and paperback books are printed on a cheaper type of paper where the ink can "bleed" into the paper, making the text appear "fuzzy".
Magazines and hardback books generally yield higher quality images because they are printed on better paper.
Whichever OCR engine you select, its accuracy is greatly enhanced by TopOCR's image processing functions, which provide ultra-fast fixed-point adaptive binarization, background removal, document layout analysis, small print enhancement and column straightening.
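To give a feel for what adaptive binarization with integer-only arithmetic can look like, here is a minimal Python sketch. It is not TopOCR's actual algorithm, which is not published here. It swaps in a common local-mean technique (in the style of Bradley-Roth thresholding with an integral image), and the function name, window size and percentage are all assumptions chosen for illustration.

```python
def adaptive_binarize(gray, window=15, percent=15):
    """Return a matrix of 0/1 flags, where 1 marks a pixel judged to be ink.

    gray    -- list of equal-length rows of integer gray levels (0 = black)
    window  -- side length of the square neighbourhood (illustrative default)
    percent -- how far below the local mean a pixel must be to count as ink
    """
    height, width = len(gray), len(gray[0])

    # Integral image with an extra zero row and column, so any window sum
    # is four lookups regardless of window size.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = window // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            count = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            # Integer-only form of: pixel < local_mean * (1 - percent / 100)
            if gray[y][x] * count * 100 < total * (100 - percent):
                result[y][x] = 1
    return result


# A bright page and the same page at half brightness (for example, in shadow).
bright = [[200] * 5 for _ in range(5)]
bright[2][2] = 20  # one dark "ink" pixel
dim = [[value // 2 for value in row] for row in bright]

ink_bright = adaptive_binarize(bright, window=3)
ink_dim = adaptive_binarize(dim, window=3)
```

Because each pixel is compared with the mean of its own neighbourhood, the bright and dim pages produce the same result, whereas a single global threshold of 128 would mark every pixel of the dim page as ink.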
TopOCR (Shape Analysis Static Classifier Architecture)
TopOCR can read eleven different languages (English, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish) and is the fastest OCR engine on the planet!
It works on the principle of analyzing the shape of characters and using a high speed decision tree for classification.
If you have good quality images without a lot of character distortion or noise, TopOCR can be an effective choice, especially if you're running on a low-powered CPU.
TopOCR is also your best choice for extracting text from PDF files, where its high speed will allow you to read several pages per second on a fast PC!
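The idea of classifying characters by shape with a decision tree can be illustrated with a toy Python sketch. Everything in it is hypothetical: the two features (number of enclosed holes and ink width), the four-character alphabet and the function names are assumptions for illustration, not TopOCR's real features or tree.

```python
from collections import deque


def count_holes(glyph):
    """Count background regions fully enclosed by ink ('#') in a glyph."""
    height, width = len(glyph), len(glyph[0])
    # Pad with one background cell on every side so that all background
    # outside the character forms a single connected region.
    ink = [[False] * (width + 2)]
    ink += [[False] + [cell == "#" for cell in row] + [False] for row in glyph]
    ink += [[False] * (width + 2)]
    seen = [[False] * (width + 2) for _ in range(height + 2)]

    def flood(start_row, start_col):
        queue = deque([(start_row, start_col)])
        seen[start_row][start_col] = True
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < height + 2 and 0 <= nc < width + 2
                        and not ink[nr][nc] and not seen[nr][nc]):
                    seen[nr][nc] = True
                    queue.append((nr, nc))

    flood(0, 0)  # mark the background outside the character
    holes = 0
    for r in range(height + 2):
        for c in range(width + 2):
            if not ink[r][c] and not seen[r][c]:
                holes += 1  # an unvisited background region is a hole
                flood(r, c)
    return holes


def ink_width(glyph):
    """Width in columns of the part of the glyph that contains ink."""
    columns = [c for row in glyph for c, cell in enumerate(row) if cell == "#"]
    return max(columns) - min(columns) + 1


def classify(glyph):
    """A tiny hand-written decision tree over two shape features."""
    holes = count_holes(glyph)
    if holes >= 2:
        return "8"
    if holes == 1:
        return "0"
    return "1" if ink_width(glyph) == 1 else "7"


EIGHT = ["###", "#.#", "###", "#.#", "###"]
ZERO = ["###", "#.#", "#.#", "#.#", "###"]
ONE = [".#.", ".#.", ".#.", ".#.", ".#."]
SEVEN = ["###", "..#", "..#", "..#", "..#"]

labels = [classify(g) for g in (EIGHT, ZERO, ONE, SEVEN)]
```

Each classification needs only a few cheap feature tests, which is why this style of static classifier can be very fast. It also shows the weakness: if distortion closes or opens a hole, the tree takes the wrong branch.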
Tesseract OCR (LSTM Recurrent Neural Network Architecture)
The primary character classifier function in Tesseract OCR is based on an implementation of a Long Short-Term Memory (LSTM) neural network.
LSTM neural networks outperform all alternative neural network architectures for this type of pattern recognition and also outperform the more "classical" character recognition algorithms used by the top selling commercial OCR products.
For example, an LSTM network achieved the best known results in unsegmented connected handwriting recognition, and in 2009 won the ICDAR handwriting competition.
The accuracy of an LSTM network is heavily dependent on the training data.
The training data used in the new Tesseract LSTM included a significant number of degraded images produced by cameras.
To date, the Tesseract LSTM OCR engine is, by a very wide margin, the most accurate OCR engine we have ever tested with camera images.
If Tesseract's LSTM recognizer fails on a particular character sequence, it can fall back to its generic static shape classifier to make the determination.
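The fall-back behaviour described above can be pictured with a short Python sketch. How Tesseract actually decides that the LSTM recognizer has "failed" is not stated here, so the confidence threshold below is an assumption, and the function names and stand-in recognizers are hypothetical rather than part of any real Tesseract API.

```python
def recognize_with_fallback(line_image, primary, fallback, min_confidence=0.5):
    """Try the primary recognizer and use the fallback if it is not confident.

    Both recognizers are callables returning (text, confidence in 0..1).
    The threshold is an illustrative assumption, not a Tesseract setting.
    """
    text, confidence = primary(line_image)
    if confidence >= min_confidence:
        return text, "primary"
    text, _ = fallback(line_image)
    return text, "fallback"


# Stand-in recognizers so the sketch can run without any OCR library.
def fake_lstm(line_image):
    return ("he11o", 0.2) if line_image == "smudged" else ("hello", 0.9)


def fake_static(line_image):
    return ("hello", 0.6)


clean_result = recognize_with_fallback("clean", fake_lstm, fake_static)
smudged_result = recognize_with_fallback("smudged", fake_lstm, fake_static)
```

Returning the name of the recognizer that produced the text makes it easy to log how often the fallback path is taken.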
The amount of computation required for LSTM network character recognition is about 50 times greater than for character recognition performed using a static classifier.
To help speed up the processing, we use SSE instructions for the inner neural network calculations.
We have also achieved about a 3x overall increase in reading speed by making extensive use of multi-threading (running on multiple CPU cores) in the most CPU-intensive portions of the OCR and image processing functions.
On a standard desktop PC with a 4-core Intel 3.4GHz i7-6700 CPU, our implementation of Tesseract's LSTM neural network OCR engine takes about 5 seconds to read a 5.0 MP image, and TopOCR's image pre-processing (binarization, column straightening) adds about another second.
For comparison, one of the new 8-core Ryzen CPUs from AMD will read a page in under 3 seconds!
Because of the enormous performance improvement achieved by multi-threading, we recommend running TopOCR only on a 4-core or better CPU.
As 8-core CPUs become more mainstream, TopOCR will already be equipped to maximize performance on them.
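As a rough way to reason about how the roughly 3x gain might scale with core count, the following Python sketch applies Amdahl's law. It rests on assumptions that are not stated above: that the 3x figure was measured on 4 cores, and that clock speed, CPU architecture and simultaneous multi-threading can be ignored. Treat the numbers as back-of-the-envelope estimates, not benchmarks.

```python
def parallel_fraction(speedup, cores):
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / n), for the parallel share p."""
    return (1 - 1 / speedup) / (1 - 1 / cores)


def amdahl_speedup(fraction, cores):
    """Predicted speedup when `fraction` of the work runs on `cores` cores."""
    return 1 / ((1 - fraction) + fraction / cores)


# Assumption: the quoted 3x overall gain was observed on a 4-core CPU.
p = parallel_fraction(speedup=3.0, cores=4)
estimate_8_cores = amdahl_speedup(p, cores=8)

print(f"parallel share of the work: {p:.3f}")
print(f"estimated speedup on 8 cores: {estimate_8_cores:.2f}x")
```

Under these assumptions about eight ninths of the work runs in parallel, and doubling the core count to 8 would raise the speedup from 3x to about 4.5x rather than 6x, because the serial remainder does not shrink.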