The Photo OCR Problem

Photo OCR stands for Photo Optical Character Recognition. For example, given the following picture, how can we detect where the text is and read what it says?

Other applications include navigation aids for blind people and for cars.

Photo OCR pipeline

To solve this problem we need a pipeline, i.e. a sequence of modules, some of which involve machine learning. Here the pipeline has three stages: text detection, character segmentation, and character classification. In a team, different developers often work on different components. The performance of each component can have a big impact on the final performance of the whole system.

Some systems do more at the end, such as spelling correction, e.g. c1eaning -> cleaning (the digit 1 looks like the letter l).
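
As a rough illustration of such a correction step, here is a minimal sketch that snaps a recognized word to the closest entry in a word list. The vocabulary, the `correct` helper, and the 0.8 cutoff are all assumptions made for this example; a real system would use a much larger dictionary and probably a language model.

```python
import difflib

# Hypothetical vocabulary, only for illustration.
VOCABULARY = ["cleaning", "clearing", "open", "closed"]


def correct(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the input if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


fixed = correct("c1eaning")   # the digit 1 was read instead of the letter l
unknown = correct("xyz")      # nothing similar, so the word is left alone
```

Here `fixed` is "cleaning" and `unknown` stays "xyz".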

Text detection is similar to pedestrian detection. The difference is that pedestrians have a nearly constant aspect ratio (width to height), so one sliding-window shape works for all of them, whereas text regions vary in both size and aspect ratio.

For pedestrian detection, the first step is to collect enough training examples (both positive and negative) at a fixed size and aspect ratio, then use them to train a classifier (e.g. a neural network).

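Then apply the trained model to test images. In the following picture, a sliding window steps over the photo, and the classifier labels each patch as pedestrian or not. The pass is usually repeated with larger windows, with each patch resized to the training size before it is classified.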
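
The sliding-window loop can be sketched as follows. This is a toy version under stated assumptions: the image is a plain list of rows, and `mostly_bright` is a stand-in for the trained classifier, not a real pedestrian model.

```python
def sliding_windows(image, win_h, win_w, stride):
    """Yield (top, left, patch) for every window position in the image."""
    rows, cols = len(image), len(image[0])
    for top in range(0, rows - win_h + 1, stride):
        for left in range(0, cols - win_w + 1, stride):
            patch = [row[left:left + win_w] for row in image[top:top + win_h]]
            yield top, left, patch


def detect(image, classify, win_h, win_w, stride):
    """Return the top-left corner of every patch the classifier flags."""
    return [(top, left)
            for top, left, patch in sliding_windows(image, win_h, win_w, stride)
            if classify(patch)]


def mostly_bright(patch):
    """Stand-in classifier (an assumption): positive if the patch is mostly bright."""
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels) > 0.5


toy_image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
hits = detect(toy_image, mostly_bright, win_h=2, win_w=2, stride=1)
```

On this toy image only the window at row 1, column 2 is flagged, so `hits` is `[(1, 2)]`. A smaller stride checks more patches and costs more computation.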

A text detection model is trained the same way, with positive examples that contain characters and negative examples that do not.

Run the sliding window over the image and draw the result as a map: white marks where the classifier thinks it has found text, and the shades of grey correspond to the probability output by the classifier.

Then we apply what is called an expansion operator to the classifier output. It grows each white region so that nearby regions merge into one. Concretely, every pixel within a certain distance of a white pixel is colored white.
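
A minimal sketch of this operator, assuming a binary mask (1 for white, 0 for black) and a square neighbourhood whose `radius` is a free parameter chosen for this example. In image processing this kind of operation is commonly called dilation.

```python
def expand(mask, radius):
    """Color a pixel white if any white pixel lies within `radius` rows and columns of it."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
            c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
            if any(mask[i][j] for i in range(r0, r1) for j in range(c0, c1)):
                out[r][c] = 1
    return out


mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
still_apart = expand(mask, radius=1)  # column 3 stays black
merged = expand(mask, radius=2)       # the two regions now touch
```

With `radius=1` the two white pixels grow but remain separate; with `radius=2` they merge into one region, which is the effect we want for characters of the same word.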

We discard white regions that are tall and thin (a high height-to-width ratio), since lines of text are usually wider than they are tall. Then we draw bounding boxes around the remaining white regions.
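
One possible way to implement this step is to find connected white regions, take the bounding box of each, and filter by aspect ratio. The sketch below assumes 4-connectivity and a hypothetical threshold `max_height_to_width`; neither comes from the original description.

```python
from collections import deque


def bounding_boxes(mask):
    """Return (top, left, bottom, right) for each 4-connected white region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over the region that contains (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                i, j = queue.popleft()
                top, bottom = min(top, i), max(bottom, i)
                left, right = min(left, j), max(right, j)
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < rows and 0 <= nj < cols
                            and mask[ni][nj] and not seen[ni][nj]):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            boxes.append((top, left, bottom, right))
    return boxes


def keep_text_like(boxes, max_height_to_width=1.0):
    """Drop boxes that are taller than the allowed height-to-width ratio."""
    kept = []
    for top, left, bottom, right in boxes:
        height, width = bottom - top + 1, right - left + 1
        if height / width <= max_height_to_width:
            kept.append((top, left, bottom, right))
    return kept


mask = [
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
boxes = bounding_boxes(mask)
text_boxes = keep_text_like(boxes)
```

The wide region on the left is kept, while the one-pixel-wide column on the right is discarded as too tall and thin.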

Text detection in natural images is still a difficult problem. For example, text on transparent windows is hard to detect.

Next, we need to segment the characters in each bounding box. So we train another classifier to detect the gap between two adjacent characters.

Then slide a window along the box (in one dimension this time), detect all the gaps, and split the text into characters there.
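
The one-dimensional version can be sketched like this. The `blank_center` function is a stand-in for the trained gap classifier (an assumption for this example); it simply checks whether the middle column of the window contains no ink.

```python
def find_splits(strip, win_w, is_gap):
    """Slide a window one column at a time and return the center column
    of every window that the classifier flags as a gap."""
    cols = len(strip[0])
    splits = []
    for left in range(cols - win_w + 1):
        patch = [row[left:left + win_w] for row in strip]
        if is_gap(patch):
            splits.append(left + win_w // 2)
    return splits


def blank_center(patch):
    """Stand-in gap classifier: the middle column of the patch has no ink."""
    mid = len(patch[0]) // 2
    return all(row[mid] == 0 for row in patch)


def split_characters(strip, splits):
    """Cut the strip at the split columns and drop pieces that contain no ink."""
    pieces, start = [], 0
    for cut in sorted(splits) + [len(strip[0])]:
        piece = [row[start:cut] for row in strip]
        if any(any(row) for row in piece):
            pieces.append(piece)
        start = cut + 1
    return pieces


# A toy bounding box holding three "characters" separated by blank columns.
strip = [
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 1],
]
splits = find_splits(strip, win_w=3, is_gap=blank_center)
pieces = split_characters(strip, splits)
```

On this strip the gaps are found at columns 2, 4 and 5, and the box is split into three character patches that can be passed to the character classifier.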

The final step is character classification, which is a standard supervised learning problem and relatively easy to implement.

Artificial data synthesis

There are two ways to synthesize data: first, generate new examples from scratch; second, augment existing examples.

As an example of the first way, take the different fonts on your computer, render characters with them, and paste them onto different backgrounds. We need to make sure the synthetic data looks similar to real data.
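
The idea can be sketched without any font library by using tiny hand-written bitmaps in place of real fonts. Everything here is a toy assumption: the `GLYPHS` table, the 5x5 image size, and the random low-intensity background. A real pipeline would render actual font files onto patches of real images.

```python
import random

# Hypothetical 3x3 "font": '#' marks ink. Real fonts would be rendered instead.
GLYPHS = {
    "T": ["###", ".#.", ".#."],
    "L": ["#..", "#..", "###"],
}


def synthesize(char, size=5, seed=None):
    """Paste the glyph for `char` at a random position on a random background."""
    rng = random.Random(seed)
    glyph = GLYPHS[char]
    # Background: random dark-ish noise standing in for a real image patch.
    image = [[rng.uniform(0.0, 0.3) for _ in range(size)] for _ in range(size)]
    top = rng.randint(0, size - len(glyph))
    left = rng.randint(0, size - len(glyph[0]))
    for r, line in enumerate(glyph):
        for c, mark in enumerate(line):
            if mark == "#":
                image[top + r][left + c] = 1.0
    return image, char


image, label = synthesize("T", seed=1)
```

Because the label is known by construction, every synthetic image comes with a correct label for free, which is the main attraction of this approach.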

For the second way, we amplify the existing data by applying distortions to each example.

We need to make sure the distortions are reasonable, i.e. they resemble variations we expect to see in real data rather than arbitrary changes.
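
As a small illustration, the sketch below produces extra examples from one labeled image by shifting it a pixel or so in a random direction. Small translations are just one assumed choice of "reasonable" distortion; the helper names and the `max_shift` parameter are made up for this example.

```python
import random


def shift(image, dx, dy, fill=0.0):
    """Translate the image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    rows, cols = len(image), len(image[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dy, c + dx
            if 0 <= nr < rows and 0 <= nc < cols:
                out[nr][nc] = image[r][c]
    return out


def augment(image, label, copies, max_shift=1, seed=0):
    """Return `copies` randomly shifted versions of the image, all with the same label."""
    rng = random.Random(seed)
    examples = []
    for _ in range(copies):
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        examples.append((shift(image, dx, dy), label))
    return examples


glyph = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
moved_right = shift(glyph, dx=1, dy=0)
extra = augment(glyph, label="dot", copies=4)
```

The label is carried over unchanged, which is only valid when the distortion does not change what the example depicts.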

Note: Make sure you have a low-bias classifier before expending the effort to collect more training data. You can keep increasing the number of features, or the number of hidden units in a neural network, until you have a low-bias classifier.

Ceiling analysis: What part of the pipeline to work on next

Ceiling analysis helps you focus your effort on the components of the pipeline where an improvement would raise overall performance the most.

It estimates the error attributable to each component by measuring an upper bound: how well the whole system would perform if that component were perfect.

For example, we do text detection manually, i.e. feed the rest of the pipeline the correct text locations for each photo, and measure how much the overall performance improves. Then we also supply perfect output for the second component and measure the further improvement, and so on through the pipeline.
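
The bookkeeping is simple arithmetic, sketched below. The accuracy figures are invented purely for illustration and are not measurements from any real system: each entry is the overall accuracy when that stage and all earlier stages are replaced by perfect, hand-made output.

```python
def ceiling_analysis(baseline, stages):
    """Return the gain in overall accuracy obtained by making each stage perfect,
    given the cumulative accuracies in pipeline order."""
    gains = []
    previous = baseline
    for name, accuracy in stages:
        gains.append((name, round(accuracy - previous, 4)))
        previous = accuracy
    return gains


# Hypothetical numbers, chosen only to show the calculation.
baseline = 0.72  # overall accuracy of the current system
stages = [
    ("text detection", 0.89),          # perfect text detection
    ("character segmentation", 0.90),  # plus perfect segmentation
    ("character recognition", 1.00),   # plus perfect recognition
]
gains = ceiling_analysis(baseline, stages)
best = max(gains, key=lambda gain: gain[1])
```

With these made-up numbers, perfect text detection would add 0.17 to overall accuracy, perfect segmentation only 0.01 more, and perfect recognition a further 0.10, so text detection would be the component most worth working on.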

Finally, note that the sliding window is not a very efficient algorithm (it is computationally expensive). There are now detectors built on Convolutional Neural Networks (CNNs), such as R-CNN and YOLO, that make object detection much more efficient. I will write some related articles in the future.