# Machine Learning Week 11: Application Example - Photo OCR

## Photo OCR

### Problem Description and Pipeline

#### What is photo OCR problem?

• Photo OCR = photo optical character recognition
• With growth of digital photography, lots of digital pictures
• One idea which has interested many people is getting computers to understand those photos
• The photo OCR problem is getting computers to read text in an image
• Possible applications for this would include
• Make searching easier (e.g. searching for photos based on words in them)
• OCR of documents is a comparatively easy problem
• From photos it's really hard

### OCR pipeline

1. Look through image and find text
2. Do character segmentation
3. Do character classification
4. Optionally, some systems do spell check after this too
• We're not focusing on such systems though

• Pipelines are common in machine learning
• Separate modules which may each be a machine learning component or data processing component
• If you're designing a machine learning system, pipeline design is one of the most important questions
• The performance of the pipeline, and of each module, often has a big impact on the overall performance of the system
• You would often have different engineers working on each module
• Offers a natural way to divide up the workload
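
As a rough sketch, the pipeline is just function composition; the stubs below are hypothetical stand-ins for the three trained modules, not real implementations:

```python
# Hypothetical stubs standing in for the three trained modules:
def detect_text(image):
    return [image]                  # pretend the whole image is one text region

def segment_characters(region):
    return [region, region]         # pretend we found two character patches

def classify_character(patch):
    return "a"                      # pretend every patch is the letter 'a'

def photo_ocr(image):
    """Run the three-stage Photo OCR pipeline described above."""
    words = []
    for region in detect_text(image):           # 1. text detection
        chars = segment_characters(region)      # 2. character segmentation
        word = "".join(classify_character(c) for c in chars)  # 3. character classification
        words.append(word)
    return words

print(photo_ocr(object()))          # ['aa']
```

Because each stage depends only on the previous stage's output, separate teams can own separate functions.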

### Sliding Windows

• As mentioned, stage 1 is text detection

• Unusual problem in computer vision - different rectangles (which surround text) may have different aspect ratios (aspect ratio being height : width)
• Text may be short (few words) or long (many words)
• Tall or short font
• Text might be straight on
• Slanted

#### Pedestrian detection

• Building our detection system
• Have 82 x 36 aspect ratio

• This is a typical aspect ratio for a standing human
• Collect training set of positive and negative examples

• Could have 1000 - 10 000 training examples

• Train a neural network to take an image and classify that image as pedestrian or not

• Gives you a way to train your system
• Now we have a new image - how do we find pedestrians in it?

• Start by taking a rectangular 82 x 36 patch in the image

![](https://bucket-1258741719.cos.ap-beijing.myqcloud.com/Machine-Learning-Week11-Application-Example-Photo-OCR/167.png)
• Run patch through classifier - hopefully in this example it will return y = 0
• Next slide the rectangle over to the right a little bit and re-run

• Then slide again
• The amount you slide each rectangle over is a parameter called the step-size or stride
• Could use 1 pixel
• Best, but computationally expensive
• More commonly 5-8 pixels used
• So, keep stepping rectangle along all the way to the right
• Eventually get to the end
• Then move back to the left hand side but step down a bit too
• Repeat until you've covered the whole image
• Now, we initially started with quite a small rectangle

• So now we can take a larger image patch (of the same aspect ratio)
• Each time we process an image patch, we resize the larger patch back down to 82 x 36, then run that resized patch through the classifier
• Hopefully, by changing the patch size and rastering repeatedly across the image, you eventually detect all the pedestrians in the picture (a sketch of this scan follows)
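
A minimal sketch of this multi-scale sliding-window scan, assuming a hypothetical `classify_patch` function (the trained classifier from above) that returns 1 when an 82 x 36 patch contains a pedestrian:

```python
import numpy as np

PATCH_H, PATCH_W = 82, 36    # pedestrian patch size from the notes
STRIDE = 4                   # step size / stride in pixels

def shrink(patch, out_h, out_w):
    """Nearest-neighbour resize; a real system would use a proper resampler."""
    rows = np.arange(out_h) * patch.shape[0] // out_h
    cols = np.arange(out_w) * patch.shape[1] // out_w
    return patch[rows][:, cols]

def sliding_window_detect(image, classify_patch, scales=(1.0, 1.5, 2.0)):
    """Raster a rectangle over `image` at several sizes (same aspect ratio);
    return the (top, left, h, w) boxes the classifier fires on."""
    detections = []
    for s in scales:
        h, w = int(PATCH_H * s), int(PATCH_W * s)
        for top in range(0, image.shape[0] - h + 1, STRIDE):
            for left in range(0, image.shape[1] - w + 1, STRIDE):
                patch = image[top:top + h, left:left + w]
                if classify_patch(shrink(patch, PATCH_H, PATCH_W)) == 1:
                    detections.append((top, left, h, w))
    return detections
```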

#### Text detection example

• Like pedestrian detection, we generate a labeled training set with
• Positive examples (some kind of text)
• Negative examples (not text)

• Having trained the classifier we apply it to an image

• So, run a sliding window classifier at a fixed rectangle size

• If you do that, you end up with something like this

• White regions show where the text detection system thinks text is

• Different shades of gray correspond to the classifier's confidence that a region contains text
• Black - no text
• White - text
• For text detection, we want to draw rectangles around all the regions where there is text in the image

• Take classifier output and apply an expansion algorithm

• Takes each of the white regions and expands it

• How do we implement this? One simple approach (sketched after this list):

• Say, for every pixel, is it within some distance of a white pixel?
• If yes then colour it white

• Look at connected white regions in the image above

• Draw rectangles around those which make sense as text (e.g. tall thin boxes don't make sense for horizontal text)

• This example misses a piece of text on the door because the aspect ratio is wrong
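
A minimal sketch of the expansion step, assuming the classifier output is a binary numpy mask (1 = text); this is essentially morphological dilation, and in practice a library routine such as `scipy.ndimage.binary_dilation` would do the same job:

```python
import numpy as np

def expand_white_regions(mask, distance=2):
    """For every white pixel, paint the square of pixels within `distance`
    of it white - equivalently, a pixel becomes white if it lies within
    `distance` of some white pixel."""
    out = np.zeros_like(mask)
    for y, x in zip(*np.nonzero(mask)):
        out[max(y - distance, 0):y + distance + 1,
            max(x - distance, 0):x + distance + 1] = 1
    return out
```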

#### Stage two is character segmentation

• Use supervised learning algorithm
• Look in a defined image patch and decide, is there a split between two characters?
• So, for example, our first training data item below looks like there is such a split
• Similarly, the negative examples are either empty or contain a full character

• We train a classifier to try and classify between positive and negative examples

• Run that classifier on the regions detected as containing text in the previous section
• Use a 1-dimensional sliding window to move along text regions

• Does each window snapshot look like the split between two characters?
• If yes insert a split
• If not move on
• So we have something that looks like this (a sketch of the 1-D scan follows)
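
A minimal sketch of that 1-D scan, assuming a hypothetical `is_split` classifier trained on the positive/negative examples described above:

```python
def find_splits(text_region, is_split, window_w=15, stride=2):
    """Slide a fixed-width window left-to-right across a detected text region
    and record the x-positions where the classifier sees a gap between
    two characters."""
    h, w = text_region.shape
    splits = []
    for left in range(0, w - window_w + 1, stride):
        window = text_region[:, left:left + window_w]
        if is_split(window):
            splits.append(left + window_w // 2)   # split at the window centre
    return splits
```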

#### Character classification

• Standard OCR: apply supervised learning that takes an image patch as input and identifies which character it contains

• This is a multi-class classification problem
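
A minimal sketch of this stage using scikit-learn with stand-in random data (in the real pipeline the inputs would be the segmented character patches):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: 500 flattened 20 x 20 grayscale patches, with labels drawn
# from a 36-character alphabet (26 letters + 10 digits).
rng = np.random.default_rng(0)
alphabet = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
X = rng.random((500, 400))
y = rng.choice(alphabet, size=500)

clf = LogisticRegression(max_iter=1000)   # multinomial logistic regression
clf.fit(X, y)
print(clf.predict(X[:5]))                 # predicted characters
```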

### Getting Lots of Data and Artificial Data

• Two main approaches to getting lots of data

• Creating data from scratch
• Amplifying a small labeled training set we already have into a larger one

#### Character recognition as an example of data synthesis

• If we go and collect a large labeled data set, it will look like this
• The goal is to take an image patch and have the system recognize the character
• Let's treat the images as gray-scale (makes it a bit easier)

• How can we amplify this?

• Modern computers often have a big font library
• If you go to websites, huge free font libraries
• For more training data, take characters from different fonts and paste these characters against random backgrounds
• After some work, can build a synthetic training set

• Random background

• Maybe some blurring/distortion filters

• Takes thought and work to make it look realistic

• If you do a sloppy job this won't help!

• This gives an essentially unlimited supply of training examples
• This is an example of creating new data from scratch (sketched below)
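
A minimal sketch of that font-based synthesis, assuming Pillow is installed and `font_path` points to some local .ttf file (a hypothetical path):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def synthesize_char(char, font_path, size=32, rng=None):
    """Render `char` from a real font over a random noisy background,
    producing one synthetic labeled example (image, char)."""
    rng = rng or np.random.default_rng()
    background = rng.integers(80, 200, (size, size), dtype=np.uint8)
    img = Image.fromarray(background)                # grayscale image
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size - 8)   # font_path is assumed
    draw.text((4, 4), char, fill=0, font=font)
    return np.asarray(img), char
```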
• Other way is to introduce distortion into existing data

• e.g. take a character and warp it

![](https://bucket-1258741719.cos.ap-beijing.myqcloud.com/Machine-Learning-Week11-Application-Example-Photo-OCR/161.png)
• 16 new examples
• Allows us to amplify the existing training set
• This, again, takes thought and insight in deciding how to distort (a sketch of such warping follows)
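
A minimal sketch of distortion-based amplification on one grayscale character patch (a numpy array); each smooth sideways warp yields a new labeled example:

```python
import numpy as np

def warp_copies(patch, n_copies=16, seed=0):
    """Create warped copies of one character patch by shifting each row
    sideways by a smooth, random sine-shaped amount."""
    rng = np.random.default_rng(seed)
    h, _ = patch.shape
    copies = []
    for _ in range(n_copies):
        amp = rng.uniform(0.5, 2.5)            # warp strength in pixels
        phase = rng.uniform(0, 2 * np.pi)
        out = np.empty_like(patch)
        for row in range(h):
            shift = int(round(amp * np.sin(2 * np.pi * row / h + phase)))
            out[row] = np.roll(patch[row], shift)
        copies.append(out)
    return copies                              # e.g. 16 new examples from one
```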

#### Another example: speech recognition

• Learn from audio clip - what were the words
• Have a labeled training example
• Introduce audio distortions into the examples
• So only took one example
• Created lots of new ones!
• The distortions introduced should be representative of the types of noise/distortion the classifier will encounter in practice; adding purely random, meaningless noise usually does not help

#### Getting more data

• Before creating new data, make sure you have a low bias classifier

• Plot a learning curve to check
• If the classifier is not low bias, increase the number of features first

• Then create large artificial training set
• Very important question: How much work would it be to get 10x data as we currently have?

• Often the answer is, "Not that hard"
• This is often a huge way to improve an algorithm
• Work out how many minutes/hours it takes to get a certain number of examples

• Say we have 1,000 examples
• 10 seconds to label a new example
• To get 10x the data we need another 9,000 examples, i.e. 90,000 seconds of labeling
• Comes to about 25 hours - a few days of work (a quick check of this arithmetic follows the list)
• Crowd sourcing is also a good way to get data

• Risk of reliability issues
• Cost

• E.g. Amazon Mechanical Turk
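
A quick check of the labeling-time estimate above, assuming 10 seconds per hand-labeled example:

```python
current = 1_000                       # examples we already have
target = 10 * current                 # aiming for 10x the data
seconds_per_label = 10
hours = (target - current) * seconds_per_label / 3600
print(hours)                          # 25.0 hours to hand-label the extra 9,000 examples
```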

### Ceiling Analysis: What Part of the Pipeline to Work On Next

#### Photo OCR pipeline

• Three modules
• Each one could have a small team on it
• Where should you allocate resources?
• Good to have a single real number as an evaluation metric
• So, character accuracy for this example
• Say we find that accuracy on our test set is 72%

#### Ceiling analysis on our pipeline

• We go to the first module

• Mess around with the test set - manually tell the algorithm where the text is
• Simulate if your text detection system was 100% accurate

• So we're feeding the character segmentation module with 100% accurate data now
• How does this change the accuracy of the overall system?

• Accuracy goes up to 89%
• Next do the same for the character segmentation

• Accuracy goes up to 90% now
• Finally, do the same for character recognition

• Goes up to 100%
• Having done this, we can quantitatively show what the upside of improving each module would be

• Perfect text detection improves accuracy by 17%!
• Would bring the biggest gain if we could improve
• Perfect character segmentation would improve it by 1%
• Not worth working on
• Perfect character recognition would improve it by 10%
• Might be worth working on, depends if it looks easy or not
• The "ceiling" is that each module has a ceiling by which making it perfect would improve the system overall

#### Other example - face recognition

• NB this is not how it's done in practice

• Probably more complicated than is used in practice
• How would you do ceiling analysis for this?

• Overall system is 85%
• + Perfect background -> 85.1%
• Not a crucial step
• + Perfect face detection -> 91%
• Most important module to focus on
• + Perfect eyes -> 95%
• + Perfect nose -> 96%
• + Perfect mouth -> 97%
• + Perfect logistic regression -> 100%
• Cautionary tale

• Two engineers spent 18 months improving background pre-processing
• Turns out it had no impact on overall performance
• Could have saved three years of man power if they'd done ceiling analysis