## Evaluating a Learning Algorithm

### Deciding what to try next

- We know many learning algorithms
- But how do we choose which techniques to explore?
- Here we focus on deciding what avenues to try

#### Debugging a learning algorithm

- Suppose we have implemented regularized linear regression to predict housing prices

What should we try?

- Get more training data
- Try a smaller set of features
- Try getting additional features
- Building your own, new, better features
- Try decreasing or increasing \(\lambda\)

However, each of these changes can become a major project (6+ months)

### Evaluating a Hypothesis

The standard way to evaluate a hypothesis is to split the data into two portions:

Training set : Test set = 7 : 3

Compute the test set error

\(J_{test}(\theta)\) = average squared error as measured on the test set

- This works for linear regression and logistic regression
- For classification we can also define the misclassification error (0/1 misclassification):

\[
err(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}
\]

Then the test error is

\[
\text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\theta(x^{(i)}_{test}), y^{(i)}_{test})
\]
### Model selection and training validation test sets

Model selection problem

Choose the degree for a polynomial to fit data

Define d = degree of polynomial

- d = 1 (linear)
- d = 2 (quadratic)
- ...
- d = 10

Take each model and minimize the cost function with the training data; this generates a different parameter vector \(\theta\) for each degree

Compute \(J_{test}(\theta)\) for each model and see which has the lowest test set error

Using that model - how well does it generalize?

- BUT, this is going to be an optimistic estimate of generalization error, because the extra parameter (d) was fit to the test set (i.e. we specifically chose it because its test set error is small)
- So this is not a good way to evaluate generalization

We need to do something a bit different for model selection

Improved model selection

Split the data into three pieces

Training set : Cross validation set : Test set = 6 : 2 : 2

Then

- Minimize the cost function for each of the models as before, using the training set
- Test these hypotheses on the cross validation set to generate the cross validation error
- Pick the hypothesis with the lowest cross validation error

Finally

- Estimate generalization error of model using the test set
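The improved procedure can be sketched with synthetic data (the quadratic data set and the 60/20/20 split below are assumptions for illustration, not the course's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic data: y = x^2 + noise
x = rng.uniform(-3, 3, 100)
y = x ** 2 + rng.normal(0, 0.5, 100)

# 60 / 20 / 20 split into training / cross validation / test sets
idx = rng.permutation(len(x))
tr, cv, te = idx[:60], idx[60:80], idx[80:]

def mse(coeffs, xs, ys):
    # Average squared error of a polynomial fit on a data subset
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Minimize training error for each candidate degree d = 1..10
fits = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 11)}

# Select d with the lowest cross validation error ...
best_d = min(fits, key=lambda d: mse(fits[d], x[cv], y[cv]))
# ... and only then estimate generalization error on the untouched test set
print("chosen degree:", best_d, "test error:", mse(fits[best_d], x[te], y[te]))
```

The key design point is that the test set is touched exactly once, after d has been chosen, so the reported test error is an unbiased estimate.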

Final note

- In machine learning as practiced today, many people will select the model using the test set and then check that the model generalizes using the test error (which, as we've said, gives a biased analysis)
- With a MASSIVE test set this is maybe OK
- But it is considered much better practice to have separate validation and test sets

## Bias vs. Variance

### Diagnosis - bias vs. variance

Problems:

- High bias --> underfitting
- High variance --> overfitting

Now plot

x = degree of polynomial d

y = error for both training and cross validation (two lines)

CV error and test set error will be very similar

We want to minimize both errors

If cv error is high we're either at the high or the low end of d

- if d is too small --> this probably corresponds to a high bias problem
- if d is too large --> this probably corresponds to a high variance problem

For the high bias case, we find both cross validation and training error are high

- Doesn't fit the training data well
- Doesn't generalize either

For the high variance case, we find the cross validation error is high but the training error is low

- So we suffer from overfitting (training error is low, cross validation error is high)
- i.e. the training set fits well
- But it generalizes poorly
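These two signatures can be captured in a small helper; the error thresholds below are hypothetical and in practice would be tuned against a baseline such as human-level error:

```python
def diagnose(train_err, cv_err, baseline_err=0.1, gap=0.1):
    """Rough bias/variance diagnosis from training and CV error.

    baseline_err and gap are made-up thresholds for illustration.
    """
    if train_err > baseline_err and cv_err - train_err < gap:
        # Both errors high and close together: the model underfits
        return "high bias (underfitting)"
    if train_err <= baseline_err and cv_err - train_err >= gap:
        # Low training error but a large gap to CV error: overfitting
        return "high variance (overfitting)"
    return "looks OK"

print(diagnose(train_err=0.30, cv_err=0.32))  # high bias (underfitting)
print(diagnose(train_err=0.02, cv_err=0.25))  # high variance (overfitting)
```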

### Regularization and bias/variance

Consider fitting a high order polynomial with the regularized cost function (regularization is used to keep parameter values small):

\[
J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]
\]

Consider three cases

- λ = large
- All θ values are heavily penalized
- So most parameters end up close to zero
- So the hypothesis ends up being close to 0
- So high bias -> underfitting the data

- λ = intermediate
- Only intermediate values give a reasonable fit

- λ = small (e.g. λ = 0)
- The regularization term becomes 0
- So high variance -> overfitting (minimal regularization means it obviously doesn't do what it's meant to)

How to choose a good value of \(\lambda\)?

- Define the cross validation error and test set errors as before (i.e. without the regularization term)

- Choosing \(\lambda\)
- Create a list of lambdas (e.g. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24})
- Create a set of models with different degrees or any other variants
- Iterate through the λs and for each λ go through all the models to learn some Θ
- Compute the cross validation error using the learned Θ (computed with λ) on \(J_{CV}(\Theta)\) without regularization (i.e. with λ = 0)
- Select the best combo that produces the lowest error on the cross validation set
- Using the best combo Θ and λ, apply it on \(J_{test}(\Theta)\) to see if it generalizes well
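The λ-selection loop can be sketched with regularized linear regression in closed form (synthetic data; the λ list is the one from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta = np.array([1.0, 2.0, 0.0, 0.0, -1.0])

# Hypothetical training and cross validation sets
X_tr = rng.normal(size=(60, 5))
y_tr = X_tr @ true_theta + rng.normal(0, 0.3, 60)
X_cv = rng.normal(size=(20, 5))
y_cv = X_cv @ true_theta + rng.normal(0, 0.3, 20)

def ridge_fit(X, y, lam):
    # Regularized linear regression, closed form: (X'X + lam*I)^-1 X'y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def cv_error(theta, X, y):
    # J_CV is the plain average squared error, WITHOUT the regularization term
    return float(np.mean((X @ theta - y) ** 2) / 2)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
thetas = {lam: ridge_fit(X_tr, y_tr, lam) for lam in lambdas}
best_lam = min(lambdas, key=lambda lam: cv_error(thetas[lam], X_cv, y_cv))
print("best lambda:", best_lam)
```

Note that the regularization term appears only in the fit, never in the error used to compare λ values.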

### Learning curves

- What is a learning curve?
- Plot \(J_{train}\) (average squared error on the training set) and \(J_{cv}\) (average squared error on the cross validation set)
- Plot them against m (number of training examples)
- m is a constant for a given data set
- So artificially reduce m and recalculate the errors with the smaller training set sizes

- \(J_{train}\)
- Error on smaller sample sizes is smaller (as there is less variance to accommodate)
- So as m grows, the error grows

- \(J_{cv}\)
- Error on the cross validation set
- When you have a tiny training set you generalize badly
- But as the training set grows your hypothesis generalizes better
- So the cv error decreases as m increases
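A learning curve can be sketched by retraining on progressively larger slices of the data (the linear data below is synthetic, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear data: y = 2x + 1 + noise
x = rng.uniform(-3, 3, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, 100)
x_cv, y_cv = x[80:], y[80:]          # fixed cross validation set

sizes = range(2, 80, 5)
train_errs, cv_errs = [], []
for m in sizes:
    # Artificially train on only the first m examples
    coeffs = np.polyfit(x[:m], y[:m], 1)
    train_errs.append(float(np.mean((np.polyval(coeffs, x[:m]) - y[:m]) ** 2)))
    cv_errs.append(float(np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2)))

# Training error grows with m; CV error falls as the hypothesis generalizes better
print(train_errs[0], train_errs[-1], cv_errs[-1])
```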

What do these curves look like if you have

- High bias
- e.g. fitting a straight line to data that needs a curve
- \(J_{train}\)
- Training error is small at first and grows
- Training error becomes close to the cross validation error
- So the performance of the cross validation and training sets ends up being similar (but both very poor)

- \(J_{cv}\)
- A straight line fit is similar for a few vs. a lot of data
- So it doesn't generalize any better with lots of data, because the function just doesn't fit the data
- No increase in data will help it fit

- The telltale sign of high bias is that cross validation and training error are both high
- This also implies that if a learning algorithm has high bias, the cross validation error doesn't decrease as we get more examples
- So if an algorithm is already suffering from high bias, more data does not help
- So knowing if you're suffering from high bias is good!
- In other words, high bias is a problem with the underlying way you're modeling your data
- So more data won't improve that model
- It's too simplistic

High variance

- e.g. high order polynomial
- \(J_{train}\)
- When the training set is small, training error is small too
- As training set size increases, the value stays small
- But it slowly increases (in a near linear fashion)
- Error is still low

- \(J_{cv}\)
- Error remains high, even when you have a moderate number of examples
- Because the problem with high variance (overfitting) is that your model doesn't generalize

- An indicative diagnostic that you have high variance is a big gap between training error and cross validation error
- If a learning algorithm is suffering from high variance, more data is probably going to help

### What to do next Revisited

Our decision process can be broken down as follows:

- **Getting more training examples:** fixes high variance
- **Trying smaller sets of features:** fixes high variance
- **Adding features:** fixes high bias
- **Adding polynomial features:** fixes high bias
- **Decreasing λ:** fixes high bias
- **Increasing λ:** fixes high variance

#### Diagnosing Neural Networks

- A neural network with fewer parameters is **prone to underfitting**. It is also **computationally cheaper**.
- A large neural network with more parameters is **prone to overfitting**. It is also **computationally expensive**. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers and compare them on your cross validation set, then select the one that performs best.

**Model Complexity Effects:**

- Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
- Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
- In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

# Machine Learning System Design

## Building a Spam Classifier

### Prioritizing What to Work On

The idea of prioritizing what to work on is perhaps the most important skill programmers typically need to develop

#### One approach - choosing your own features

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word.

y = spam (1) or not spam (0)

Each entry is binary - 1 if the word appears in the email, 0 otherwise (don't recount a word that appears more than once)
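A minimal sketch of such a binary word-presence vector, using a tiny hypothetical vocabulary (a real classifier would use thousands of frequent words):

```python
import re

# Hypothetical vocabulary; in practice chosen as the most frequent words
vocabulary = ["buy", "now", "discount", "andrew", "deal"]

def email_to_features(email_text):
    # Set of lowercase words: presence only, repeats are ignored
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return [1 if w in words else 0 for w in vocabulary]

print(email_to_features("Buy now! Huge discount, buy today"))  # [1, 1, 1, 0, 0]
```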

#### How to spend your time to improve the accuracy of this classifier?

- Collect lots of data (for example the "honeypot" project, though this doesn't always work)
- Develop sophisticated features based on email routing information and message body (for example: using email header data in spam emails)
- Develop algorithms to process your input in different ways (recognizing misspellings in spam)
- Often a research group will randomly focus on one of these options

It is difficult to tell which of the options will be most helpful.

### Error Analysis

The recommended approach to solving machine learning problems is to:

- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data
- Plot learning curves to decide if more data, more features, etc. are likely to help
- This is a way of avoiding premature optimization
- We should let evidence guide decision making regarding development trajectory

- Manually examine the errors on examples in the cross validation set and try to spot a trend in where most of the errors were made
- See if you can work out why
- Systematic patterns can help design new features to avoid these shortcomings

- e.g.
- Built a spam classifier with 500 examples in the CV set
- Here, the error rate is high - it gets 100 wrong

- Manually look at those 100 errors and categorize them depending on features
- May find the most common types of spam are pharmacy emails and phishing emails
- See which type is most common - focus your work on those ones

- Ask what features would have helped classify them correctly
- e.g. deliberate misspelling
- Unusual email routing
- Unusual punctuation
- May find some "spammer technique" is causing a lot of your misses
- This can guide a way around it
- Importance of numerical evaluation
- Have a way of numerically evaluating the algorithm
- If you're developing an algorithm, it's really good to have some performance calculation which gives a single real number to tell you how well it's doing
- e.g.
- Say we're deciding if we should treat a set of similar words as the same word
- This is done by stemming in NLP (e.g. the "Porter stemmer" looks at the etymological stem of a word)
- This may make your algorithm better or worse
- Also worth considering a weighted error (false positive vs. false negative)
- e.g. is a false positive really bad, or is it worth having a few to improve performance a lot?

- Can use the numerical evaluation to compare the changes
- See if a change improves the algorithm or not

- A single real number may be hard/complicated to compute
- But it makes it much easier to evaluate how changes impact your algorithm

- You should do error analysis on the cross validation set instead of the test set

## Handling Skewed Data

### Error Metrics for Skewed Classes

- One case where it's hard to come up with a good error metric - skewed classes
- Example
- Cancer classification
- Train logistic regression model \(h_{\theta}(x)\) where
- Cancer means y = 1
- Otherwise y = 0

- Test classifier on test set
- Get 1% error
- So this looks pretty good...

- But only 0.5% of patients have cancer
- Now, 1% error looks very bad! (a classifier that always predicts y = 0 would get only 0.5% error)

- So when the number of examples in one class is very small, this is an example of skewed classes
- LOTS more of one class than another
- So standard error metrics aren't so good

- Another example
- Algorithm has 99.2% accuracy
- Make a change, now get 99.5% accuracy
- Does this really represent an improvement to the algorithm?
- Did we do something useful, or did we just create something which predicts y = 0 more often?
- We get very low error, but the classifier may still not be great

##### Precision and Recall

Two new metrics - **precision** and **recall**

Both give a value between 0 and 1

Evaluating classifier on a test set

For a test set, the actual class is 1 or 0

Algorithm predicts some value for class, predicting a value for each example in the test set

- Considering this, classification can be
- True positive (we guessed 1, it was 1)
- False positive (we guessed 1, it was 0)
- True negative (we guessed 0, it was 0)
- False negative (we guessed 0, it was 1)

Precision

How often does our algorithm cause a false alarm?

Of all patients we predicted to have cancer, what fraction of them actually have cancer?

- = true positives / # predicted positive
- = true positives / (true positive + false positive)

High precision is good (i.e. closer to 1)

- You want a big number, because you want false positive to be as close to 0 as possible

Recall

- How sensitive is our algorithm?
- Of all patients in set that actually have cancer, what fraction did we correctly detect
- = true positives / # actual positives
- = true positive / (true positive + false negative)

- High recall is good (i.e. closer to 1)
- You want a big number, because you want false negative to be as close to 0 as possible

By computing precision and recall we get a better sense of how an algorithm is doing

- This can't really be gamed
- Means we're much more sure that an algorithm is good

Typically we say the presence of a rare class is what we're trying to determine (e.g. positive (1) is the existence of the rare thing)
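As a quick sketch, precision and recall follow directly from the four counts above (the labels below are made-up example values):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the rare positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many we caught
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```

Note that an always-predict-0 classifier gets recall 0, so it can no longer look good under these metrics, which is why they can't really be gamed.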

### Trading Off Precision and Recall

For many applications we want to control the trade-off between precision and recall

Example

- Trained a logistic regression classifier
- Predict 1 if \(h_{\theta}(x) \ge 0.5\)
- Predict 0 if \(h_{\theta}(x) < 0.5\)

- This classifier may give some value for precision and some value for recall
- Predict 1 only if very confident
- One way to do this: modify the prediction threshold
- Predict 1 if \(h_{\theta}(x) \ge 0.8\)
- Predict 0 if \(h_{\theta}(x) < 0.8\)

- Now we can be more confident a 1 is a true positive
- But the classifier has lower recall - we predict y = 1 for a smaller number of patients
- Risk of false negatives

- Another example - avoid false negatives
- Missing true cases is probably worse in the cancer example
- Now we may set a lower threshold
- Predict 1 if \(h_{\theta}(x) \ge 0.3\)
- Predict 0 if \(h_{\theta}(x) < 0.3\)

- i.e. predict cancer if there's a 30% chance they have it
- So now we have higher recall, but lower precision
- Risk of false positives, because we're less discriminating in deciding what means the person has cancer
This threshold defines the trade-off

We can show this graphically by plotting precision vs. recall

This curve can take many different shapes depending on classifier details

Is there a way to automatically choose the threshold?

- Or, if we have a few algorithms, how do we compare different algorithms or parameter sets?

How do we decide which of these algorithms is best?

- We spoke previously about using a single real number evaluation metric
- By switching to precision/recall we have two numbers
- Now comparison becomes harder
- Better to have just one number

- How can we convert P & R into one number?
- One option is the average - (P + R)/2
- This is not such a good solution
- A classifier which predicts y = 1 all the time gets high recall and low precision
- Similarly, if we predict y = 1 rarely we get high precision and low recall
- e.g. for three classifiers with (P, R) = (0.5, 0.4), (0.7, 0.1) and (0.02, 1.0), the averages are 0.45, 0.4 and 0.51
- 0.51 is "best", despite coming from a classifier with recall of 1 - i.e. one that predicts y = 1 for everything
- So the average isn't great

- **F1 Score** (**F score**)
- \(F_1 = 2\frac{PR}{P + R}\)
- The F score is like an average of precision and recall that gives a higher weight to the lower value
- Many formulas exist for combining precision/recall into comparable values
- If P = 0 or R = 0 then the F score = 0
- If P = 1 and R = 1 then the F score = 1
- All other values lie between 0 and 1
The threshold offers a way to control the trade-off between precision and recall

The F score gives a single real number evaluation metric

- If you're trying to automatically set the threshold, one way is to try a range of threshold values and evaluate each on your cross validation set
- Then pick the threshold which gives the best F score
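This threshold sweep can be sketched as follows (the predicted probabilities and labels are made-up example values):

```python
def f1(p, r):
    # F1 score; defined as 0 when both precision and recall are 0
    return 2 * p * r / (p + r) if p + r else 0.0

def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Hypothetical h_theta(x) outputs and labels on a cross validation set
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
y_cv  = [1,   1,   0,   1,   0,   0,   0,   1]

best_t, best_f1 = 0.5, -1.0
for t in [i / 10 for i in range(1, 10)]:
    preds = [1 if pr >= t else 0 for pr in probs]
    p, r = precision_recall(y_cv, preds)
    if f1(p, r) > best_f1:
        best_t, best_f1 = t, f1(p, r)
print(best_t, round(best_f1, 3))  # 0.4 0.889
```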

## Using Large Data Sets

### Data for Machine Learning

In earlier videos we cautioned against just blindly getting more data. It turns out that under certain conditions getting more data is a very effective way to improve performance.

- Algorithms give remarkably similar performance
- As training set size increases, accuracy increases
- Take an algorithm, give it more data, and it should beat a "better" one trained with less data
- This shows that
- Algorithm choice matters less than you might expect
- More data helps

We use a learning algorithm with many parameters, such as logistic regression or linear regression with many features, or a neural network with many hidden units.

- These are powerful learning algorithms with many parameters which can fit complex functions
- Such algorithms are low bias algorithms
- Little systemic bias in their description - flexible

- Such algorithms are low bias algorithms
- Use a small training set
- Training error should be small

- Use a very large training set
- If the training set error is close to the test set error
- Unlikely to over fit with our complex algorithms
- So the test set error should also be small

Another way to think about this is we want our algorithm to have low bias and low variance.

- Low bias --> use complex algorithm
- Low variance --> use large training set