Machine Learning Week 6: Advice for Applying Machine Learning

Evaluating a Learning Algorithm

Deciding what to try next

• We know many learning algorithms
• But how do we choose among them, and among the various techniques for improving them?
• Here we focus on deciding what avenues to try

Debugging a learning algorithm

• Suppose we have implemented regularized linear regression to predict housing prices

• What should we try?

• Get more training data
• Try a smaller set of features
• Try building your own, new, better features
• Try decreasing or increasing $\lambda$

However, each of these changes can become a major project (6+ months)

Evaluating a Hypothesis

• The standard way to evaluate a hypothesis is to:

• Split data into two portions

Training Set : Test Set = 7 : 3

• Compute the test set error

• Jtest(θ) = average squared error as measured on the test set: $J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2$

• This works for linear regression; for logistic regression we can also define the misclassification error (0/1 misclassification)

• $err(h_\theta(x), y) = 1$ if the prediction is wrong (i.e. $h_\theta(x) \ge 0.5$ when $y = 0$, or $h_\theta(x) < 0.5$ when $y = 1$), and 0 otherwise

• Then the test error is $\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\theta(x_{test}^{(i)}), y_{test}^{(i)})$
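The split-and-evaluate procedure can be sketched in Python. This is a minimal illustration with numpy: the synthetic housing-style data, the random seed, and the exact 70/30 shuffle are assumptions, not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic housing-style data: 100 examples, one feature
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Shuffle, then split 70% training / 30% test
idx = rng.permutation(len(X))
split = int(0.7 * len(X))
train, test = idx[:split], idx[split:]

# Fit linear regression by least squares on the training portion only
Xb = np.c_[np.ones(len(X)), X]  # add intercept column
theta, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)

# J_test(theta): average squared error measured on the held-out test set
J_test = np.mean((Xb[test] @ theta - y[test]) ** 2) / 2
print(theta, J_test)
```

Shuffling before splitting matters: if the data has any ordering (e.g. by price), a naive first-70% split would bias both portions.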

Model selection and training validation test sets

• Model selection problem

• Choose the degree for a polynomial to fit data

• Define d = degree of polynomial

• d = 1 (linear)
• ...
• d = 10
• Take each model and minimize the training error; this generates a parameter vector, so each model gives a different $\theta$

• Compute Jtest($\theta$)

• See which model has the lowest test set error

• Use that model - but how well does it generalize?

• BUT, this gives an optimistic estimate of generalization error, because the extra parameter d has been fit to the test set (i.e. we specifically chose the model because its test set error was small)
• So this is not a good way to evaluate how well the model will generalize
• We need to do something a bit different for model selection

• Improved model selection

• Split the data into three pieces

Training set : Cross validation : Test set = 6 : 2 : 2

• Calculate

• Minimize cost function for each of the models as before

• Test these hypotheses on the cross validation set to generate the cross validation error

• Pick the hypothesis with the lowest cross validation error

• Finally

• Estimate generalization error of model using the test set
• Final note

• In machine learning as practiced today, many people select the model using the test set and then check that the model generalizes using the test error (which, as we've said, is bad practice because it gives a biased estimate)
• With a MASSIVE test set this is maybe OK
• But considered much better practice to have separate training and validation sets
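The improved model selection procedure might look like this. A hedged sketch: the quadratic data, the 60/20/20 split, and using `np.polyfit` as the training step are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data generated from a quadratic, plus noise
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x + x**2 + rng.normal(0, 0.5, size=200)

# Split 60% training / 20% cross validation / 20% test
idx = rng.permutation(len(x))
tr, cv, te = idx[:120], idx[120:160], idx[160:]

def err(theta, rows):
    """Average squared error of a polynomial fit on the given rows."""
    return np.mean((np.polyval(theta, x[rows]) - y[rows]) ** 2) / 2

# Minimize on the training set for each degree d = 1..10
models = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 11)}

# Select d using the cross validation error, NOT the test error
best_d = min(models, key=lambda d: err(models[d], cv))

# Only now touch the test set, to estimate generalization error
print(best_d, err(models[best_d], te))
```

The key design point is that the test set plays no part in choosing d, so the final error estimate stays unbiased.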

Bias vs. Variance

Diagnosis - bias vs. variance

• Problems:

• High bias → underfitting
• High variance → overfitting

• Now plot

• x = degree of polynomial d

• y = error for both training and cross validation (two lines)

• CV error and test set error will be very similar

• We want to minimize both errors

• If cv error is high we're either at the high or the low end of d

• if d is too small --> this probably corresponds to a high bias problem
• if d is too large --> this probably corresponds to a high variance problem
• For the high bias case, we find both cross validation and training error are high

• Doesn't fit training data well
• Doesn't generalize either
• For high variance, we find the cross validation error is high but training error is low

• So we suffer from over fitting (training is low, cross validation is high)
• i.e. training set fits well
• But generalizes poorly
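A small numeric illustration of this diagnosis, with assumed cubic data; degrees 1, 3 and 10 stand in for "d too small", "about right" and "d too large".

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic noisy cubic data
x = rng.uniform(-2, 2, size=150)
y = x**3 - x + rng.normal(0, 0.3, size=150)

idx = rng.permutation(len(x))
tr, cv = idx[:100], idx[100:]

def errors(deg):
    """Fit a degree-`deg` polynomial on the training set;
    return (J_train, J_cv)."""
    theta = np.polyfit(x[tr], y[tr], deg)
    j = lambda rows: np.mean((np.polyval(theta, x[rows]) - y[rows]) ** 2) / 2
    return j(tr), j(cv)

for d in (1, 3, 10):
    j_train, j_cv = errors(d)
    print(f"d={d}: J_train={j_train:.3f}, J_cv={j_cv:.3f}")
# d too small:   both errors high (high bias)
# d about right: both errors low
# d too large:   training error lowest, CV error typically higher (high variance)
```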

Regularization and bias/variance

• Suppose we fit a high-order polynomial with regularization (the regularization term keeps the parameter values small)

• Consider three cases

• λ = large
• All θ values are heavily penalized
• So most parameters end up close to zero
• So the hypothesis ends up being close to 0 (roughly flat)
• So high bias -> underfits the data
• λ = intermediate
• Only intermediate values give a reasonable fit
• λ = small (λ = 0)
• The regularization term becomes 0
• So high variance -> overfitting (with minimal regularization, nothing keeps the parameters small, so the high-order polynomial fits the noise)

• How do we choose a good value of $\lambda$?
• Define $J_{train}(\theta)$ to be the average squared error on the training set without the regularization term

• Define the cross validation and test set errors as before (i.e. also without the regularization term)
• Choosing $\lambda$
• Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
• Create a set of models with different degrees or any other variants.
• Iterate through the λs and for each λ go through all the models to learn some Θ.
• Compute the cross validation error $J_{CV}(\Theta)$ using the learned Θ (which was trained with λ), evaluated without the regularization term (i.e. with λ = 0).
• Select the best combo that produces the lowest error on the cross validation set.
• Using the best combo Θ and λ, apply it on Jtest(Θ) to see if it has a good generalization of the problem.
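The steps above can be sketched as follows. This is a minimal version with a single degree-8 model rather than a full grid of model variants; the synthetic data and the unpenalized-intercept normal equation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic noisy data from a smooth function
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=60)

idx = rng.permutation(len(x))
tr, cv, te = idx[:36], idx[36:48], idx[48:]

def design(rows, deg=8):
    """Polynomial design matrix (intercept column first)."""
    return np.vander(x[rows], deg + 1, increasing=True)

def fit(lam, rows):
    """Regularized normal equation; the intercept term is not penalized."""
    A = design(rows)
    reg = lam * np.eye(A.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + reg, A.T @ y[rows])

def j(theta, rows):
    """Error WITHOUT the regularization term, as the notes specify."""
    return np.mean((design(rows) @ theta - y[rows]) ** 2) / 2

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64,
           1.28, 2.56, 5.12, 10.24]

# Learn a theta for each lambda on the training set
thetas = {lam: fit(lam, tr) for lam in lambdas}

# Pick the lambda with the lowest (unregularized) cross validation error
best = min(lambdas, key=lambda lam: j(thetas[lam], cv))

# Check generalization on the untouched test set
print(best, j(thetas[best], te))
```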

Learning curves

• What is a learning curve?
• Plot Jtrain (average squared error on the training set) and Jcv (average squared error on the cross validation set)
• Plot against m (the number of training examples)
• m is normally fixed for a given data set
• So artificially reduce m and recalculate the errors for the smaller training set sizes
• Jtrain
• Error on smaller sample sizes is smaller (there is less variation to accommodate)
• So as m grows, training error grows
• Jcv
• Error on cross validation set
• But as the training set grows, the hypothesis generalizes better
• So cv error will decrease as m increases
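A learning curve can be computed with exactly this "artificially reduce m" trick. A sketch with assumed linear data; the noise (standard deviation 1) gives an irreducible error floor that both curves approach.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic linear data with irreducible noise
x = rng.uniform(0, 5, size=200)
y = 2 * x + 1 + rng.normal(0, 1, size=200)

tr, cv = np.arange(150), np.arange(150, 200)

def point(m):
    """Train a line on only the first m examples; return (J_train, J_cv)."""
    theta = np.polyfit(x[tr[:m]], y[tr[:m]], 1)
    err = lambda rows: np.mean((np.polyval(theta, x[rows]) - y[rows]) ** 2) / 2
    return err(tr[:m]), err(cv)

# One (J_train, J_cv) point per artificial training set size m
for m in (2, 10, 50, 150):
    j_train, j_cv = point(m)
    print(f"m={m:3d}: J_train={j_train:.3f}, J_cv={j_cv:.3f}")
# J_train starts near 0 (a line fits 2 points exactly) and grows with m;
# J_cv typically falls as m grows and the fit generalizes better
```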

• What do these curves look like if you have

• High bias
• e.g. fitting a straight line to the data
• Jtrain
• Training error is small at first and grows
• Training error becomes close to the cross validation error
• So the performance of the cross validation and training set end up being similar (but very poor)
• Jcv
• Straight line fit is similar for a few vs. a lot of data
• So it doesn't generalize any better with lots of data because the function just doesn't fit the data
• No increase in data will help it fit
• The signature of high bias: cross validation and training error are both high
• This also implies that if a learning algorithm has high bias, the cross validation error doesn't decrease as we get more examples
• So if an algorithm is already suffering from high bias, more data does not help
• So knowing if you're suffering from high bias is good!
• In other words, high bias is a problem with the underlying way you're modeling your data
• So more data won't improve that model
• It's too simplistic
• High variance

• e.g. high order polynomial
• Jtrain
• When the training set is small, the training error is small too
• As training set sizes increases, value is still small
• But slowly increases (in a near linear fashion)
• Error is still low
• Jcv
• Error remains high, even when you have a moderate number of examples
• Because the problem with high variance (over fitting) is your model doesn't generalize
• An indicative diagnostic that you have high variance is that there's a big gap between training error and cross validation error
• So if an algorithm is already suffering from high variance, getting more data is probably going to help

What to do next Revisited

Our decision process can be broken down as follows:

• Getting more training examples: Fixes high variance

• Trying smaller sets of features: Fixes high variance

• Adding features: Fixes high bias

• Adding polynomial features: Fixes high bias

• Decreasing λ: Fixes high bias

• Increasing λ: Fixes high variance

Diagnosing Neural Networks

• A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
• A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers, evaluate them on your cross validation set, and select the one that performs best.

Model Complexity Effects:

• Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
• Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
• In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

Machine Learning System Design

Building a Spam Classifier

Prioritizing What to Work On

Prioritizing what to work on is perhaps the most important skill programmers typically need to develop

One approach - choosing your own features

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word.

y = spam (1) or not spam (0)

Don't recount a word that appears more than once - each entry is 1 if the word appears at all, 0 otherwise
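As a sketch, the feature construction might look like this. The tiny vocabulary is purely illustrative; in practice the vocabulary would be the most frequently occurring words in the training set.

```python
import re

# Illustrative tiny vocabulary of indicator words (a real one would be
# built from the most frequently occurring words in the training set)
vocab = ["buy", "cheap", "deal", "hello", "meeting", "now"]

def email_features(text):
    """Binary feature vector: entry j is 1 if vocab[j] occurs in the
    email, 0 otherwise. Repeated occurrences are not recounted."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if w in words else 0 for w in vocab]

print(email_features("Buy now! Cheap cheap deals, buy NOW"))  # → [1, 1, 0, 0, 0, 1]
```

Note that "deals" does not match "deal" here - exactly the kind of gap that stemming (discussed under error analysis) is meant to address.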

How to spend your time to improve the accuracy of this classifier?

• Collect lots of data (for example "honeypot" project but doesn't always work)
• Develop sophisticated features based on email routine information and message body (for example: using email header data in spam emails)
• Develop algorithms to process your input in different ways (recognizing misspellings in spam).
• Often a research group will somewhat arbitrarily focus on one of these options

It is difficult to tell which of the options will be most helpful.

Error Analysis

The recommended approach to solving machine learning problems is to:

• Start with a simple algorithm, implement it quickly, and test it early on your cross validation data
• Plot learning curves to decide if more data, more features, etc. are likely to help
• Way of avoiding premature optimization
• We should let evidence guide decision making regarding development trajectory
• Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made
• See if you can work out why
• Systematic patterns - help design new features to avoid these shortcomings
• e.g.
• Built a spam classifier with 500 examples in CV set
• Here, error rate is high - gets 100 wrong
• Manually look at 100 and categorize them depending on features
• Looking at those emails
• May find most common type of spam emails are pharmacy emails, phishing emails
• See which type is most common - focus your work on those ones
• What features would have helped classify them correctly
• e.g. deliberate misspelling
• Unusual email routing
• Unusual punctuation
• May find some "spammer technique" is causing a lot of your misses
• Guide a way around it
• Importance of numerical evaluation
• Have a way of numerically evaluating the algorithm
• If you're developing an algorithm, it's really useful to have a single real-number performance metric that tells you how well it's doing
• e.g.
• Say we're deciding whether to treat a set of similar words as the same word
• This is done by stemming in NLP (e.g. "Porter stemmer" looks at the etymological stem of a word)
• This may make your algorithm better or worse
• Also worth considering how to weight errors (false positives vs. false negatives)
• e.g. is a false positive really bad, or is it worth having a few to improve performance a lot?
• Can use numerical evaluation to compare the changes
• See if a change improves an algorithm or not
• A single real number may be hard/complicated to compute
• But makes it much easier to evaluate how changes impact your algorithm
• You should do error analysis on the cross validation set instead of the test set

Handling Skewed Data

Error Metrics for Skewed Classes

• One case where it's hard to come up with a good error metric is skewed classes
• Example
• Cancer classification
• Train logistic regression model hθ(x) where
• Cancer means y = 1
• Otherwise y = 0
• Test classifier on test set
• Get 1% error
• So this looks pretty good..
• But only 0.5% have cancer
• Now, 1% error looks very bad!
• When the number of examples of one class is very small, we have skewed classes
• LOTS more of one class than another
• So standard error metrics aren't so good
• Another example
• Algorithm has 99.2% accuracy
• Make a change, now get 99.5% accuracy
• Does this really represent an improvement to the algorithm?
• Did we do something useful, or did we just create something which predicts y = 0 more often
• Get very low error, but classifier is still not great

Precision and Recall

• Two new metrics - precision and recall

• Both give a value between 0 and 1

• Evaluating classifier on a test set

• For a test set, the actual class is 1 or 0

• Algorithm predicts some value for class, predicting a value for each example in the test set

• Considering this, classification can be
• True positive (we guessed 1, it was 1)
• False positive (we guessed 1, it was 0)
• True negative (we guessed 0, it was 0)
• False negative (we guessed 0, it was 1)
• Precision

• How often does our algorithm cause a false alarm?

• Of all patients we predicted have cancer, what fraction actually have cancer?

• = true positives / # predicted positive
• = true positives / (true positives + false positives)
• High precision is good (i.e. closer to 1)

• You want a big number, because you want false positives to be as close to 0 as possible
• Recall

• How sensitive is our algorithm?
• Of all patients in set that actually have cancer, what fraction did we correctly detect
• = true positives / # actual positives
• = true positives / (true positives + false negatives)
• High recall is good (i.e. closer to 1)
• You want a big number, because you want false negatives to be as close to 0 as possible
• By computing precision and recall get a better sense of how an algorithm is doing

• This can't really be gamed
• Means we're much more sure that an algorithm is good
• Typically we say the presence of a rare class is what we're trying to determine (e.g. positive (1) is the existence of the rare thing)
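The four counts and both metrics are straightforward to compute. A minimal sketch: the example labels are made up to show how a degenerate "always predict 0" classifier is exposed.

```python
def precision_recall(actual, predicted):
    """Precision and recall over 0/1 labels, treating 1 (the rare
    class) as positive, per the convention above."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A classifier that always predicts 0 scores 90% accuracy on this
# skewed set, but precision/recall expose it:
actual   = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_0 = [0] * 10
print(precision_recall(actual, always_0))  # → (0.0, 0.0)
```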

• For many applications we want to control the trade-off between precision and recall

• Example

• Trained a logistic regression classifier
• Predict 1 if hθ(x) >= 0.5
• Predict 0 if hθ(x) < 0.5
• This classifier may give some value for precision and some value for recall
• Predict 1 only if very confident
• One way to do this is to modify the prediction threshold
• Predict 1 if hθ(x) >= 0.8
• Predict 0 if hθ(x) < 0.8
• Now we can be more confident a 1 is a true positive
• But classifier has lower recall - predict y = 1 for a smaller number of patients
• Risk of false negatives
• Another example - suppose we want to avoid false negatives
• A false negative is probably worse in the cancer example (we'd miss someone who does have cancer)
• Now we may set a lower threshold
• Predict 1 if hθ(x) >= 0.3
• Predict 0 if hθ(x) < 0.3
• i.e. predict cancer if there's even a 30% chance they have it
• So now we have a higher recall, but lower precision
• Risk of false positives, because we're less discriminating in deciding what means the person has cancer
• This threshold defines the trade-off

• We can show this graphically by plotting precision vs. recall

• This curve can take many different shapes depending on classifier details

• Is there a way to automatically choose the threshold?

• Or, if we have a few algorithms, how do we compare different algorithms or parameter sets?

• How do we decide which of these algorithms is best?

• We spoke previously about using a single real number evaluation metric
• By switching to precision/recall we have two numbers
• Now comparison becomes harder
• Better to have just one number
• How can we convert P & R into one number?
• One option is the average - (P + R)/2
• This is not such a good solution
• It means a classifier which predicts y = 1 all the time gets high recall and low precision, yet can still score a decent average
• Similarly, if we predict y = 1 rarely, we get high precision and low recall
• For three example classifiers, the averages would be 0.45, 0.4 and 0.51
• 0.51 is the best average, despite that classifier having a recall of 1 - i.e. it predicts y = 1 for everything
• So average isn't great
• F1 Score (F score)
• = 2 * (PR / (P + R))
• The F score is like an average of precision and recall that gives a higher weight to the lower value
• There are many possible formulas for combining precision and recall into one value; this is a common choice
• If P = 0 or R = 0, then F score = 0
• If P = 1 and R = 1, then F score = 1
• All other values lie between 0 and 1
• Threshold offers a way to control trade-off between precision and recall

• Fscore gives a single real number evaluation metric

• If you're trying to automatically set the threshold, one way is to try a range of threshold values and evaluate them on your cross validation set
• Then pick the threshold which gives the best fscore.
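The threshold sweep might be sketched like this. The scores and labels are invented; in practice `scores` would be the hθ(x) outputs on your cross validation set.

```python
def f_score(p, r):
    """F score = 2PR/(P+R): weights the lower of precision/recall more."""
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(scores, actual, thresholds):
    """Evaluate each threshold on cross validation data; return the
    one with the highest F score, plus that score."""
    best_t, best_f = None, -1.0
    for thresh in thresholds:
        pred = [1 if s >= thresh else 0 for s in scores]
        tp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 1)
        fp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 0)
        fn = sum(1 for p, a in zip(pred, actual) if p == 0 and a == 1)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        if f_score(prec, rec) > best_f:
            best_t, best_f = thresh, f_score(prec, rec)
    return best_t, best_f

# Hypothetical h(x) outputs and true labels on a cross validation set
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1,   1,   0,   1,   0,   1,   0,   0]
t, f = best_threshold(scores, actual, [0.1, 0.3, 0.5, 0.7, 0.9])
print(t, f)
```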

Using Large Data Sets

Data for Machine Learning

In earlier videos we cautioned against just blindly getting more data. It turns out that, under certain conditions, getting more data is a very effective way to improve performance.

• Algorithms give remarkably similar performance
• As training set sizes increases accuracy increases
• Take an algorithm and give it more data - it should beat a "better" algorithm with less data
• Shows that
• The choice of algorithm matters relatively little
• More data helps

We use a learning algorithm with many parameters, such as logistic regression or linear regression with many features, or a neural network with many hidden units.

• These are powerful learning algorithms with many parameters which can fit complex functions
• Such algorithms are low bias algorithms
• Little systematic bias in what they can describe - they are flexible
• Use a small training set
• Training error should be small
• Use a very large training set
• With so much data, even these complex algorithms are unlikely to overfit
• So the training set error should be close to the test set error
• And since the training error is small, the test set error should also be small

Another way to think about this is we want our algorithm to have low bias and low variance.

• Low bias --> use complex algorithm
• Low variance --> use large training set