Machine Learning Week3 Logistic Regression

Classification and Representation


  • y is a discrete value

  • The variable to predict in these problems is y

    • y is either 0 or 1
      • 0 = negative class (absence of something)
      • 1 = positive class (presence of something)
    • Start with binary class problems
      • Later look at multiclass classification problem
  • How do we develop a classification algorithm?

    • Tumor size vs malignancy (0 or 1)

    • We could use linear regression

      • Then threshold the classifier output (i.e. anything over some value is yes, else no)

  • The above graph does a reasonable job of stratifying the data points into one of two classes.

  • Problems with linear regression

    • Y is 0 or 1
    • Hypothesis can give values larger than 1 or less than 0

Hypothesis Representation

  • We want our classifier to output values between 0 and 1

    • When using linear regression we did hθ(x) = θT x

    • For the classification hypothesis representation we use hθ(x) = g(θT x)

      • Where we define g(z)
    • z is a real number

    • g(z) = 1/(1 + e^(-z))

      • This is the sigmoid function, or the logistic function
    • If we combine these equations we can write out the hypothesis as

      • hθ(x) = 1/(1 + e^(-θT x))

  • What does the sigmoid function look like?

    • It rises smoothly from 0 (as z → -∞) to 1 (as z → +∞), crossing 0.5 at z = 0

  • Interpreting hypothesis output
    • When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y = 1 on input x
      • Example
        • If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
        • hθ(x) = 0.7
          • Tells a patient they have a 70% chance of a tumor being malignant
      • We can write this using the following notation
        • hθ(x) = P(y=1|x ; θ)
      • What does this mean?
        • Probability that y=1, given x, parameterized by θ
    • Since this is a binary classification task we know y = 0 or 1
      • So the following must be true
        • P(y=1|x ; θ) + P(y=0|x ; θ) = 1
        • P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
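The hypothesis above can be sketched in a few lines (the course uses Octave; this is an illustrative Python version, and the θ values and tumour size here are made up):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); theta and x are plain lists."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# x0 = 1 (as always), x1 = tumourSize; theta values made up for illustration
theta = [-6.0, 0.05]
x = [1.0, 160.0]
print(round(h(theta, x), 2))  # estimated P(y = 1 | x; theta) -> 0.88
```

The output is read exactly as in the notes: an estimated probability that y = 1 given x, parameterized by θ.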

Decision boundary

  • When is it exactly that hθ(x) is greater than 0.5?
    • Look at sigmoid function
      • g(z) is greater than or equal to 0.5 when z is greater than or equal to 0

    • So if z is positive, g(z) is greater than 0.5
      • z = (θT x)
    • So when θT x >= 0, hθ(x) >= 0.5

Decision boundary

  • hθ(x) = g(θ0 + θ1x1 + θ2x2)

  • For example, θ0 = -3, θ1 = 1, θ2 = 1
    • So, θT is a row vector = [-3,1,1]
    • z = θT x = -3x0 + x1 + x2
    • We predict "y = 1" if
      • -3 + x1 + x2 >= 0, i.e. x1 + x2 >= 3
      • The line x1 + x2 = 3 is our decision boundary
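The prediction rule for this example can be checked with a short sketch (Python rather than the course's Octave; the test points are made up):

```python
def predict(theta, x):
    # Predict y = 1 exactly when theta^T x >= 0, i.e. when h_theta(x) >= 0.5
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]                 # theta0 = -3, theta1 = 1, theta2 = 1
print(predict(theta, [1.0, 1.0, 1.0]))   # x1 + x2 = 2 < 3, below the boundary -> 0
print(predict(theta, [1.0, 2.0, 2.0]))   # x1 + x2 = 4 >= 3, above the boundary -> 1
```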

Non-linear decision boundaries

  • Get logistic regression to fit a complex non-linear data set
    • Like polynomial regression, add higher order terms
    • So say we have
      • hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1^2 + θ4x2^2)
        • Say θT was [-1,0,0,1,1]; then we say:
          • Predict "y = 1" if
            • -1 + x1^2 + x2^2 >= 0, or
            • x1^2 + x2^2 >= 1
        • If we plot x1^2 + x2^2 = 1
          • This gives us a circle of radius 1 centered at the origin

  • By using higher order polynomial terms, we can get even more complex decision boundaries
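A minimal sketch of the circular boundary above, assuming the θT = [-1,0,0,1,1] example (Python used for illustration):

```python
def predict_circle(x1, x2):
    # theta^T features = -1 + 0*x1 + 0*x2 + x1^2 + x2^2, with theta = [-1, 0, 0, 1, 1]
    z = -1.0 + 0.0 * x1 + 0.0 * x2 + x1 ** 2 + x2 ** 2
    return 1 if z >= 0 else 0          # y = 1 outside (or on) the unit circle

print(predict_circle(0.5, 0.5))  # inside the unit circle -> 0
print(predict_circle(1.0, 1.0))  # outside the unit circle -> 1
```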

Logistic Regression Model

Cost Function

  • Define the optimization objective (the cost function) we use to fit the parameters

  • This is the situation

    • Set of m training examples
    • Each example is a feature vector which is n+1 dimensional
    • x0 = 1
    • y ∈ {0,1}
    • Hypothesis is based on parameters (θ)
  • Linear regression uses the following cost function to determine θ

    • J(θ) = (1/2m) Σ_{i=1..m} (hθ(x^(i)) - y^(i))^2
    • We can redefine J(θ) as

      • J(θ) = (1/m) Σ_{i=1..m} Cost(hθ(x^(i)), y^(i)), where Cost(hθ(x), y) = (1/2)(hθ(x) - y)^2

  • What does this actually mean?

    • This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
    • If we use this function for logistic regression this is a non-convex function for parameter optimization
  • What do we mean by non convex?

    • We have some function - J(θ) - for determining the parameters

    • Our hypothesis function has a non-linearity (the sigmoid function inside hθ(x))

      • This is a complicated non-linear function
    • If you take hθ(x) and plug it into the Cost() function, then plug the Cost() function into J(θ) and plot J(θ), we find many local optima -> a non-convex function

    • Why is this a problem

      • Lots of local minima mean gradient descent may not find the global optimum - it may get stuck in a local minimum
    • We would like a convex function so if you run gradient descent you converge to a global minimum

A convex logistic regression cost function

  • To get around this we need a different, convex Cost() function which means we can apply gradient descent

  • Plot of the cost when y = 1

    • Cost(hθ(x), y) = -log(hθ(x))
  • So when we're right, the cost is 0

    • Otherwise the cost slowly increases as we become "more" wrong
    • X axis is what we predict (hθ(x))
    • Y axis is the cost associated with that prediction

This cost function has some interesting properties

  • If y = 1 and hθ(x) = 1

    • If the hypothesis predicts exactly 1 and that's exactly correct, then the cost is 0 (exactly, not nearly 0)
  • As hθ(x) goes to 0

    • Cost goes to infinity
    • This captures the intuition that if hθ(x) = 0 (predict P (y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
  • Plot of the cost when y = 0

    • Cost(hθ(x), y) = -log(1 - hθ(x))
    • This is 0 when hθ(x) = 0 and goes to infinity as hθ(x) approaches 1

Simplified Cost Function and Gradient Descent

  • Define a simpler way to write the cost function and apply gradient descent to the logistic regression

  • Now compress them into one equation - more efficient

    • Cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
  • In summary, our cost function can be defined as

    • J(θ) = -(1/m) Σ_{i=1..m} [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]
  • Why do we choose this function when other cost functions exist?

    • This cost function can be derived from statistics using the principle of maximum likelihood estimation
      • Note this corresponds to a Bernoulli model of y given x, rather than a Gaussian assumption about the features
    • It also has the nice property that it's convex
  • To fit parameters θ:

    • Find parameters θ which minimize J(θ)
    • This means we have a set of parameters to use in our model for future predictions
  • Then, if we're given some new example with a set of features x, we take the θ we generated and output our prediction using hθ(x) = 1/(1 + e^(-θT x)), treated as P(y = 1 | x; θ)
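A minimal Python sketch of the cost function J(θ) above (the tiny training set here is made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum over i of [y*log(h) + (1 - y)*log(1 - h)]."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        hi = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        total += yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
    return -total / m

# made-up training set: x0 = 1 plus one feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
print(cost([0.0, 0.0], X, y))  # with theta = 0, h = 0.5 everywhere, so J = log(2)
```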

How to minimize the logistic regression cost function

  • Use gradient descent as before
  • Repeatedly update each parameter using a learning rate α

    • θj := θj - α (1/m) Σ_{i=1..m} (hθ(x^(i)) - y^(i)) xj^(i)  (simultaneously for all j)
  • The equation is the same as the linear regression rule
    • The only difference is that our definition of the hypothesis has changed
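The update rule above can be sketched as follows (a plain Python version with a made-up, linearly separable training set; the course itself uses Octave):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Simultaneous updates: theta_j := theta_j - alpha*(1/m)*sum((h - y) * x_j)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        h = [sigmoid(sum(t * xj for t, xj in zip(theta, xi))) for xi in X]
        grad = [sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]   # update all j at once
    return theta

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
# the fitted boundary should separate the training points correctly
print([1 if sigmoid(sum(t * xj for t, xj in zip(theta, xi))) >= 0.5 else 0 for xi in X])
```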

Advanced Optimization

  • What is gradient descent actually doing?
    • We have some cost function J(θ), and we want to minimize it
    • We need to write code which can take θ as input and compute the following
      • J(θ)
      • Partial derivative of J(θ) with respect to θj (for j = 0 to j = n)
  • Alternatively, instead of gradient descent to minimize the cost function we could use

    • Conjugate gradient
    • BFGS (Broyden-Fletcher-Goldfarb-Shanno)
    • L-BFGS (Limited memory - BFGS)
  • These are more optimized algorithms which take that same input and minimize the cost function

  • These are very complicated algorithms

  • Some properties

    • Advantages
      • No need to manually pick alpha (learning rate)
        • Have a clever inner loop (line search algorithm) which tries a bunch of alpha values and picks a good one
      • Often faster than gradient descent
        • Do more than just pick a good learning rate
      • Can be used successfully without understanding their complexity
    • Disadvantages
      • Could make debugging more difficult
      • Shouldn't be implemented yourself - use library implementations
      • Different libraries may use different implementations - may hit performance

Using advanced cost minimization algorithms

  • How to use algorithms
    • Say we have the following example

  • First we need to define our cost function, which should have the following signature
function [jVal, gradient] = costFunction(theta)
  • Input for the cost function is theta, a vector of the θ parameters
  • Two return values from costFunction are
    • jVal
      • How we compute the cost function J(θ)
        • In this case = (θ1 - 5)^2 + (θ2 - 5)^2
    • gradient
      • 2 by 1 vector
      • 2 elements are the two partial derivative terms
      • i.e. this is an n-dimensional vector
        • Each indexed value gives the partial derivatives for the partial derivative of J(θ) with respect to θi
        • Where i is the index position in the gradient vector
  • With the cost function implemented, we can call the advanced algorithm using
options = optimset('GradObj', 'on', 'MaxIter', 100); % define the options data structure
initialTheta = zeros(2,1);                           % initialize the theta values
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options); % run the algorithm
  • Here
    • options is a data structure giving options for the algorithm
    • fminunc
      • function minimize the cost function (find minimum of unconstrained multivariable function)
    • @costFunction is a pointer to the costFunction function to be used
  • For the octave implementation
    • initialTheta must have at least two elements (fminunc doesn't handle a one-dimensional θ)
  • How do we apply this to logistic regression?
    • Here we have a vector

  • Here
    • theta is an n+1 dimensional column vector
    • Octave indexes from 1, not 0, so θ0 is stored in theta(1)
  • Write a cost function which captures the cost function for logistic regression
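For illustration, here is a Python stand-in for the fminunc workflow above, using the same toy cost (θ1 - 5)^2 + (θ2 - 5)^2 but plain gradient steps in place of an advanced optimizer (in Python, scipy.optimize.minimize plays a role similar to fminunc):

```python
def cost_function(theta):
    """Toy cost from the fminunc example: J = (theta1 - 5)^2 + (theta2 - 5)^2."""
    jval = (theta[0] - 5.0) ** 2 + (theta[1] - 5.0) ** 2
    gradient = [2.0 * (theta[0] - 5.0), 2.0 * (theta[1] - 5.0)]
    return jval, gradient

# stand-in for fminunc: plain gradient steps with a fixed step size
theta = [0.0, 0.0]
for _ in range(100):
    jval, grad = cost_function(theta)
    theta = [t - 0.1 * g for t, g in zip(theta, grad)]
print([round(t, 3) for t in theta])  # converges to the minimum at [5.0, 5.0]
```

The advanced algorithms listed earlier consume exactly this kind of (cost, gradient) pair but choose the step size themselves.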

Multiclass Classification


  • Getting logistic regression for multiclass classification using one vs. all
  • Multiclass - more than yes or no (1 or 0)
    • Classification with multiple classes for assignment

  • One vs. all classification

    • Split the training set into three separate binary classification problems

      • i.e. create a new fake training set

        • Triangles (1) vs crosses and squares (0): hθ^(1)(x)
          • P(y = 1 | x; θ^(1))
        • Crosses (1) vs triangles and squares (0): hθ^(2)(x)
          • P(y = 1 | x; θ^(2))
        • Squares (1) vs triangles and crosses (0): hθ^(3)(x)
          • P(y = 1 | x; θ^(3))

  • Overall
    • Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i
    • On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x)
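The one-vs-all prediction rule above can be sketched as follows (the per-class θ vectors here are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Pick the class i whose classifier h_theta^(i)(x) is most confident."""
    probs = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for theta in thetas]
    return max(range(len(probs)), key=lambda i: probs[i])

# three hypothetical per-class parameter vectors (made up for illustration)
thetas = [[1.0, -2.0], [0.0, 0.5], [-1.0, 1.0]]
x = [1.0, 3.0]
print(predict_one_vs_all(thetas, x))  # index of the most probable class
```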


Solving the Problem of Overfitting

The Problem of Overfitting

Overfitting with linear regression

  • Take our house pricing as an example again
    • Fit a linear function to the data
      • This is underfitting - also known as high bias
      • The term "bias" is a historical/technical one - if we're fitting a straight line to the data we have a strong preconception that there should be a linear fit
    • Fit a quadratic function
      • Works well
    • Fit a 4th order polynomial
      • Now the curve fits through all five examples
        • Seems to do a good job fitting the training set
        • But, despite fitting the data we've provided very well, this is actually not such a good model
      • This is overfitting - also known as high variance
    • Algorithm has high variance
      • High variance - if fitting high order polynomial then the hypothesis can basically fit any data
      • Space of hypothesis is too large

  • To recap, if we have too many features then the learned hypothesis may give a cost function of exactly zero
    • But this tries too hard to fit the training set
    • Fails to provide a general solution - unable to generalize (apply to new examples)

Overfitting with logistic regression

  • Same thing can happen to logistic regression
    • A sigmoid of a simple linear function can underfit
    • But a high order polynomial inside the sigmoid gives an overfit (high variance) hypothesis

Addressing overfitting

  1. Reduce number of features
    • Manually select which features to keep
    • Model selection algorithms are discussed later (good for reducing number of features)
    • But, in reducing the number of features we lose some information
      • Ideally select those features which minimize data loss, but even so, some info is lost
  2. Regularization
    • Keep all features, but reduce magnitude of parameters θ
    • Works well when we have a lot of features, each of which contributes a bit to predicting y

Cost function optimization for regularization

  • Penalize and make some of the θ parameters really small
    • e.g. here θ3 and θ4

  • We modify our cost function with large penalty terms on θ3 and θ4 (e.g. adding 1000*θ3^2 + 1000*θ4^2)

    • So here we end up with θ3 and θ4 being close to zero (because the penalty constants are massive)
    • So we're basically left with a quadratic function

    • More generally, regularization is as follows
  • Regularization
    • Small values for the parameters correspond to a simpler hypothesis (you effectively get rid of some of the terms)
    • A simpler hypothesis is less prone to overfitting
  • Another example
    • Have 100 features x1, x2, ..., x100
    • Unlike the polynomial example, we don't know what are the high order terms
      • How do we pick which ones to shrink?
    • With regularization, take cost function and modify it to shrink all the parameters
      • Add a term at the end
        • This regularization term shrinks every parameter
        • By convention you don't penalize θ0 - minimization is from θ1 onwards

  • λ is the regularization parameter
    • Controls a trade off between our two goals
        1. Want to fit the training set well
        2. Want to keep the parameters small
  • λ should be chosen carefully - not too big...
    • We look at some automatic ways to select λ later in the course

Regularized linear regression

  • Our linear regression cost with regularization is

    • J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) - y^(i))^2 + λ Σ_{j=1..n} θj^2 ]
  • Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2,...,n simultaneously

    • θj := θj - α (1/m) Σ_{i=1..m} (hθ(x^(i)) - y^(i)) xj^(i)
  • We've got the θ0 update here shown explicitly
    • This is because for regularization we don't penalize θ0 so treat it slightly differently
  • We can show using calculus that the partial derivative of the regularized J(θ) is

    • ∂J(θ)/∂θj = (1/m) Σ_{i=1..m} (hθ(x^(i)) - y^(i)) xj^(i) + (λ/m) θj  (for j >= 1)

  • So if you group the θj terms together

    • θj := θj (1 - α λ/m) - α (1/m) Σ_{i=1..m} (hθ(x^(i)) - y^(i)) xj^(i)
  • The term (1 - α λ/m)

    • Is usually a number slightly less than 1
    • Usually the learning rate is small and m is large
      • So this typically evaluates to (1 - a small number)
      • So the term is often around 0.95 to 0.99
  • This in effect means θj gets multiplied by something like 0.99 on each update

    • This makes each θj a little smaller
    • The second term is exactly the same as in the original gradient descent
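The regularized update can be sketched in Python (numbers made up; note how θj shrinks even when the fit is already perfect, while θ0 is untouched):

```python
def regularized_update(theta, X, y, alpha, lam):
    """One simultaneous step; theta0 is not penalized, theta_j shrinks by (1 - alpha*lam/m)."""
    m = len(y)
    # linear regression hypothesis: h = theta^T x
    h = [sum(t * xj for t, xj in zip(theta, xi)) for xi in X]
    new_theta = []
    for j in range(len(theta)):
        grad = sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
        if j == 0:
            new_theta.append(theta[j] - alpha * grad)                       # no penalty on theta0
        else:
            new_theta.append(theta[j] * (1 - alpha * lam / m) - alpha * grad)
    return new_theta

# this toy data is fit perfectly by theta = [1, 1], so the gradient term is zero,
# yet theta1 still shrinks by the factor (1 - alpha*lam/m) = 1 - 0.01*1/2 = 0.995
theta = regularized_update([1.0, 1.0], [[1.0, 2.0], [1.0, 4.0]], [3.0, 5.0], 0.01, 1.0)
print([round(t, 4) for t in theta])  # -> [1.0, 0.995]
```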

Regularization with the normal equation

  • The normal equation is the other way to fit a linear regression model
    • Minimize J(θ) in closed form
    • To use regularization we add a term inside the inverse: θ = (XT X + λ M)^(-1) XT y
      • M is the (n+1) x (n+1) identity matrix with the top-left entry (the θ0 position) set to 0, so θ0 is not penalized

Regularization for logistic regression

  • Logistic regression cost function is as follows:

    • J(θ) = -(1/m) Σ_{i=1..m} [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]
  • To regularize it we add an extra term

    • + (λ/2m) Σ_{j=1..n} θj^2
  • The gradient descent update then looks the same as for regularized linear regression; only the hypothesis hθ(x) differs (it now contains the sigmoid)
