# Machine Learning Week3 Logistic Regression

## Classification and Representation

### Classification

• y is a discrete value

• Variable in these problems is Y

• Y is either 0 or 1
• 0 = negative class (absence of something)
• 1 = positive class (presence of something)
• Later look at multiclass classification problem
• How do we develop a classification algorithm?

• Tumor size vs malignancy (0 or 1)

• We could use linear regression

• Then threshold the classifier output (i.e. anything over some value is yes, else no)

• The above graph does a reasonable job of stratifying the data points into one of two classes.

• Problems with linear regression

• Y is 0 or 1
• Hypothesis can give the value large than 1 or less than 0

### Hypothesis Representation

• We want our classifier to output values between 0 and 1

• When using linear regression we did hθ(x) = (θT x)

• For classification hypothesis representation we do hθ(x) = g((θT x))

• Where we define g(z)
• z is a real number

• g(z) = 1/(1 + e-z)

• This is the sigmoid function, or the logistic function
• If we combine these equations we can write out the hypothesis as

• What does the sigmoid function look like

• Interpreting hypothesis output
• When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y = 1 on input x
• Example
• If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
• hθ(x) = 0.7
• Tells a patient they have a 70% chance of a tumor being malignant
• We can write this using the following notation
• hθ(x) = P(y=1|x ; θ)
• What does this mean?
• Probability that y=1, given x, parameterized by θ
• Since this is a binary classification task we know y = 0 or 1
• So the following must be true
• P(y=1|x ; θ) + P(y=0|x ; θ) = 1
• P(y=0|x ; θ) = 1 - P(y=1|x ; θ)

### Decision boundary

• When is it exactly that hθ(x) is greater than 0.5?
• Look at sigmoid function
• g(z) is greater than or equal to 0.5 when z is greater than or equal to 0

• So if z is positive, g(z) is greater than 0.5
• z = (θT x)
• So when θT x >= 0, hθ >= 0.5

#### Decision boundary

• hθ(x) = g(θ0 + θ1x1 + θ2x2)

• For example, θ0 = -3, θ1 = 1, θ2 = 1
• So, θT is a row vector = [-3,1,1]
• z = θT x = -3x0 + x1 + x2
• We predict "y = 1" if
• -3 + x1 + x2 >= 0
• x1 + x2 = 3 is our decision boundary

#### Non-linear decision boundaries

• Get logistic regression to fit a complex non-linear data set
• Like polynomial regress add higer order terms
• So say we have
• hθ(x) = g(θ0 + θ1x1+ θ3x12 + θ4x22)
• Say θT was [-1,0,0,1,1] then we say;
• Predict that "y = 1" if
• -1 + x12 + x22 >= 0 or
• x12 + x22 >= 1
• If we plot x12 + x22 = 1
• This gives us a circle with a radius of 1 around 0

• By using higher order polynomial terms, we can get even more complex decision boundaries

## Logistic Regression Model

### Cost Function

• Define the optimization object for the cost function we use the fit the parameters

• This is the situation

• Set of m training examples
• Each example is a feature vector which is n+1 dimensional
• x0 = 1
• y ∈ {0,1}
• Hypothesis is based on parameters (θ)
• Linear regression uses the following function to determine θ

• We can redefine J(θ) as

• What does this actually mean?

• This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
• If we use this function for logistic regression this is a non-convex function for parameter optimization
• What do we mean by non convex?

• We have some function - J(θ) - for determining the parameters

• Our hypothesis function has a non-linearity (sigmoid function of

hθ(x) )

• This is a complicated non-linear function
• If you take hθ(x) and plug it into the Cost() function, and them plug the Cost() function into J(θ) and plot J(θ) we find many local optimum -> non convex function

• Why is this a problem

• Lots of local minima mean gradient descent may not find the global optimum - may get stuck in a global minimum
• We would like a convex function so if you run gradient descent you converge to a global minimum

#### A convex logistic regression cost function

• To get around this we need a different, convex Cost() function which means we can apply gradient descent

• Plot y = 1

• So when we're right, cost function is 0

• Else it slowly increases cost function as we become "more" wrong
• X axis is what we predict
• Y axis is the cost associated with that prediction

This cost functions has some interesting properties

• If y = 1 and

hθ(x) = 1

• If hypothesis predicts exactly 1 and thats exactly correct then that corresponds to 0 (exactly, not nearly 0)
• As hθ(x) goes to 0

• Cost goes to infinity
• This captures the intuition that if hθ(x) = 0 (predict P (y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
• Plot y = 0

### Simplified Cost Function and Gradient Descent

• Define a simpler way to write the cost function and apply gradient descent to the logistic regression

• Now compress them into one equation - more efficient

• cost(hθ, (x),y) = -ylog( hθ(x) ) - (1-y)log( 1- hθ(x) )
• In summary, our cost function can be defined as

• Why do we chose this function when other cost functions exist?

• This cost function can be derived from statistics using the principle of maximum likelihood estimation
• Note this does mean there's an underlying Gaussian assumption relating to the distribution of features
• Also has the nice property that it's convex
• To fit parameters θ:

• Find parameters θ which minimize J(θ)
• This means we have a set of parameters to use in our model for future predictions
• Then, if we're given some new example with set of features x, we can take the θ which we generated, and output our prediction using

#### How to minimize the logistic regression cost function

• Use gradient descent as before
• Repeatedly update each parameter using a learning rate

• The equation is the same as the linear regression rule
• The only difference is that our definition for the hypothesis has changed

• What is gradient descent actually doing?
• We have some cost function J(θ), and we want to minimize it
• We need to write code which can take θ as input and compute the following
• J(θ)
• Partial derivative if J(θ) with respect to j (where j=0 to j = n)
• Alternatively, instead of gradient descent to minimize the cost function we could use

• BFGS (Broyden-Fletcher-Goldfarb-Shanno)
• L-BFGS (Limited memory - BFGS)
• These are more optimized algorithms which take that same input and minimize the cost function

• These are very complicated algorithms

• Some properties

• No need to manually pick alpha (learning rate)
• Have a clever inner loop (line search algorithm) which tries a bunch of alpha values and picks a good one
• Often faster than gradient descent
• Do more than just pick a good learning rate
• Can be used successfully without understanding their complexity
• Could make debugging more difficult
• Should not be implemented themselves
• Different libraries may use different implementations - may hit performance

Using advanced cost minimization algorithms

• How to use algorithms
• Say we have the following example

• First we need to define our cost function, which should have the following signature
• Input for the cost function is THETA, which is a vector of the θ parameters
• Two return values from costFunction are
• jval
• How we compute the cost function θ (the underived cost function)
• In this case = (θ1 - 5)2 + (θ2 - 5)2
• 2 by 1 vector
• 2 elements are the two partial derivative terms
• i.e. this is an n-dimensional vector
• Each indexed value gives the partial derivatives for the partial derivative of J(θ) with respect to θi
• Where i is the index position in the gradient vector
• With the cost function implemented, we can call the advanced algorithm using
• Here
• options is a data structure giving options for the algorithm
• fminunc
• function minimize the cost function (find minimum of unconstrained multivariable function)
• @costFunction is a pointer to the costFunction function to be used
• For the octave implementation
• initialTheta must be a matrix of at least two dimensions
• How do we apply this to logistic regression?
• Here we have a vector

• Here
• theta is a n+1 dimensional column vector
• Octave indexes from 1, not 0
• Write a cost function which captures the cost function for logistic regression

## Multiclass Classification

### One-vs-all

• Getting logistic regression for multiclass classification using one vs. all
• Multiclass - more than yes or no (1 or 0)
• Classification with multiple classes for assignment

• One vs. all classification

• Split the training set into three separate binary classification problems

• i.e. create a new fake training set

• Triangle (1) vs crosses and squares (0) hθ1(x)
• P(y=1 | x1; θ)
• Crosses (1) vs triangle and square (0) hθ2(x)
• P(y=1 | x2; θ)
• Square (1) vs crosses and square (0) hθ3(x)
• P(y=1 | x3; θ)

• Overall
• Train a logistic regression classifier hθ(i)(x) for each class i to predict the probability that y = i
• On a new input, x to make a prediction, pick the class i that maximizes the probability that hθ(i)(x) = 1

# Regularization

## Solving the Problem of Overfitting

### The Problem of Overfitting

#### Overfitting with linear regression

• Take our house pricing as an example again
• Fit a linear function to the data
• This is underfitting - also known as high bias
• Bias is a historic/technical one - if we're fitting a straight line to the data we have a strong preconception that there should be a linear fit
• Fit a quadratic function
• Works well
• Fit a 4th order polynomial
• Now curve fit's through all five examples
• Seems to do a good job fitting the training set
• But, despite fitting the data we've provided very well, this is actually not such a good model
• This is overfitting - also known as high variance
• Algorithm has high variance
• High variance - if fitting high order polynomial then the hypothesis can basically fit any data
• Space of hypothesis is too large

• To recap, if we have too many features then the learned hypothesis may give a cost function of exactly zero
• But this tries too hard to fit the training set
• Fails to provide a general solution - unable to generalize (apply to new examples)

#### Overfitting with logistic regression

• Same thing can happen to logistic regression
• Sigmoidal function is an underfit
• But a high order polynomial gives and overfitting (high variance hypothesis)

1. Reduce number of features
• Manually select which features to keep
• Model selection algorithms are discussed later (good for reducing number of features)
• But, in reducing the number of features we lose some information
• Ideally select those features which minimize data loss, but even so, some info is lost
2. Regularization
• Keep all features, but reduce magnitude of parameters θ
• Works well when we have a lot of features, each of which contributes a bit to predicting y

### Cost function optimization for regularization

• Penalize and make some of the θ parameters really small
• e.g. here θ3 and θ4

• The addition in blue is a modification of our cost function to help penalize θ3 and θ4

• So here we end up with θ3 and θ4 being close to zero (because the constants are massive)
• So we're basically left with a quadratic function

• More generally, regularization is as follows
• Regularization
• Small values for parameters corresponds to a simpler hypothesis (you effectively get rid of some of the terms)
• A simpler hypothesis is less prone to overfitting
• Another example
• Have 100 features x1, x2, ..., x100
• Unlike the polynomial example, we don't know what are the high order terms
• How do we pick the ones to pick to shrink?
• With regularization, take cost function and modify it to shrink all the parameters
• Add a term at the end
• This regularization term shrinks every parameter
• By convention you don't penalize θ0 - minimization is from θ1 onwards

• λ is the regularization parameter
• Controls a trade off between our two goals
1. Want to fit the training set well
1. Want to keep parameters small
• λ should be chosen carefully - not too big...
• We look at some automatic ways to select λ later in the course

### Regularized linear regression

• Our linear regression with regularization is shown below

• Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2...n simultaneously

• Shown below

• We've got the θ0 update here shown explicitly
• This is because for regularization we don't penalize θ0 so treat it slightly differently
• We can show using calculus that the equation given below is the partial derivative of the regularized J(θ)

• So if you group θj terms together

• The term

• Is going to be a number less than 1 usually
• Usually learning rate is small and m is large
• So this typically evaluates to (1 - a small number)
• So the term is often around 0.99 to 0.95
• This in effect means θj gets multiplied by 0.99

• Means the squared norm of θj a little smaller
• The second term is exactly the same as the original gradient descent

### Regularization with the normal equation

• Normal equation is the other linear regression model
• Minimize the J(θ) using the normal equation
• To use regularization we add a term (+ λ [n+1 x n+1]) to the equation
• [n+1 x n+1] is the n+1 identity matrix

#### Regularization for logistic regression

• Logistic regression cost function is as follows:

• To modify it we have to add an extra term

• Modified logistic regression with gradient descent function was as follows: