## Classification and Representation

### Classification

In classification problems, y is a discrete value

The variable we are trying to predict is y

- y is either 0 or 1
- 0 = negative class (absence of something)
- 1 = positive class (presence of something)

- Start with binary classification problems
- Later look at the multiclass classification problem

How do we develop a classification algorithm?

Tumor size vs malignancy (0 or 1)

We could use linear regression

Then threshold the classifier output (i.e. anything over some value is yes, else no)

The above graph does a reasonable job of stratifying the data points into one of two classes.

Problems with linear regression

- y is 0 or 1
- The hypothesis can output values larger than 1 or less than 0

### Hypothesis Representation

We want our classifier to output values between 0 and 1

When using linear regression our hypothesis was

- h_θ(x) = θ^T x

For classification the hypothesis representation is

- h_θ(x) = g(θ^T x)
- Where we define g(z)
  - z is a real number
  - g(z) = 1/(1 + e^(-z))
  - This is the **sigmoid function**, or the **logistic function**

If we combine these equations we can write out the hypothesis as

- h_θ(x) = 1/(1 + e^(-θ^T x))
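As a minimal Python sketch (the course's own examples are in Octave), the sigmoid and hypothesis can be written as:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for plain-list vectors theta and x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)
```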

What does the sigmoid function look like? It crosses 0.5 at z = 0, and asymptotes to 0 as z → -∞ and to 1 as z → +∞

- Interpreting the hypothesis output
  - When our hypothesis h_θ(x) outputs a number, we treat that value as the estimated probability that y = 1 on input x
  - Example
    - If x is a feature vector with x_0 = 1 (as always) and x_1 = tumorSize
    - h_θ(x) = 0.7
      - Tells a patient they have a 70% chance of the tumor being malignant
  - We can write this using the following notation
    - h_θ(x) = P(y = 1 | x; θ)
  - What does this mean?
    - Probability that y = 1, given x, parameterized by θ
  - Since this is a binary classification task we know y = 0 or 1
  - So the following must be true
    - P(y = 1 | x; θ) + P(y = 0 | x; θ) = 1
    - P(y = 0 | x; θ) = 1 - P(y = 1 | x; θ)

### Decision boundary

- When is it exactly that h_θ(x) is greater than or equal to 0.5?
  - Look at the sigmoid function
    - g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
  - So if z is positive, g(z) is greater than 0.5
    - z = θ^T x
  - So when θ^T x >= 0, h_θ(x) >= 0.5

#### Decision boundary

- h_θ(x) = g(θ_0 + θ_1x_1 + θ_2x_2)
- For example, θ_0 = -3, θ_1 = 1, θ_2 = 1
  - So θ^T is a row vector = [-3, 1, 1]
  - z = θ^T x = -3x_0 + x_1 + x_2
  - We predict "y = 1" if
    - -3 + x_1 + x_2 >= 0
  - x_1 + x_2 = 3 is our decision boundary
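A Python illustration of this example (using the θ = [-3, 1, 1] values from above):

```python
def predict(theta, x):
    """Predict y = 1 exactly when theta^T x >= 0, i.e. g(theta^T x) >= 0.5."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]      # theta_0 = -3, theta_1 = 1, theta_2 = 1
# x = [x_0, x_1, x_2] with x_0 = 1; decision boundary is x_1 + x_2 = 3
on_side_one = predict(theta, [1.0, 2.0, 2.0])   # x_1 + x_2 = 4 >= 3
on_side_zero = predict(theta, [1.0, 1.0, 1.0])  # x_1 + x_2 = 2 <  3
```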

#### Non-linear decision boundaries

- Get logistic regression to fit a complex non-linear data set
  - Like polynomial regression, add higher order terms
  - So say we have
    - h_θ(x) = g(θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_1^2 + θ_4x_2^2)
    - Say θ^T was [-1, 0, 0, 1, 1]; then we predict "y = 1" if
      - -1 + x_1^2 + x_2^2 >= 0, or
      - x_1^2 + x_2^2 >= 1
    - If we plot x_1^2 + x_2^2 = 1
      - This gives us a circle with radius 1 centered on the origin - our decision boundary

By using higher order polynomial terms, we can get even more complex decision boundaries
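The circular boundary above can be sketched in Python (illustrative only):

```python
def predict_circle(x1, x2):
    """theta = [-1, 0, 0, 1, 1] on features [1, x1, x2, x1^2, x2^2]:
    predict y = 1 exactly when x1^2 + x2^2 >= 1 (on or outside the unit circle)."""
    return 1 if -1 + x1 ** 2 + x2 ** 2 >= 0 else 0
```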

## Logistic Regression Model

### Cost Function

Define the optimization objective for the cost function we use to fit the parameters

This is the situation

- Set of m training examples
- Each example is a feature vector which is n+1 dimensional
  - x_0 = 1
- y ∈ {0, 1}
- The hypothesis is based on parameters θ

Linear regression uses the following cost function to determine θ

- J(θ) = (1/2m) Σ_{i=1 to m} (h_θ(x^(i)) - y^(i))^2

We can redefine J(θ) as

- J(θ) = (1/m) Σ_{i=1 to m} Cost(h_θ(x^(i)), y^(i)), where
- Cost(h_θ(x^(i)), y^(i)) = (1/2)(h_θ(x^(i)) - y^(i))^2

What does this actually mean?

- This is the cost you want the learning algorithm to pay if the outcome is h_θ(x) and the actual outcome is y
- If we use this function for logistic regression, it is a **non-convex function** for parameter optimization

What do we mean by non-convex?

- We have some function - J(θ) - for determining the parameters
- Our hypothesis function has a non-linearity (the sigmoid function inside h_θ(x))
  - This makes J(θ) a complicated non-linear function of θ
- If you take h_θ(x), plug it into the Cost() function, then plug the Cost() function into J(θ) and plot J(θ), we find many local optima -> a non-convex function

Why is this a problem?

- Lots of local minima mean gradient descent may not find the global minimum - it may get stuck in a local minimum

We would like a convex function, so if you run gradient descent you converge to the global minimum

#### A convex logistic regression cost function

To get around this we need a different, convex Cost() function which means we can apply gradient descent

- Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1
- Cost(h_θ(x), y) = -log(1 - h_θ(x)) if y = 0

Plot the y = 1 case: Cost = -log(h_θ(x))

So when we're right, the cost function is 0

- Else it slowly increases the cost as we become "more" wrong
- X axis is what we predict
- Y axis is the cost associated with that prediction

This cost function has some interesting properties

If y = 1 and h_θ(x) = 1

- If the hypothesis predicts exactly 1 and that's exactly correct, then that corresponds to a cost of 0 (exactly, not nearly 0)

As h_θ(x) goes to 0

- Cost goes to infinity
- This captures the intuition that if h_θ(x) = 0 (predict P(y=1|x; θ) = 0) but y = 1, this will penalize the learning algorithm with a massive cost

Plot the y = 0 case: Cost = -log(1 - h_θ(x))

- Symmetrically, the cost is 0 when h_θ(x) = 0 and goes to infinity as h_θ(x) approaches 1

### Simplified Cost Function and Gradient Descent

- Define a simpler way to write the cost function and apply gradient descent to the logistic regression

Rather than writing the two cases separately, we can compress them into one equation - more efficient

**Cost(h_θ(x), y) = -y log(h_θ(x)) - (1 - y) log(1 - h_θ(x))**

- When y = 1 the second term is zero, leaving -log(h_θ(x))
- When y = 0 the first term is zero, leaving -log(1 - h_θ(x))

In summary, our cost function can be defined as

- J(θ) = -(1/m) Σ_{i=1 to m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]
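This cost function can be sketched in Python (an illustration, not the course's Octave code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(h, y):
    """-y*log(h) - (1-y)*log(1-h): 0 when the prediction matches the label,
    growing toward infinity when confidently wrong."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def J(theta, X, ys):
    """J(theta) = (1/m) * sum of per-example costs."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += cost(h, y)
    return total / m
```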

Why do we choose this function when other cost functions exist?

- This cost function can be derived from statistics using the principle of maximum likelihood estimation
  - Note this does mean there's an underlying Gaussian assumption relating to the distribution of features
- It also has the nice property that it's convex
To fit parameters θ:

- Find parameters θ which minimize J(θ)
- This means we have a set of parameters to use in our model for future predictions

Then, if we're given some new example with a set of features x, we can take the θ which we generated and output our prediction h_θ(x) = 1/(1 + e^(-θ^T x)) - interpreted as the probability that y = 1

#### How to minimize the logistic regression cost function

- Use gradient descent as before
- Repeatedly update each parameter using a learning rate α
  - θ_j := θ_j - α (1/m) Σ_{i=1 to m} (h_θ(x^(i)) - y^(i)) x_j^(i)

- The equation is the same as the linear regression rule
- The only difference is that our definition for the hypothesis has changed
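One simultaneous update of all parameters might look like this in Python (a sketch; real code would normally be vectorized):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, ys, alpha):
    """One simultaneous update:
    theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij
    Same rule as linear regression, but h is now the sigmoid of theta^T x."""
    m = len(X)
    n = len(theta)
    grad = [0.0] * n
    for x, y in zip(X, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j in range(n):
            grad[j] += (h - y) * x[j]
    return [theta[j] - alpha * grad[j] / m for j in range(n)]
```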

### Advanced Optimization

- What is gradient descent actually doing?
- We have some cost function J(θ), and we want to minimize it
- We need to write code which can take θ as input and compute the following
  - J(θ)
  - The partial derivative of J(θ) with respect to θ_j (for j = 0 to n)

Alternatively, instead of gradient descent to minimize the cost function we could use

- **Conjugate gradient**
- **BFGS** (Broyden-Fletcher-Goldfarb-Shanno)
- **L-BFGS** (Limited memory BFGS)

These are more optimized algorithms which take that same input and minimize the cost function

These are very complicated algorithms

Some properties

- Advantages
  - No need to manually pick α (learning rate)
    - They have a clever inner loop (a line search algorithm) which tries a bunch of α values and picks a good one
  - Often faster than gradient descent
    - They do more than just pick a good learning rate
  - Can be used successfully without understanding their complexity
- Disadvantages
  - Could make debugging more difficult
  - Should not be implemented yourself
  - Different libraries may use different implementations - may affect performance

**Using advanced cost minimization algorithms**

- How to use algorithms
- Say we have the following example: minimize J(θ) = (θ_1 - 5)^2 + (θ_2 - 5)^2

- First we need to define our cost function, which should have the following signature

```matlab
function [jVal, gradient] = costFunction(theta)
```

- Input for the cost function is **theta**, which is a vector of the θ parameters
- The two return values from **costFunction** are
  - jVal
    - How we compute the cost function J(θ)
    - In this case = (θ_1 - 5)^2 + (θ_2 - 5)^2
  - gradient
    - A 2 by 1 vector here - in general an n-dimensional vector
    - The elements are the partial derivative terms
    - Each indexed value gives the partial derivative of J(θ) with respect to θ_i
      - Where i is the index position in the **gradient** vector
- With the cost function implemented, we can call the advanced algorithm using

```matlab
options = optimset('GradObj', 'on', 'MaxIter', '100');  % define the options data structure
initialTheta = zeros(2,1);                              % initial guess
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```

- Here **options** is a data structure giving options for the algorithm
- **fminunc**
  - A function to minimize the cost function (find the minimum of an unconstrained multivariable function)
- **@costFunction** is a pointer to the costFunction function to be used
- For the Octave implementation
  - **initialTheta** must be a matrix of at least two dimensions

- How do we apply this to logistic regression?
  - Here theta is an n+1 dimensional column vector
  - Octave indexes from 1, not 0
- Write a cost function which captures the cost function for logistic regression, returning both J(θ) and the n+1 partial derivative terms
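As a rough Python stand-in for the Octave workflow above (plain gradient descent replaces fminunc here, so this is only a sketch of the costFunction contract, not of an advanced optimizer):

```python
def cost_function(theta):
    """Return (jVal, gradient) for J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2."""
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, gradient

def minimize(cost_fn, theta, alpha=0.1, iters=100):
    """Plain gradient descent standing in for fminunc; a real advanced
    optimizer (BFGS, L-BFGS) would also choose the step size automatically."""
    for _ in range(iters):
        _, grad = cost_fn(theta)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

opt_theta = minimize(cost_function, [0.0, 0.0])
```

The minimum is at θ = (5, 5), where both partial derivatives are zero.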

## Multiclass Classification

### One-vs-all

- Getting logistic regression to work for multiclass classification using **one vs. all**
- Multiclass - more than yes or no (1 or 0)
  - Classification with multiple classes for assignment

**One vs. all classification**

Split the training set into three separate binary classification problems, i.e. create a new "fake" training set for each class

- Triangle (1) vs crosses and squares (0): h_θ^(1)(x)
  - P(y = 1 | x; θ), where "1" means triangle
- Crosses (1) vs triangles and squares (0): h_θ^(2)(x)
  - P(y = 1 | x; θ), where "1" means cross
- Squares (1) vs triangles and crosses (0): h_θ^(3)(x)
  - P(y = 1 | x; θ), where "1" means square

- Overall
  - Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i
  - On a new input *x*, to make a prediction, pick the class i that maximizes h_θ^(i)(x)
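The prediction step can be sketched in Python (illustrative; `thetas` holds one trained parameter vector per class):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Given one trained theta vector per class, pick the class whose
    classifier h_theta^(i)(x) is most confident that y = i."""
    probs = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for theta in thetas]
    return max(range(len(thetas)), key=lambda i: probs[i])
```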

# Regularization

## Solving the Problem of Overfitting

### The Problem of Overfitting

#### Overfitting with linear regression

- Take our house pricing example again
  - Fit a linear function to the data
    - This is **underfitting** - also known as **high bias**
    - "Bias" is a historic/technical term - if we're fitting a straight line to the data, we have a strong preconception that there should be a linear fit
  - Fit a quadratic function
    - Works well
  - Fit a 4th order polynomial
    - Now the curve fits through all five examples
    - Seems to do a good job fitting the training set
    - But, despite fitting the data we've provided very well, this is actually not such a good model
    - This is **overfitting** - also known as **high variance**
  - The algorithm has high variance
    - High variance - if fitting a high order polynomial, then the hypothesis can basically fit any data
    - The space of hypotheses is too large
- To recap, if we have too many features then the learned hypothesis may give a cost function of exactly zero
  - But this tries too hard to fit the training set
  - Fails to provide a general solution - unable to generalize (apply to new examples)

#### Overfitting with logistic regression

- The same thing can happen with logistic regression
  - A sigmoid of a simple linear combination may underfit
  - But a high order polynomial gives an overfit (high variance hypothesis)

#### Addressing overfitting

- Reduce number of features
- Manually select which features to keep
- Model selection algorithms are discussed later (good for reducing number of features)
- But, in reducing the number of features we lose some information
- Ideally select those features which minimize data loss, but even so, some info is lost

- Regularization
- Keep all features, but reduce magnitude of parameters θ
- Works well when we have a lot of features, each of which contributes a bit to predicting y

### Cost function optimization for regularization

- Penalize some of the θ parameters to make them really small
  - e.g. here θ_3 and θ_4

- Modify our cost function to add penalty terms for θ_3 and θ_4, e.g.
  - min (1/2m) [ Σ_{i=1 to m} (h_θ(x^(i)) - y^(i))^2 + 1000 θ_3^2 + 1000 θ_4^2 ]
- So here we end up with θ_3 and θ_4 being close to zero (because the added constants are massive)
  - So we're basically left with a quadratic function

- More generally, regularization works as follows
- Regularization
  - Small values for parameters correspond to a simpler hypothesis (you effectively get rid of some of the terms)
  - A simpler hypothesis is less prone to overfitting
- Another example
  - Have 100 features: x_1, x_2, ..., x_100
  - Unlike the polynomial example, we don't know which are the high order terms
    - How do we pick which parameters to shrink?
  - With regularization, take the cost function and modify it to shrink all the parameters
    - Add a regularization term at the end
      - J(θ) = (1/2m) [ Σ_{i=1 to m} (h_θ(x^(i)) - y^(i))^2 + λ Σ_{j=1 to n} θ_j^2 ]
    - This regularization term shrinks every parameter
    - By convention you don't penalize θ_0 - minimization is from θ_1 onwards

- λ is the regularization parameter
  - Controls a trade-off between our two goals
    - Want to fit the training set well
    - Want to keep the parameters small
- λ should be chosen carefully - if it's too big we penalize all the parameters so heavily that the hypothesis collapses toward h_θ(x) = θ_0 (underfitting)
  - We look at some automatic ways to select λ later in the course
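The regularized linear regression cost can be sketched in Python (illustrative; note θ_0 is deliberately left out of the penalty):

```python
def regularized_cost(theta, X, ys, lam):
    """J(theta) = 1/(2m) * [ sum_i (h(x_i) - y_i)^2 + lam * sum_{j>=1} theta_j^2 ]
    for linear regression; theta_0 is not penalized (theta[1:] below)."""
    m = len(X)
    squared_error = sum(
        (sum(t * xi for t, xi in zip(theta, x)) - y) ** 2
        for x, y in zip(X, ys)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (squared_error + penalty) / (2 * m)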

### Regularized linear regression

Our linear regression cost function with regularization is

- J(θ) = (1/2m) [ Σ_{i=1 to m} (h_θ(x^(i)) - y^(i))^2 + λ Σ_{j=1 to n} θ_j^2 ]

Previously, gradient descent would repeatedly update the parameters θ_j, where j = 0, 1, 2, ..., n, simultaneously

- θ_j := θ_j - α (1/m) Σ_{i=1 to m} (h_θ(x^(i)) - y^(i)) x_j^(i)

With regularization the θ_0 update is kept as above, shown explicitly

- This is because for regularization we don't penalize θ_0, so we treat it slightly differently
- For j = 1 to n, we can show using calculus that the partial derivative of the regularized J(θ) gives the update
  - θ_j := θ_j - α [ (1/m) Σ_{i=1 to m} (h_θ(x^(i)) - y^(i)) x_j^(i) + (λ/m) θ_j ]

So if you group the θ_j terms together

- θ_j := θ_j (1 - α λ/m) - α (1/m) Σ_{i=1 to m} (h_θ(x^(i)) - y^(i)) x_j^(i)

The term (1 - α λ/m)

- Is going to be a number less than 1, usually
  - Usually the learning rate is small and m is large
  - So this typically evaluates to (1 - a small number)
  - So the term is often around 0.99 to 0.95

This in effect means θ_j gets multiplied by, say, 0.99 on every update

- Means θ_j gets a little smaller each iteration
- The second term is exactly the same as the original gradient descent
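The grouped update, with its shrink factor, might look like this (a Python sketch for linear regression):

```python
def regularized_step(theta, X, ys, alpha, lam):
    """theta_0 update is unregularized; for j >= 1,
    theta_j := theta_j * (1 - alpha*lam/m) - alpha * (1/m) * sum_i (h - y) * x_ij
    The factor (1 - alpha*lam/m) is the 'slightly less than 1' shrink term."""
    m = len(X)
    n = len(theta)
    grad = [0.0] * n
    for x, y in zip(X, ys):
        h = sum(t * xi for t, xi in zip(theta, x))   # linear hypothesis
        for j in range(n):
            grad[j] += (h - y) * x[j]
    shrink = 1 - alpha * lam / m
    new_theta = [theta[0] - alpha * grad[0] / m]     # theta_0: no shrink
    new_theta += [theta[j] * shrink - alpha * grad[j] / m for j in range(1, n)]
    return new_theta
```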

### Regularization with the normal equation

- The normal equation is the other way to fit a linear regression model
  - Minimize J(θ) using the normal equation
  - To use regularization we add a term (+ λ · M) inside the inverted expression
    - θ = (X^T X + λ M)^(-1) X^T y
    - M is the (n+1) x (n+1) identity matrix with the top-left entry set to 0, so θ_0 is not regularized
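A numpy sketch of the regularized normal equation (illustrative; `M` is the identity with a zeroed top-left entry):

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lam * M)^-1 X^T y, where M is the (n+1)x(n+1)
    identity with its top-left entry zeroed so theta_0 is not penalized."""
    n_plus_1 = X.shape[1]
    M = np.eye(n_plus_1)
    M[0, 0] = 0.0
    # Solve the linear system rather than forming an explicit inverse
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)
```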

#### Regularization for logistic regression

The logistic regression cost function is as follows:

- J(θ) = -(1/m) Σ_{i=1 to m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]

To regularize it we add an extra term:

- J(θ) = -(1/m) Σ_{i=1 to m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ] + (λ/2m) Σ_{j=1 to n} θ_j^2

The modified gradient descent updates look the same as for regularized linear regression, except h_θ(x) is now the sigmoid hypothesis:

- θ_0 := θ_0 - α (1/m) Σ_{i=1 to m} (h_θ(x^(i)) - y^(i)) x_0^(i)
- θ_j := θ_j - α [ (1/m) Σ_{i=1 to m} (h_θ(x^(i)) - y^(i)) x_j^(i) + (λ/m) θ_j ], for j = 1 to n
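The regularized logistic cost can be sketched as (illustrative Python, not the course's Octave code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_logistic_cost(theta, X, ys, lam):
    """Original logistic cost plus lam/(2m) * sum_{j>=1} theta_j^2;
    theta_0 is excluded from the penalty (theta[1:] below)."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    return total / m + penalty
```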