## Large Margin Classification

### Optimization Objective

#### An alternative view of logistic regression

Begin with logistic regression and see how we can modify it to get the SVM

- With h_θ(x) close to 1, (θ^T x) must be much larger than 0
- With h_θ(x) close to 0, (θ^T x) must be much less than 0

Look at the cost function - each example contributes a term like the one below

If we plug in h_θ(x), we get the cost as a function of z = θ^T x

If y = 1 and we plot the function vs. z, we get the following graph

- So if z is big, the cost is low - this is good!
- But if z is 0 or negative the cost contribution is high
- This is why, when logistic regression sees a positive example, it tries to set θ^T x to be a very large value

If y = 0, we plot the function vs. z we get the following graph

Same deal: if z is small then the cost is low, but if z is large then the cost is massive
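The two per-example cost curves described above are easy to evaluate numerically (a minimal sketch; `logistic_cost` is an illustrative helper name, not from the notes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(z, y):
    # Per-example logistic regression cost as a function of z = theta^T x
    h = sigmoid(z)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# y = 1: large z gives near-zero cost, z <= 0 is heavily penalized
print(logistic_cost(5, 1))    # small
print(logistic_cost(-5, 1))   # large
# y = 0: the mirror image - very negative z is cheap, large z is expensive
print(logistic_cost(-5, 0))   # small
print(logistic_cost(5, 0))    # large
```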

#### SVM cost functions from logistic regression cost function

To build an SVM we must redefine our cost functions

When y = 1

Instead of a curved line, create two straight lines (magenta) which act as an approximation to the logistic regression y = 1 cost function

This means we have two straight lines

- Flat (cost of 0) when z >= 1
- A straight line of growing cost as z decreases below 1

So this is the new y=1 cost function

- Gives the SVM a computational advantage and an easier optimization problem
- We call this function **cost_1(z)**

When y = 0

Do the equivalent with the y=0 function plot

- We call this function **cost_0(z)**
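The two straight-line approximations can be written down directly (a sketch; the function names follow the notes' cost_1/cost_0 naming):

```python
def cost1(z):
    # y = 1 cost: zero once z >= 1, growing linearly as z falls below 1
    return max(0.0, 1.0 - z)

def cost0(z):
    # y = 0 cost: zero once z <= -1, growing linearly as z rises above -1
    return max(0.0, 1.0 + z)

print(cost1(2.0), cost1(0.0))   # 0.0 1.0
print(cost0(-2.0), cost0(0.0))  # 0.0 1.0
```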
#### The complete SVM cost function

#### SVM notation is slightly different

Get rid of the 1/m term

Since 1/m is a constant, minimizing with or without it gives the same optimal θ

For logistic regression we had two terms

Training data set term (i.e. that we sum over m) = A

Regularization term (i.e. that we sum over n) = B

- So we could describe it as A + λB
- Set different values for λ to parametrize this trade-off

Instead of parameterizing this as A + λB

- For SVMs the convention is to use a different parameter called C
- So do CA + B
- If C were equal to 1/λ then the two functions (CA + B and A + λB) would give the same value

Our overall equation is

min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 - y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
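Putting the pieces together, the objective can be evaluated like this (a minimal sketch, assuming each example includes the intercept feature x_0 = 1 and that θ_0 is not regularized):

```python
def cost1(z):
    return max(0.0, 1.0 - z)

def cost0(z):
    return max(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * (training data term A) + (regularization term B)
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * xj for t, xj in zip(theta, x_i))
        data_term += cost1(z) if y_i == 1 else cost0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])  # theta_0 is skipped
    return C * data_term + reg_term

# One positive example classified with margin (z = 2): only the
# regularization term contributes
print(svm_objective([0.0, 1.0], [[1.0, 2.0]], [1], C=1.0))  # 0.5
```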

### Large margin intuition

Some people refer to the SVM as a large margin classifier

- Left is cost_1 and right is cost_0
- Interesting property of the SVM
    - If we have a positive example, we only really need z to be greater than or equal to 0 to predict 1
    - The SVM wants a bit more than that - it doesn't want to just get it right, but to have z be quite a bit bigger than zero (z >= 1 for positive examples, z <= -1 for negative ones)
    - Throws in an extra safety margin factor
- The SVM does the right thing if you use a normal sized C
    - Means we ignore a few outliers
- So the idea of the SVM being a large margin classifier is only really relevant when you have no outliers and easily linearly separable data

### Mathematics Behind Large Margin Classification

#### SVM decision boundary

Our optimization objective can be re-defined using

θ^T x^{(i)} = p^{(i)} · ||θ|| = θ_1 x_1^{(i)} + θ_2 x_2^{(i)}

where p^{(i)} is the (signed) length of the projection of x^{(i)} onto the vector θ

- What does this mean?
The constraints we defined earlier

- (θ^T x^{(i)}) >= 1 if y^{(i)} = 1
- (θ^T x^{(i)}) <= -1 if y^{(i)} = 0

Can be replaced/substituted with the constraints

- p^{(i)} · ||θ|| >= 1 if y^{(i)} = 1
- p^{(i)} · ||θ|| <= -1 if y^{(i)} = 0

Writing that into our optimization objective
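The identity θ^T x^{(i)} = p^{(i)} · ||θ|| can be checked numerically (a sketch; `projection_length` is an illustrative helper):

```python
import math

def projection_length(x, theta):
    # Signed length of the projection of x onto the direction of theta
    dot = sum(t * xi for t, xi in zip(theta, x))
    norm = math.sqrt(sum(t * t for t in theta))
    return dot / norm

theta = [3.0, 4.0]   # ||theta|| = 5
x = [2.0, 1.0]       # theta^T x = 3*2 + 4*1 = 10
p = projection_length(x, theta)
print(p)             # p * ||theta|| recovers theta^T x
```

Since the constraint requires p^{(i)} · ||θ|| >= 1 while the objective shrinks ||θ||, the projections p^{(i)} are forced to be large, which is exactly the large margin.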

## Kernels

### Kernels I

Adapting the SVM to non-linear classifiers: we want to find a non-linear boundary.

Come up with a complex set of polynomial features to fit the data

- Have h_θ(x) which
    - Returns 1 if the combined weighted sum of features (weighted by the parameter vector) is greater than or equal to 0
    - Else returns 0
- Another way of writing this (new notation) is
    - The hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector f, which simply contains the various high order x terms
    - e.g. h_θ(x) = θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3
    - Where
        - f_1 = x_1
        - f_2 = x_1 x_2
        - f_3 = ...
    - i.e. not specific values, but each of the terms from your complex polynomial function
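As a concrete sketch of that notation (the particular polynomial terms chosen here are illustrative, not from the notes):

```python
def poly_features(x1, x2):
    # An f vector built from high order terms of x (illustrative choice)
    return [x1, x1 * x2, x2 ** 2]

def hypothesis(theta, x1, x2):
    # Returns 1 if theta_0 + theta_1*f_1 + theta_2*f_2 + ... >= 0, else 0
    f = poly_features(x1, x2)
    z = theta[0] + sum(t * fi for t, fi in zip(theta[1:], f))
    return 1 if z >= 0 else 0

print(hypothesis([-1.0, 1.0, 0.0, 0.0], 2.0, 0.0))  # z = -1 + 2 >= 0 -> 1
print(hypothesis([-1.0, 1.0, 0.0, 0.0], 0.0, 0.0))  # z = -1 < 0 -> 0
```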

New features

Given x, compute a new feature depending on proximity to landmarks l^{(i)}

f_i = similarity(x, l^{(i)}) = exp( -||x - l^{(i)}||² / (2σ²) )

This similarity function is called the Gaussian kernel
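The Gaussian kernel above is straightforward to implement (a minimal sketch):

```python
import math

def gaussian_kernel(x, l, sigma_sq=1.0):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

print(gaussian_kernel([1.0, 2.0], [1.0, 2.0]))   # 1.0 (x sits on the landmark)
print(gaussian_kernel([0.0, 0.0], [5.0, 5.0]))   # ~0 (x far from the landmark)
```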

#### Diving deeper into the kernel

Say x is close to a landmark - then the squared distance will be ~0, so f_1 is close to e^0 = 1

Say x is far from a landmark - then the squared distance is big, so f_1 is close to 0

Now plot f_1 as a function of x

σ² is a parameter of the Gaussian kernel - the example plots compared σ² = 1, σ² = 0.5, and σ² = 3
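The effect of σ² is easy to confirm with the kernel function (a sketch; at a fixed distance from the landmark, larger σ² keeps the feature closer to 1):

```python
import math

def gaussian_kernel(x, l, sigma_sq):
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

x, l = [1.0, 0.0], [0.0, 0.0]   # squared distance of 1 from the landmark
for sigma_sq in (0.5, 1.0, 3.0):
    # smaller sigma^2 -> sharper falloff around the landmark
    print(sigma_sq, gaussian_kernel(x, l, sigma_sq))
```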

### Kernels II

#### Choosing the landmarks

- Take the training data
- For each example place a landmark at exactly the same location
- So we end up with m landmarks
    - One landmark per training example
    - Means our features measure how close an example is to something in the training set
- Given a new example, compute all the f values
    - Gives you a feature vector f (f_0 to f_m)
    - f_0 = 1 always

A more detailed look at generating the f vector

For a training example (x^{(i)}, y^{(i)}), we cycle through each landmark, calculating how close x^{(i)} is to that landmark

- f_1^{(i)} = k(x^{(i)}, l^{(1)})
- f_2^{(i)} = k(x^{(i)}, l^{(2)})
- ...
- f_m^{(i)} = k(x^{(i)}, l^{(m)})
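The landmark / f-vector procedure can be sketched as (reuses the Gaussian kernel from earlier; names are illustrative):

```python
import math

def gaussian_kernel(x, l, sigma_sq=1.0):
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

def feature_vector(x, landmarks, sigma_sq=1.0):
    # f_0 = 1, then one Gaussian-kernel feature per landmark
    return [1.0] + [gaussian_kernel(x, l, sigma_sq) for l in landmarks]

# Landmarks are placed on the training examples themselves
X_train = [[0.0, 0.0], [5.0, 5.0]]
f = feature_vector([0.0, 0.0], X_train)
print(f)  # close to the first landmark (f_1 = 1), far from the second (f_2 ~ 0)
```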

#### SVM parameters (C)

Bias and variance trade off - we must choose C

- C plays a role similar to 1/λ (where λ is the regularization parameter)
- Large C gives a hypothesis of **low bias, high variance** --> risk of overfitting
- Small C gives a hypothesis of **high bias, low variance** --> risk of underfitting

#### SVM parameters (σ²)

Parameter for calculating the f values

- Large σ² - f features vary more smoothly - higher bias, lower variance
- Small σ² - f features vary abruptly - lower bias, higher variance

## SVMs in Practice

### Using An SVM

#### Choosing a kernel

- Gaussian kernel
    - Need to define σ²
    - Good if n is small or m is large
    - Implement the kernel function - f_i = kernel(x, l^{(i)}) returns a real number
    - Gaussian is probably the most popular kernel
    - Make sure you perform **feature scaling** before using a Gaussian kernel
- Linear kernel (i.e. no kernel)
    - No f vector - predict y = 1 if (θ^T x) >= 0
    - Gives a standard linear classifier
    - Good if n is large and m is small
        - Lots of features, few examples
        - Not enough data - risk of overfitting in a high dimensional feature space
- Other choices
    - Not all similarity functions you develop are valid kernels
        - Must satisfy **Mercer's Theorem**
        - SVMs use numerical optimization tricks - certain optimizations can only be made if the kernel follows the theorem
    - **Polynomial kernel**
        - We measure the similarity of x and l by doing one of
            - (x^T l)²
            - (x^T l)³
            - (x^T l + 1)³
        - General form is (x^T l + Con)^D
        - If they're similar then the inner product tends to be large
        - Two parameters - the degree of the polynomial (D) and the number added to the inner product (Con)
        - Not used that often - usually performs worse than the Gaussian kernel
        - Used when x and l are both non-negative
    - **String kernel**
        - Used if input is text strings - use for text classification
    - Chi-squared kernel
    - Histogram intersection kernel
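In practice you would hand the kernel choice to a library rather than hand-rolling it. A minimal sketch with scikit-learn's `SVC` (an assumption about tooling - the notes don't name a package; note scikit-learn's `gamma` corresponds to 1/(2σ²) for the Gaussian/RBF kernel):

```python
from sklearn.svm import SVC

# XOR-style data: not linearly separable, so a Gaussian (RBF) kernel helps
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = SVC(kernel="rbf", C=100.0, gamma=1.0)  # gamma = 1 / (2 * sigma^2)
clf.fit(X, y)
print(list(clf.predict(X)))  # fits the training set
```

A linear kernel here would be `SVC(kernel="linear")`, matching the "no kernel" option above.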

#### Multi-class classification for SVM

- Many packages have built-in multi-class classification support
- Otherwise use the one-vs-all method
- Not a big issue

#### Logistic regression vs. SVM

- If n (features) is large vs. m (training set size)
    - e.g. a text classification problem
        - Feature vector dimension is 10,000
        - Training set size is 10 - 1,000
    - Then use logistic regression or an SVM with a linear kernel
- If n is small and m is intermediate
    - n = 1 - 1,000
    - m = 10 - 10,000
    - A Gaussian kernel is good
- If n is small and m is large
    - n = 1 - 1,000
    - m = 50,000+
    - An SVM will be slow to run with a Gaussian kernel
    - In that case
        - Manually create or add more features
        - Use logistic regression or an SVM with a linear kernel
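The guidance above can be condensed into a small helper (purely illustrative; the thresholds are the rough ranges from the notes, not hard rules):

```python
def suggest_model(n, m):
    # n = number of features, m = number of training examples
    if n >= m:
        # many features relative to examples
        return "logistic regression or SVM with a linear kernel"
    if n <= 1000 and m <= 10000:
        # n small, m intermediate
        return "SVM with a Gaussian kernel"
    # n small, m very large: Gaussian-kernel SVMs get slow
    return "add features, then logistic regression or linear-kernel SVM"

print(suggest_model(10000, 500))
print(suggest_model(100, 5000))
print(suggest_model(100, 100000))
```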