# Machine Learning Week 7: Support Vector Machines

## Large Margin Classification

### Optimization Objective

#### An alternative view of logistic regression

• Begin with logistic regression and see how we can modify it to get the SVM
• With hθ(x) close to 1, (θT x) must be much larger than 0
• With hθ(x) close to 0, (θT x) must be much less than 0
• Look at the cost function - each example contributes a term like the one below
• -(y log(hθ(x)) + (1 - y) log(1 - hθ(x)))
• If we plug in hθ(x) = 1/(1 + e^(-θT x)), we can plot this cost as a function of z = θT x
• If y = 1, plotting the cost vs. z gives the following behaviour
• If z is big, the cost is low - this is good!
• But if z is 0 or negative the cost contribution is high
• This is why, when logistic regression sees a positive example, it tries to set θT x to be a very large term
• If y = 0, plotting the cost vs. z gives the mirrored behaviour - if z is very negative the cost is low, but if z is large the cost is massive
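To make these two curves concrete, here is a minimal sketch (not from the lecture; the function name is illustrative) that evaluates the per-example logistic cost as a function of z = θT x:

```python
import numpy as np

def logistic_cost(z, y):
    """Per-example logistic regression cost as a function of z = theta' * x."""
    h = 1.0 / (1.0 + np.exp(-z))          # sigmoid hypothesis h_theta(x)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

z = np.array([-3.0, 0.0, 3.0])
print(logistic_cost(z, y=1))   # cost falls as z grows: ~3.05, 0.69, ~0.05
print(logistic_cost(z, y=0))   # cost grows as z grows: ~0.05, 0.69, ~3.05
```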

#### SVM cost functions from logistic regression cost function

To build a SVM we must redefine our cost functions

• When y = 1

• Instead of a curved line, create two straight line segments (magenta in the lecture plots) which act as an approximation to the logistic regression y = 1 cost
• This means we have two straight lines

• Flat (cost = 0) when z >= 1
• A straight line that grows as z decreases below 1
• So this is the new y=1 cost function

• Gives the SVM a computational advantage and an easier optimization problem
• We call this function cost1(z)
• When y = 0

• Do the equivalent with the y = 0 cost plot - flat (cost = 0) when z <= -1, with a straight line growing as z increases above -1
• We call this function cost0(z)
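The lecture defines cost1 and cost0 only by their plots; a minimal sketch of the flat-then-linear shape, assuming the common hinge form with unit slope (the lecture does not pin the slope down), is:

```python
import numpy as np

def cost1(z):
    """SVM cost for y = 1: zero when z >= 1, grows linearly as z falls below 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM cost for y = 0: zero when z <= -1, grows linearly as z rises above -1."""
    return np.maximum(0.0, 1.0 + z)

z = np.array([-2.0, 0.0, 2.0])
print(cost1(z))   # [3. 1. 0.]
print(cost0(z))   # [0. 1. 3.]
```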

#### The complete SVM cost function

#### SVM notation is slightly different

• Get rid of the 1/m term

• Since 1/m is a constant, we get the same optimal values of θ with or without it

• For logistic regression we had two terms

• Training data set term (i.e. that we sum over m) = A

• Regularization term (i.e. that we sum over n) = B

• So we could describe it as A + λB
• Set different values for λ to parametrize this trade-off
• Instead of parameterizing this as A + λB

• For SVMs the convention is to use a different parameter called C
• So do CA + B
• If C were equal to 1/λ then the two functions (CA + B and A + λB) would give the same value
• Our overall equation is: min over θ of C * Σ(i=1 to m) [ y(i) cost1(θT x(i)) + (1 - y(i)) cost0(θT x(i)) ] + (1/2) Σ(j=1 to n) θj²
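As a minimal sketch of that objective (illustrative names, and assuming the unit-slope cost1/cost0 above):

```python
import numpy as np

def svm_objective(theta, X, y, C):
    """SVM training objective: C * (data cost) + 0.5 * sum of theta_j^2 (j >= 1).

    X is m x (n+1) with a leading column of ones; y contains 0/1 labels.
    """
    z = X @ theta                                    # z(i) = theta' * x(i)
    cost1 = np.maximum(0.0, 1.0 - z)                 # cost when y = 1
    cost0 = np.maximum(0.0, 1.0 + z)                 # cost when y = 0
    data_term = np.sum(y * cost1 + (1 - y) * cost0)  # A: summed over the m examples
    reg_term = 0.5 * np.sum(theta[1:] ** 2)          # B: theta_0 is not regularized
    return C * data_term + reg_term                  # C*A + B
```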

### Large Margin Intuition

Some people refer to the SVM as a large margin classifier

• On the left is cost1(z) and on the right is cost0(z)
• Interesting property of SVM
• If we have a positive example, we only really need z to be greater or equal to 0
• If this is the case then we predict 1
• The SVM wants a bit more than that - it doesn't want to just get the example right, it wants z to be quite a bit bigger than zero (z >= 1 for positive examples and z <= -1 for negative examples)
• This throws in an extra safety margin factor
• The SVM still does the right thing if you use a normal (not overly large) value of C
• So the idea of the SVM being a large margin classifier is only really relevant when you have no outliers and the data is easily linearly separable
• With a sensible C we can ignore a few outliers
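A small illustrative sketch of the difference between the prediction rule (z >= 0) and the training-time safety margin (z >= 1 or z <= -1):

```python
import numpy as np

def predict(theta, x):
    """At prediction time, z = theta' * x >= 0 is enough to output 1."""
    return 1 if theta @ x >= 0 else 0

def satisfies_margin(theta, x, y):
    """During training the SVM asks for an extra safety margin:
    z >= 1 for positive examples, z <= -1 for negative examples."""
    z = theta @ x
    return z >= 1 if y == 1 else z <= -1

theta = np.array([0.0, 1.0, 1.0])
x = np.array([1.0, 0.3, 0.3])          # here z = 0.6
print(predict(theta, x))               # 1 - classified correctly
print(satisfies_margin(theta, x, 1))   # False - correct, but without the safety margin
```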

### Mathematics Behind Large Margin Classification

#### SVM decision boundary

Our optimization objective can be re-written in terms of p(i), the length of the projection of x(i) onto the vector θ:

• θT x(i) = p(i) * ||θ|| = θ1*x1(i) + θ2*x2(i)

• What does this mean?
• The constraints we defined earlier

• (θT x(i)) >= 1 if y(i) = 1
• (θT x(i)) <= -1 if y(i) = 0
• Can be replaced/substituted with the constraints

• p(i) * ||θ|| >= 1 if y(i) = 1
• p(i) * ||θ|| <= -1 if y(i) = 0
• Writing that into our optimization objective: minimize (1/2) Σ(j=1 to n) θj², subject to p(i) * ||θ|| >= 1 when y(i) = 1 and p(i) * ||θ|| <= -1 when y(i) = 0 (a small sketch of p(i) follows)
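A minimal sketch (illustrative names, and assuming θ0 = 0 as in the lecture's simplification) of computing the projection p(i) and checking the margin constraint:

```python
import numpy as np

def projection(theta, x):
    """Length of the projection of x onto the vector theta (can be negative)."""
    return (theta @ x) / np.linalg.norm(theta)

theta = np.array([1.0, 1.0])
x_pos = np.array([2.0, 2.0])                 # a positive example
p = projection(theta, x_pos)                 # p(i)
print(p * np.linalg.norm(theta))             # equals theta' * x = 4.0
print(p * np.linalg.norm(theta) >= 1)        # margin constraint for y = 1: True
```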

## Kernels

### Kernels I

To adapt the SVM to non-linear classifiers, we want to find a non-linear decision boundary.

Come up with a complex set of polynomial features to fit the data

• Have hθ(x) which
• Returns 1 if the combined weighted sum of features (weighted by the parameter vector) is greater than or equal to 0
• Else return 0
• Another way of writing this (new notation) is
• That a hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector f, which simply contains the various high order x terms
• e.g.
• hθ(x) = θ0+ θ1f1+ θ2f2 + θ3f3
• Where
• f1= x1
• f2 = x1x2
• f3 = ...
• i.e. not specific values, but each of the terms from your complex polynomial function
• New features

• Given x, compute a new feature fi depending on proximity to landmark l(i)
• fi = similarity(x, l(i)) = exp(- ||x - l(i)||² / (2σ²))
• This similarity function is called the Gaussian kernel
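A minimal sketch of the Gaussian kernel (illustrative names), which also shows how σ² controls how quickly the similarity falls off:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq):
    """f = exp(-||x - l||^2 / (2 * sigma^2)): 1 at the landmark, -> 0 far away."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma_sq))

x_close = np.array([1.0, 1.0])
x_far = np.array([5.0, 5.0])
landmark = np.array([1.0, 1.2])

print(gaussian_kernel(x_close, landmark, sigma_sq=1.0))  # ~0.98, close to 1
print(gaussian_kernel(x_far, landmark, sigma_sq=1.0))    # ~0, far from the landmark
print(gaussian_kernel(x_far, landmark, sigma_sq=3.0))    # larger sigma^2 decays more slowly
```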

#### Diving deeper into the kernel

• Say x is close to a landmark - then the squared distance will be ~ 0, so f1 is close to e^0 = 1

• Say x is far from a landmark - then the squared distance is big, so f1 is close to e^(-large number) = 0

Now plot f1 as a function of x (for a fixed landmark):

• σ² is a parameter of the Gaussian kernel
• The lecture shows this plot for σ² = 1, σ² = 0.5 and σ² = 3 - the smaller σ² is, the more sharply f1 falls towards 0 as x moves away from the landmark

### Kernels II

#### Choosing the landmarks

• Take the training data
• For each example place a landmark at exactly the same location
• So end up with m landmarks
• One landmark at the location of each training example
• Means our features measure how close to a training set example something is
• Given a new example, compute all the f values
• Gives you a feature vector f (f0 to fm)
• f0 = 1 always

A more detailed look at generating the f vector

• If we had a training example (x(i), y(i)), the features we compute for it use x(i)
• So we just cycle through each landmark, calculating how close x(i) actually is to that landmark
• f1(i) = k(x(i), l(1))
• f2(i) = k(x(i), l(2))
• ...
• fm(i) = k(x(i), l(m))
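Putting the last two lists together, a minimal sketch (illustrative names) of mapping a training set to its kernel feature vectors, with one landmark per training example and f0 = 1:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq):
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma_sq))

def kernel_features(X, landmarks, sigma_sq):
    """Map each row of X to [1, f1, ..., fm] using the given landmarks."""
    m = X.shape[0]
    F = np.ones((m, landmarks.shape[0] + 1))       # column 0 holds f0 = 1
    for i in range(m):
        for j, l in enumerate(landmarks):
            F[i, j + 1] = gaussian_kernel(X[i], l, sigma_sq)
    return F

X = np.array([[1.0, 1.0], [2.0, 0.5], [5.0, 5.0]])
F = kernel_features(X, landmarks=X, sigma_sq=1.0)  # landmarks = training examples
print(F.shape)                                     # (3, 4): m examples, 1 + m features
```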

#### SVM parameters (C)

• Bias and variance trade off
• Must choose C
• C plays a role similar to 1/λ (where λ is the regularization parameter)
• Large C gives a hypothesis of low bias high variance --> overfitting
• Small C gives a hypothesis of high bias low variance --> underfitting

#### SVM parameters (σ²)

Parameter for calculating f values

• Large σ² - f features vary more smoothly - higher bias, lower variance
• Small σ² - f features vary abruptly - low bias, high variance

## SVMs in Practice

### Using An SVM

#### Choosing a kernel

• Gaussian kernel
• need to define σ²
• if n is small or m is large
• implement the kernel function
• f1 = kernel(x1, x2), which returns a real number
• Gaussian is probably most popular kernel
• make sure you perform feature scaling before using a Gaussian kernel
• Linear kernel
• Predict y = 1 if (θT x) >= 0
• So no f vector
• Get a standard linear classifier
• if n is large and m is small
• Lots of features, few examples
• Not enough data - risk overfitting in a high dimensional feature-space
• Other choice
• Not all similarity functions you develop are valid kernels
• Must satisfy Mercer's Theorem
• SVM packages use numerical optimization tricks
• This means certain optimizations can be made, but the similarity function must satisfy the theorem
• Polynomial Kernel
• We measure the similarity of x and l by doing one of
• (xT l)2
• (xT l)3
• (xT l+1)3
• General form is
• (xT l+Con)D
• If they're similar then the inner product tends to be large
• Not used that often
• Two parameters
• Degree of polynomial (D)
• The constant added to xT l (Con)
• Usually performs worse than the Gaussian kernel
• Used when x and l are both non-negative
• String kernel
• Used if input is text strings
• Use for text classification
• Chi-squared kernel
• Histogram intersection kernel
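The practical advice here is to call a mature SVM package rather than write your own optimizer. As a hedged illustration (scikit-learn is an assumption on my part, not something this section names), choosing a linear vs. Gaussian kernel could look like the sketch below, with feature scaling applied first and gamma playing the role of 1/(2σ²):

```python
# A sketch using scikit-learn (an assumption - the notes only say "use a good SVM package").
# For the RBF (Gaussian) kernel, sklearn's gamma corresponds to 1 / (2 * sigma^2).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.RandomState(0).randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # a non-linear boundary

X_scaled = StandardScaler().fit_transform(X)         # feature scaling before a Gaussian kernel

linear_svm = SVC(kernel="linear", C=1.0).fit(X_scaled, y)
gaussian_svm = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * 0.5)).fit(X_scaled, y)  # sigma^2 = 0.5

print(linear_svm.score(X_scaled, y), gaussian_svm.score(X_scaled, y))
```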

#### Multi-class classification for SVM

• Many packages have built in multi-class classification packages
• Otherwise use one-vs all method
• Not a big issue
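If a package did not handle multi-class directly, a minimal one-vs-all sketch (illustrative, reusing the scikit-learn classifier from the previous example) trains one binary classifier per class and picks the most confident one:

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_all_train(X, y, C=1.0):
    """Train one binary SVM per class: class k vs. everything else."""
    return {k: SVC(kernel="linear", C=C).fit(X, (y == k).astype(int)) for k in np.unique(y)}

def one_vs_all_predict(models, X):
    """Pick the class whose classifier reports the largest decision value."""
    classes = sorted(models)
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

X = np.random.RandomState(1).randn(60, 2)
y = np.repeat([0, 1, 2], 20)
print(one_vs_all_predict(one_vs_all_train(X, y), X[:5]))
```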

#### Logistic regression vs. SVM

• If n (features) is large relative to m (training set size)
• e.g. text classification problem
• Feature vector dimension is 10 000
• Training set is 10 - 1000
• Then use logistic regression or SVM with a linear kernel
• If n is small and m is intermediate
• n = 1 - 1000
• m = 10 - 10 000
• Gaussian kernel is good
• If n is small and m is large
• n = 1 - 1000
• m = 50 000+
• SVM will be slow to run with Gaussian kernel
• In that case
• Manually create or add more features
• Use logistic regression or SVM with a linear kernel
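These heuristics are simple enough to encode directly; a hypothetical helper (the numeric thresholds are my own reading of the ranges above) might look like:

```python
def suggest_model(n_features, m_examples):
    """Rough guidance from the n-vs-m heuristics above (illustrative thresholds)."""
    if n_features >= 10 * m_examples:
        return "logistic regression, or SVM with a linear kernel"
    if m_examples <= 10_000:
        return "SVM with a Gaussian kernel"
    # n small, m very large: add features, then use a linear model
    return "create more features, then logistic regression or SVM with a linear kernel"

print(suggest_model(n_features=10_000, m_examples=500))   # text-classification-like case
print(suggest_model(n_features=100, m_examples=5_000))    # Gaussian kernel case
```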