Machine Learning Week 7: Support Vector Machines

Large Margin Classification

Optimization Objective

An alternative view of logistic regression

  • Begin with logistic regression, see how we can modify it to get the SVM

    • With hθ(x) close to 1, (θT x) must be much larger than 0
    • With hθ(x) close to 0, (θT x) must be much less than 0
  • Look at the cost function; each training example contributes a term like the one below

    -(y log hθ(x) + (1 - y) log(1 - hθ(x)))

    If we plug in hθ(x) = 1/(1 + e-θT x), we get

    -(y log(1/(1 + e-θT x)) + (1 - y) log(1 - 1/(1 + e-θT x)))

  • If y = 1, we plot the function vs. z we get the following graph

    • So if z is big, the cost is low - this is good!
    • But if z is 0 or negative the cost contribution is high
    • This is why, when logistic regression sees a positive example, it tries to make θT x a very large value
  • If y = 0, we plot the function vs. z we get the following graph

    Same deal: if z is small (very negative) then the cost is low. But if z is large then the cost is massive

SVM cost functions from logistic regression cost function

To build a SVM we must redefine our cost functions

  • When y = 1

    • Instead of a curved line, create two straight lines (magenta) which act as an approximation to the logistic regression y = 1 cost function

    • This means we have two straight lines

      • Flat at zero when z >= 1
      • A straight line that grows as z decreases below 1
    • So this is the new y = 1 cost function

      • Gives the SVM a computational advantage and an easier optimization problem
      • We call this function cost1(z)
  • When y = 0

    • Do the equivalent with the y=0 function plot

      • We call this function cost0(z)
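These two piecewise-linear costs can be sketched in a few lines of numpy (my own illustration of the hinge shape described above, not code from the course):

```python
import numpy as np

def cost1(z):
    """Cost for y = 1: zero once z >= 1, growing linearly as z falls below 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost for y = 0: zero once z <= -1, growing linearly as z rises above -1."""
    return np.maximum(0.0, 1.0 + z)
```

Both are flat (zero) on the "safe" side of the margin and linear on the other, which is what makes the SVM's optimization problem easier than logistic regression's smooth curve.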

The complete SVM cost function

SVM notation is slightly different

  • Get rid of the 1/m term

  • Since 1/m is a positive constant, minimizing with or without it gives the same optimal θ

  • For logistic regression we had two terms

    • Training data set term (i.e. that we sum over m) = A

    • Regularization term (i.e. that we sum over n) = B

      • So we could describe it as A + λB
      • Set different values for λ to parametrize this trade-off
    • Instead of parameterization this as A + λB

      • For SVMs the convention is to use a different parameter called C
      • So do CA + B
      • If C were equal to 1/λ then the two objectives (CA + B and A + λB) would give the same optimal θ
    • Our overall equation is

      minθ C Σi=1..m [y(i) cost1(θT x(i)) + (1 - y(i)) cost0(θT x(i))] + (1/2) Σj=1..n θj2
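A minimal numpy sketch of this CA + B objective (an illustration under the conventions above - y in {0, 1}, a leading bias column in X, and θ0 excluded from regularization, as in regularized logistic regression):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """SVM objective C*A + B.
    X: (m, n+1) with a leading column of ones; y: (m,) in {0, 1}."""
    z = X @ theta
    # A: sum of per-example hinge costs (cost1 for y = 1, cost0 for y = 0)
    A = np.sum(y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z))
    # B: regularization term, skipping the bias theta_0
    B = 0.5 * np.sum(theta[1:] ** 2)
    return C * A + B
```

With C = 1/λ this traces out the same trade-off that λ controls in the A + λB form.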

Large margin intuition

Some people refer to the SVM as a large margin classifier

  • Left is cost1 and right is cost0
  • Interesting property of the SVM
    • If we have a positive example, we only really need z to be greater than or equal to 0
      • If this is the case then we predict 1
    • The SVM wants a bit more than that - it doesn't want to just get it right, but to have z be quite a bit bigger than zero (z >= 1 for positives, z <= -1 for negatives)
      • Throws in an extra safety margin factor
  • With a sensibly sized C the SVM still does the right thing when outliers are present
    • The picture of the SVM as a large margin classifier is only really accurate when the data is easily linearly separable and has no outliers
    • A moderate C means we ignore a few outliers rather than warping the boundary around them

Mathematics Behind Large Margin Classification

SVM decision boundary

Our optimization function can be re-defined as

  • θT x(i) = p(i) * ||θ|| = θ1x1(i) + θ2x2(i), where p(i) is the (signed) projection of x(i) onto the vector θ

  • What does this mean?
    • The constraints we defined earlier

      • (θT x) >= 1 if y = 1
      • (θT x) <= -1 if y = 0
    • Can be replaced/substituted with the constraints

      • p(i) * ||θ|| >= 1 if y = 1
      • p(i) * ||θ|| <= -1 if y = 0
    • Writing that into our optimization objective


Kernels I

Adapting SVM to non-linear classifiers, we want to find a non-linear boundary.

Come up with a complex set of polynomial features to fit the data

  • Have hθ(x) which
    • Returns 1 if the combined weighted sum of features (weighted by the parameter vector) is greater than or equal to 0
    • Else returns 0
  • Another way of writing this (new notation) is
    • That a hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector f, which simply contains the various high order x terms
    • e.g.
      • hθ(x) = θ0+ θ1f1+ θ2f2 + θ3f3
      • Where
        • f1= x1
        • f2 = x1x2
        • f3 = ...
        • i.e. not specific values, but each of the terms from your complex polynomial function
  • New features

    • Given x, compute new feature depending on proximity to landmarks l(i)

      • fi = similarity(x, l(i)) = exp(-||x - l(i)||2 / (2σ2))

      • This similarity function is called the Gaussian kernel
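The Gaussian kernel above is straightforward to write out (a sketch of the formula as defined, using my own function name):

```python
import numpy as np

def gaussian_kernel(x, l, sigma2=1.0):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))
```

Note the value is always in (0, 1]: exactly 1 when x sits on the landmark, shrinking toward 0 as x moves away.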

Diving deeper into the kernel

  • Say x is close to a landmark, then the squared distance will be ~ 0

    f1 is close to e0 = 1

  • Say x is far from a landmark, then the squared distance is big

    f1 is close to e-(big number), i.e. close to 0

Now plot the kernel function f1 as a function of x

  • σ2 is a parameter of the Gaussian kernel

    • With σ2 = 1 (the example above) we get a bump centered on the landmark
    • With σ2 = 0.5 the bump is narrower - features fall to zero faster as you move away from the landmark
    • With σ2 = 3 the bump is wider - features fall to zero more slowly

Kernels II

Choosing the landmarks

  • Take the training data
  • For each example place a landmark at exactly the same location
  • So end up with m landmarks
    • One landmark at the location of each training example
    • Means our features measure how close an example is to something in the training set
  • Given a new example, compute all the f values
    • Gives you a feature vector f (f0 to fm)
      • f0 = 1 always

A more detailed look at generating the f vector

  • If we had a training example (x(i), y(i)), the features we compute for it use x(i)

    • So we just cycle through each landmark, calculating how close x(i) actually is to that landmark
      • f1(i) = k(x(i), l(1))
      • f2(i) = k(x(i), l(2))
      • ...
      • fm(i) = k(x(i), l(m))
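Building the f vector for one example, using the training examples themselves as landmarks, looks like this (a sketch under the conventions above - Gaussian similarity and f0 = 1):

```python
import numpy as np

def feature_vector(x, landmarks, sigma2=1.0):
    """Map x to f = [1, f1, ..., fm], one Gaussian similarity per landmark."""
    f = [1.0]  # f0 = 1 always
    for l in landmarks:
        f.append(np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2)))
    return np.array(f)
```

For a training example x(i), one entry of its own f vector is exactly 1, since x(i) coincides with its own landmark.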

SVM parameters (C)

  • Bias and variance trade off
  • Must choose C
    • C plays a role similar to 1/λ (where λ is the regularization parameter)
  • Large C gives a hypothesis of low bias high variance --> overfitting
  • Small C gives a hypothesis of high bias low variance --> underfitting

SVM parameters (σ2)

Parameter for calculating f values

  • Large σ2 - f features vary more smoothly - higher bias, lower variance
  • Small σ2 - f features vary abruptly - low bias, high variance

SVMs in Practice

Using An SVM

Choosing a kernel

  • Gaussian kernel
    • need to define σ2
    • use if n is small and/or m is large
    • implement the kernel function
      • f1 = kernel(x1, x2) returns a real number
      • Gaussian is probably the most popular kernel
    • make sure you perform feature scaling before using a Gaussian kernel
  • Linear kernel
    • Predict y = 1 if (θT x) >= 0
      • So no f vector
      • Get a standard linear classifier
    • if n is large and m is small
      • Lots of features, few examples
      • Not enough data - risk overfitting in a high dimensional feature-space
  • Other choice
    • Not all similarity functions you develop are valid kernels
      • Must satisfy Mercer's Theorem
      • SVM packages use numerical optimization tricks
        • These optimizations are only valid if the similarity function satisfies the theorem
    • Polynomial Kernel
      • We measure the similarity of x and l by doing one of
        • (xT l)2
        • (xT l)3
        • (xT l+1)3
      • General form is
        • (xT l + Con)D
      • If they're similar then the inner product tends to be large
      • Not used that often
      • Two parameters
        • Degree of polynomial (D)
        • Constant added to the inner product (Con)
      • Usually performs worse than the Gaussian kernel
      • Used when x and l are both non-negative
    • String kernel
      • Used if input is text strings
      • Use for text classification
    • Chi-squared kernel
    • Histogram intersection kernel
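The polynomial kernel described above is simple to write down (a sketch of the general (xT l + Con)D form, with my own parameter names):

```python
import numpy as np

def polynomial_kernel(x, l, con=1.0, degree=3):
    """(x^T l + Con)^D: large when x and l point in similar directions."""
    return (x @ l + con) ** degree
```

With con = 0 and degree = 1 this degenerates to the plain inner product, i.e. the linear kernel.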

Multi-class classification for SVM

  • Many packages have built-in multi-class classification support
  • Otherwise use the one-vs-all method
  • Not a big issue
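Assuming you have already trained one classifier per class, the one-vs-all prediction step is just "pick the class with the highest score" (a sketch, not code from the course):

```python
import numpy as np

def ova_predict(X, thetas):
    """One-vs-all: pick the class whose classifier gives the largest theta^T x.
    thetas: (K, n+1) stacked parameter vectors; X: (m, n+1) with a bias column."""
    scores = X @ thetas.T   # (m, K) matrix of theta_k^T x for each example
    return np.argmax(scores, axis=1)
```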

Logistic regression vs. SVM

  • If n (features) is large vs. m (training set)
    • e.g. text classification problem
      • Feature vector dimension is 10 000
      • Training set is 10 - 1000
      • Then use logistic regression or SVM with a linear kernel
  • If n is small and m is intermediate
    • n = 1 - 1000
    • m = 10 - 10 000
    • Gaussian kernel is good
  • If n is small and m is large
    • n = 1 - 1000
    • m = 50 000+
      • SVM will be slow to run with Gaussian kernel
    • In that case
      • Manually create or add more features
      • Use logistic regression or SVM with a linear kernel

