Machine Learning Week2 Linear Regression with Multiple Variables

Multivariate Linear Regression

Multiple Features

  • Multiple variables = multiple features
  • Suppose we now have more features for predicting the price of a house
    • x1, x2, x3, x4 are the four features
      • x1 - size (feet squared)
      • x2 - Number of bedrooms
      • x3 - Number of floors
      • x4 - Age of home (years)
    • y is the output variable (price)
  • notation
    • n
      • number of features (n = 4)
    • m
      • number of examples (i.e. number of rows in a table)
    • x^(i)
      • vector of the inputs for the ith example (so a vector of the four features for the ith training example)
      • i is an index into the training set
      • So
        • x^(i) is an n-dimensional feature vector
        • x^(3) is, for example, the 3rd house, and contains the four features associated with that house
    • x_j^(i)
      • The value of feature j in the ith training example
      • So
        • x_2^(3) is, for example, the number of bedrooms in the third house
  • Now hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
    • For convenience of notation, x0 = 1
      • Every example i gets an additional 0th feature, x_0^(i) = 1
      • So now each feature vector is an n + 1 dimensional vector indexed from 0
        • This is a column vector called x
        • Each example has a column vector associated with it
        • So let's say we have a new example with feature vector x
    • Parameters are also in a 0 indexed n+1 dimensional vector
      • This is also a column vector called θ
    • Considering this, hypothesis can be written as
      • hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
    • In vector form, hθ(x) = θ^T x
      • θ^T is a [1 x n+1] matrix
      • In other words, θ is a column vector, and its transpose is a row vector
      • So hθ(x) = θ^T x
        • [1 x n+1] * [n+1 x 1] gives a [1 x 1] result, i.e. a single number
    • This is an example of multivariate linear regression
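The vectorized hypothesis above can be sketched in a few lines of plain Python. This is only an illustration: the function name `hypothesis` and the parameter and feature values are made up for the example, not taken from the course data.

```python
def hypothesis(theta, features):
    """Return h_theta(x) = theta^T x for one example.

    `features` holds x1..xn; the constant x0 = 1 is prepended here.
    """
    x = [1.0] + list(features)  # x0 = 1, so x has n + 1 entries
    if len(theta) != len(x):
        raise ValueError("theta must have n + 1 entries")
    # Dot product of the [1 x n+1] row theta^T with the [n+1 x 1] column x
    return sum(t * xj for t, xj in zip(theta, x))


# Hypothetical parameters and one house: size, bedrooms, floors, age
theta = [80.0, 0.1, 10.0, 3.0, -2.0]
house = [2104, 5, 1, 45]
price = hypothesis(theta, house)
```

Prepending x0 inside the function keeps callers from having to remember the extra feature; storing it in the data instead is an equally valid choice.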

Gradient Descent for Multiple Variables

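  • Our cost function is

    • J(θ0, θ1, ..., θn) = (1/(2m)) Σ_{i=1..m} (hθ(x^(i)) - y^(i))^2

  • Similarly, instead of treating J as a function of n + 1 separate numbers, J(θ) is a function of the parameter vector θ

    • J(θ) = (1/(2m)) Σ_{i=1..m} (hθ(x^(i)) - y^(i))^2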

      Gradient descent

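  • Once again, this is

    • θj := θj - α * ∂J(θ)/∂θj, i.e. θj minus the learning rate (α) times the partial derivative of J(θ) with respect to θj
    • We do this through a simultaneous update of every θj value (j = 0, ..., n)
  • When n = 1 this reduces to the two update rules (for θ0 and θ1) from single-variable linear regression
  • For n >= 1 the update for each j is
    • θj := θj - α * (1/m) Σ_{i=1..m} (hθ(x^(i)) - y^(i)) * x_j^(i)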
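A minimal sketch of batch gradient descent with a simultaneous update, assuming the squared-error cost J(θ) = (1/(2m)) Σ (hθ(x^(i)) - y^(i))^2. The function names, the toy data, α = 0.1 and the iteration count are all assumptions chosen for illustration.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already includes x0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """Squared-error cost J(theta) over all m examples."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, alpha, iterations):
    m, k = len(X), len(X[0])  # k = n + 1 parameters
    theta = [0.0] * k
    history = []
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # All partial derivatives are computed from the same old theta ...
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                    for j in range(k)]
        # ... and only then is every theta_j replaced (simultaneous update)
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
        history.append(cost(theta, X, y))
    return theta, history


# Toy data (made up): y = 1 + 2*x1 + 3*x2, first column is x0 = 1
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
theta, history = gradient_descent(X, y, alpha=0.1, iterations=5000)
```

On this toy data `theta` should end up very close to [1, 2, 3], and `history` holds the cost after each iteration.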

Gradient Descent in Practice I - Feature Scaling

  • Some of the practical tricks

  • Feature scaling

    • Make sure those multiple features have a similar scale

      • Means gradient descent will converge more quickly
    • e.g.

      • x1 = size (0 - 2000 feet squared)
      • x2 = number of bedrooms (1-5)
      • The contours of J(θ) plotted over θ1 vs. θ2 then form a very tall and thin shape because of the huge difference in range
    • Running gradient descent on this kind of cost function can take a long time to find the global minimum

  • Can do mean normalization

    • Take a feature xi
      • Replace it by (xi - μi) / si, where μi is the mean of that feature and si is its range (max - min) or its standard deviation
      • So our values all have an average of about 0
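Mean normalization can be sketched as below, assuming the divisor is the feature's range (max - min); dividing by the standard deviation is another common choice. The function name and the sample sizes are illustrative only.

```python
def mean_normalize(column):
    """Scale one feature column to have mean 0 and a spread of 1.

    Returns the scaled values plus the mean and range, which are needed
    later to scale any new example in exactly the same way.
    """
    mu = sum(column) / len(column)
    s = max(column) - min(column)  # assumption: range is used as the divisor
    if s == 0:
        raise ValueError("feature is constant, so it cannot be scaled")
    return [(v - mu) / s for v in column], mu, s


# Hypothetical house sizes in square feet
sizes = [2104, 1416, 1534, 852]
scaled, mu, s = mean_normalize(sizes)
```

The x0 = 1 column should be left out of this step, since it is constant by construction.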

Gradient Descent in Practice II - Learning Rate

Make sure gradient descent is working

  • Plot J(θ) against the number of iterations (i.e. plot the value of J(θ) reached after each iteration)

  • If gradient descent is working then J(θ) should decrease after every iteration

    • Checking it's working

      • If you plot J(θ) vs iterations and see the value is increasing - means you probably need a smaller α

        • The cause is that each step is so large it overshoots the minimum of J(θ), so the cost climbs instead of falling

    • So, reduce the learning rate so we actually reach the minimum

  • However

    • If α is small enough, J(θ) will decrease on every iteration, but if α is too small convergence is very slow
  • Typically

    • Try a range of alpha values
    • Plot J(θ) vs number of iterations for each version of alpha
    • Go for roughly threefold increases
      • 0.001, 0.003, 0.01, 0.03, 0.1, 0.3
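The "try a range of α values" procedure can be sketched as follows. Instead of plotting, this version just records the final cost for each α and flags a run as diverged when J(θ) ended higher than it started. The toy data, the 50-iteration budget and all names are assumptions for illustration.

```python
def cost_history(X, y, alpha, iterations):
    """Run batch gradient descent and return J(theta) before each update."""
    m, k = len(X), len(X[0])
    theta = [0.0] * k
    costs = []
    for _ in range(iterations):
        errors = [sum(t * v for t, v in zip(theta, row)) - target
                  for row, target in zip(X, y)]
        costs.append(sum(e * e for e in errors) / (2 * m))
        gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m
                    for j in range(k)]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return costs


# Toy data (made up): y = 2 * x1, with x0 = 1 in the first column
X = [[1.0, float(v)] for v in range(1, 6)]
y = [2.0 * row[1] for row in X]

final_cost = {}
diverged = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    costs = cost_history(X, y, alpha, iterations=50)
    final_cost[alpha] = costs[-1]
    diverged[alpha] = costs[-1] > costs[0]  # J went up: alpha is too large
```

On this particular data the largest value, 0.3, overshoots and diverges, while 0.1 makes far more progress in 50 iterations than 0.001. Which values work depends entirely on the data and its scaling.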

Features and Polynomial Regression

  • Polynomial regression for non-linear function

  • Example

    • House price prediction
      • Two features
        • Frontage - width of the plot of land along road (x1)
        • Depth - depth away from road (x2)
      • You don't have to use just two features
        • Can create new features
      • Might decide that an important feature is the land area
        • So, create a new feature = frontage * depth (x3)
        • h(x) = θ0 + θ1x3
          • Area is a better indicator
    • Often, by defining new features you may get a better model
  • Polynomial regression

    • May fit the data better
    • hθ(x) = θ0 + θ1x + θ2x^2, e.g. here we have a quadratic function
    • For housing data could use a quadratic function
      • But may not fit the data so well - a quadratic eventually turns back down, which would mean housing prices decrease when size gets really big
        • So instead could use a cubic function (θ0 + θ1x + θ2x^2 + θ3x^3)
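Both ideas in this section amount to building new feature columns before running ordinary linear regression. A small sketch, with hypothetical function names and made-up numbers:

```python
def area_feature(frontage, depth):
    """New feature x3 = frontage * depth (land area)."""
    return frontage * depth


def polynomial_features(x, degree):
    """Expand one value x into [x, x^2, ..., x^degree].

    Each power is then treated as a separate feature, so the model is
    still linear in the parameters theta.
    """
    return [x ** p for p in range(1, degree + 1)]


plot_area = area_feature(20, 30)           # hypothetical plot: 20 wide, 30 deep
cubic_terms = polynomial_features(10, 3)   # features for a cubic fit at x = 10
```

Note how quickly the ranges diverge (10, 100, 1000 here), so feature scaling becomes much more important when gradient descent is used with polynomial terms.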

Computing Parameters Analytically

Normal Equation

Example of normal equation

  • Here

    • m = 4 and n = 4
  • To implement the normal equation

    • Take examples

    • Add an extra column (x0 feature)

    • Construct a matrix (X - the design matrix) which contains all the training data features in an [m x n+1] matrix

    • Do something similar for y

      • Construct a column vector y, an [m x 1] matrix
    • Compute θ = (X^T X)^-1 X^T y

    • If you compute this, you get the value of θ which minimizes the cost function

  • A previous lecture discussed feature scaling

    • If you're using the normal equation then no need for feature scaling
    • Gradient descent
      • Need to choose a learning rate
      • Needs many iterations - could make it slower
      • Works well even when n is massive (millions)
        • Better suited to big data
        • What is a big n though
          • 100 or even 1000 is still (relatively) small
          • If n is 10,000 then look at using gradient descent
    • Normal equation
      • No need to choose a learning rate
      • No need to iterate, check for convergence etc.
      • Normal equation needs to compute (X^T X)^-1
        • This is the inverse of an n x n matrix
        • With most implementations the cost of computing a matrix inverse grows as O(n^3)
          • So not great
      • Slow if n is large
        • Can be much slower than gradient descent
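A dependency-free sketch of the normal equation. Note one deliberate substitution: rather than forming (X^T X)^-1 explicitly, it solves the equivalent linear system (X^T X) θ = X^T y by Gaussian elimination with partial pivoting, which gives the same θ when X^T X is invertible. The function name, the singularity tolerance and the toy data are assumptions for illustration.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta.

    X is the [m x n+1] design matrix (first column all ones), y has m entries.
    """
    m, k = len(X), len(X[0])
    # A = X^T X is [k x k]; b = X^T y has k entries
    A = [[sum(X[i][r] * X[i][c] for i in range(m)) for c in range(k)]
         for r in range(k)]
    b = [sum(X[i][r] * y[i] for i in range(m)) for r in range(k)]
    aug = [row + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]

    # Forward elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # heuristic tolerance (assumption)
            raise ValueError("X^T X is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]

    # Back substitution
    theta = [0.0] * k
    for r in range(k - 1, -1, -1):
        known = sum(aug[r][c] * theta[c] for c in range(r + 1, k))
        theta[r] = (aug[r][k] - known) / aug[r][r]
    return theta


# Toy data (made up): y = 1 + 2*x1 + 3*x2, first column is x0 = 1
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
theta = normal_equation(X, y)
```

There is no learning rate and no iteration, but the elimination step is the O(n^3) part, which is why this approach becomes slow for very large n.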

Normal Equation Noninvertibility

When computing (X^T X)^-1 X^T y

  • What if (X^T X) is non-invertible (singular/degenerate)
    • Only some matrices are invertible
    • This should be quite a rare problem
      • Octave can invert matrices using
        • pinv (pseudo inverse)
          • This gets the right value even if (X^T X) is non-invertible
        • inv (inverse)
  • What does it mean for (X^T X) to be non-invertible
    • Normally two common causes
      • Redundant features in learning model
        • e.g.
          • x1 = size in feet squared
          • x2 = size in meters squared
            • x1 is always a constant multiple of x2, so the two are linearly dependent
      • Too many features
        • e.g. m <= n (the number of examples is no larger than the number of features)
          • m = 10
          • n = 100
        • Trying to fit 101 parameters from 10 training examples
        • Sometimes works, but not always a good idea
        • Not enough data
        • Later look at why this may be too little data
        • To solve this we
          • Delete features
          • Use regularization (lets you use lots of features for a small training set)
  • If you find (X^T X) to be non-invertible
    • Look at features --> are features linearly dependent?
      • If so, just delete one; that will solve the problem
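The redundant-feature case can be checked on a small example. The sketch below uses exact fractions so that the determinant of X^T X comes out as exactly 0 when one feature is a constant multiple of another; with floating point it would only be close to 0. The sample houses, the rounded conversion factor and the helper names are assumptions for illustration.

```python
from fractions import Fraction


def gram_matrix(X):
    """Return X^T X for a design matrix given as a list of rows."""
    k = len(X[0])
    return [[sum(row[r] * row[c] for row in X) for c in range(k)]
            for r in range(k)]


def det3(A):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = A
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


sq_feet = [Fraction(v) for v in (2104, 1416, 1534, 852)]
to_sq_meters = Fraction(92903, 1000000)  # square meters per square foot, rounded

# Redundant: x2 is just x1 converted to square meters
X_redundant = [[Fraction(1), s, s * to_sq_meters] for s in sq_feet]
# Not redundant: x2 is the number of bedrooms instead
bedrooms = [5, 3, 3, 2]
X_ok = [[Fraction(1), s, Fraction(n)] for s, n in zip(sq_feet, bedrooms)]

det_redundant = det3(gram_matrix(X_redundant))
det_ok = det3(gram_matrix(X_ok))
```

Here `det_redundant` is exactly zero, so X^T X has no inverse, while `det_ok` is non-zero. Dropping the square-meters column is the "just delete one" fix.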

