# Machine Learning Week 2: Linear Regression with Multiple Variables

## Multivariate Linear Regression

### Multiple Features

• Multiple variables = multiple features
• e.g. a new scheme where we use more features to predict the price of a house
• x1, x2, x3, x4 are the four features
• x1 - size (square feet)
• x2 - Number of bedrooms
• x3 - Number of floors
• x4 - Age of home (years)
• y is the output variable (price)
• notation
• n
• number of features (n = 4)
• m
• number of examples (i.e. number of rows in a table)
• x^(i)
• vector of the inputs for an example (so a vector of the four features for the ith training example)
• i is an index into the training set
• So
• x^(i) is an n-dimensional feature vector
• x^(3) is, for example, the 3rd house, and contains the four features associated with that house
• x_j^(i)
• The value of feature j in the ith training example
• So
• x_2^(3) is, for example, the number of bedrooms in the third house
• Now hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
• For convenience of notation, x0 = 1
• For every example i you have an additional 0th feature, x0^(i) = 1
• So now your feature vector is an (n + 1)-dimensional vector indexed from 0
• This is a column vector called x
• Each example has a column vector associated with it
• So let's say we have a specific example with feature vector x
• Parameters are also in a 0 indexed n+1 dimensional vector
• This is also a column vector called θ
• Considering this, hypothesis can be written as
• hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
• We can write hθ(x) = θ^T x
• θ^T is a [1 x (n+1)] matrix
• In other words, θ is a column vector, and its transpose is a row vector
• So hθ(x) = θ^T x
• [1 x (n+1)] * [(n+1) x 1] gives a scalar
• This is an example of multivariate linear regression
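As an illustration, here is a minimal NumPy sketch of the vectorized hypothesis hθ(x) = θ^T x; the feature values and parameter values below are made up for illustration, not taken from the lecture.

```python
import numpy as np

# One training example with x0 = 1 prepended to the four house features
# (size, bedrooms, floors, age). The numbers are illustrative.
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])

# 0-indexed, (n+1)-dimensional parameter vector (illustrative values).
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])

# h_theta(x) = theta^T x; for 1-D NumPy arrays the @ operator is the dot product.
h = theta @ x
print(h)
```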

### Gradient Descent for Multiple Variables

• Our cost function is
• J(θ) = J(θ0, θ1, ..., θn) = (1/2m) * Σ_{i=1..m} (hθ(x^(i)) - y^(i))^2
• Similarly, J(θ) is a function of the parameter vector θ

Gradient descent

• Once again, this is
• θj := θj - α * ∂J(θ)/∂θj (for every j = 0, ..., n)
• Working out the partial derivative, the update becomes θj := θj - α * (1/m) * Σ_{i=1..m} (hθ(x^(i)) - y^(i)) * x_j^(i)
• We do this through a simultaneous update of every θj value
• When n = 1 this is the same algorithm we used for single-variable linear regression
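A minimal NumPy sketch of this simultaneous, vectorized update (the function name, learning rate, and iteration count below are illustrative choices, not from the lecture):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for multivariate linear regression.

    X is the [m x (n+1)] design matrix whose first column is all ones,
    y is the length-m vector of targets.
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example
        gradient = (X.T @ error) / m       # partial derivatives dJ/dtheta_j
        theta = theta - alpha * gradient   # simultaneous update of every theta_j
    return theta
```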
### Gradient Descent in Practice I - Feature Scaling

• Some of the practical tricks

• Feature scaling

• Make sure those multiple features have a similar scale

• Means gradient descent will converge more quickly
• e.g.

• x1 = size (0 - 2000 feet)
• x2 = number of bedrooms (1-5)
• Means the contours generated if we plot θ1 vs. θ2 give a very tall and thin shape due to the huge range difference
• Running gradient descent on this kind of cost function can take a long time to find the global minimum
• Can do mean normalization

• Take a feature xi
• Replace it with (xi - μi) / si, where μi is the average value of xi over the training set and si is the range (max - min) or the standard deviation
• So our values all have an average of about 0
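A minimal sketch of mean normalization, assuming we divide by the range (max - min); dividing by the standard deviation is also common. The feature values are illustrative.

```python
import numpy as np

def mean_normalize(X):
    """Mean-normalize each feature column: (x - mean) / (max - min)."""
    mu = X.mean(axis=0)                    # per-feature mean
    s = X.max(axis=0) - X.min(axis=0)      # per-feature range
    return (X - mu) / s, mu, s

# e.g. size (hundreds to thousands of sq ft) and number of bedrooms (1-5)
# end up on a similar scale after normalization.
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [ 852.0, 2.0]])
X_norm, mu, s = mean_normalize(X)
```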

### Gradient Descent in Practice II - Learning Rate

#### Make sure gradient descent is working

• Plot J(θ) vs. the number of iterations (i.e. plot the value of J(θ) computed after each iteration of gradient descent)

• If gradient descent is working then J(θ) should decrease after every iteration
• Checking it's working

• If you plot J(θ) vs. the number of iterations and see the value increasing, it means you probably need a smaller α
• The cause is that with too large a learning rate the updates overshoot the minimum, so J(θ) climbs instead of descending
• So, reduce the learning rate so that gradient descent actually reaches the minimum
• However
• If α is small enough, J(θ) will decrease on every iteration, but if α is too small the rate of convergence can be very slow
• Typically

• Try a range of alpha values
• Plot J(θ) vs number of iterations for each version of alpha
• Go for roughly threefold increases
• 0.001, 0.003, 0.01, 0.03, 0.1, 0.3
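A minimal sketch of this procedure: run gradient descent with each candidate α on the same (already normalized) data and plot J(θ) against the iteration number. The tiny data set below is made up purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum((h - y)^2)."""
    m = len(y)
    return ((X @ theta - y) ** 2).sum() / (2 * m)

def run_gd(X, y, alpha, iterations=100):
    """Run gradient descent and record J(theta) at every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iterations):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(X, y, theta))
    return history

# Illustrative, already-normalized design matrix (x0 = 1 column prepended) and targets.
X = np.c_[np.ones(3), np.array([[0.7, 0.9], [0.1, -0.2], [-0.8, -0.7]])]
y = np.array([400.0, 230.0, 180.0])

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    plt.plot(run_gd(X, y, alpha), label=f"alpha={alpha}")
plt.xlabel("number of iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```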

### Features and Polynomial Regression

• Polynomial regression for non-linear function

• Example

• House price prediction
• Two features
• Frontage - width of the plot of land along the road (x1)
• Depth - depth away from road (x2)
• You don't have to use just two features
• Can create new features
• Might decide that an important feature is the land area
• So, create a new feature x3 = frontage * depth (i.e. the land area)
• hθ(x) = θ0 + θ1x3
• Area is often a better indicator than frontage and depth separately
• Often, by defining new features you may get a better model
• Polynomial regression

• May fit the data better
• e.g. hθ(x) = θ0 + θ1x + θ2x^2 - here we have a quadratic function
• For housing data could use a quadratic function
• But a quadratic may not fit the data so well - it eventually turns back down, which would mean housing prices decrease when size gets really big
• So instead we might use a cubic function, hθ(x) = θ0 + θ1x + θ2x^2 + θ3x^3
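A minimal sketch of polynomial regression by constructing new features from the size feature; after adding x^2 and x^3 the columns have wildly different ranges, so feature scaling matters here. The sizes are illustrative.

```python
import numpy as np

# Illustrative house sizes (square feet).
size = np.array([852.0, 1416.0, 2104.0, 3000.0])

# Build polynomial features x, x^2, x^3.
X_poly = np.column_stack([size, size ** 2, size ** 3])

# Mean-normalize each column so gradient descent behaves well (see earlier sketch).
X_poly = (X_poly - X_poly.mean(axis=0)) / (X_poly.max(axis=0) - X_poly.min(axis=0))

# Prepend the x0 = 1 column to form the design matrix.
X = np.c_[np.ones(len(size)), X_poly]
```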

## Computing Parameters Analytically

### Normal Equation

#### Example of normal equation

• Here, m = 4 (training examples) and n = 4 (features)
• To implement the normal equation

• Take examples

• Add an extra column (x0 feature)

• Construct a matrix X (the design matrix) which contains all the training data features in an [m x (n+1)] matrix

• Do something similar for y

• Construct a column vector y ([m x 1])
• Then use the equation θ = (X^T X)^(-1) X^T y
• If you compute this, you get the value of θ which minimizes the cost function
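A minimal NumPy sketch of the normal equation; np.linalg.pinv plays the role of Octave's pinv here. The house data mirrors the style of the lecture's example table, but treat the numbers as illustrative.

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, computed with the pseudo-inverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Design matrix: x0 = 1, size, bedrooms, floors, age (one row per training example).
X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
              [1.0, 1416.0, 3.0, 2.0, 40.0],
              [1.0, 1534.0, 3.0, 2.0, 30.0],
              [1.0,  852.0, 2.0, 1.0, 36.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])   # prices

theta = normal_equation(X, y)   # minimizes J(theta); no feature scaling or alpha needed
```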

• In a previous lecture we discussed feature scaling
• If you're using the normal equation then there is no need for feature scaling
• When should you use gradient descent and when the normal equation?
• Gradient descent
• Need to choose a learning rate
• Needs many iterations - could make it slower
• But works well even when n is massive (millions)
• Better suited to big data
• What is a big n though?
• 100 or even 1000 is still (relatively) small
• If n is 10,000 then start to look at using gradient descent
• Normal equation
• No need to choose a learning rate
• No need to iterate, check for convergence, etc.
• But the normal equation needs to compute (X^T X)^(-1)
• This is the inverse of an (n+1) x (n+1) matrix
• With most implementations computing a matrix inverse grows by O(n^3)
• So not great
• Slow if n is large
• Can be much slower

### Normal Equation Noninvertibility

When computing θ = (X^T X)^(-1) X^T y

• What if (X^T X) is non-invertible (singular/degenerate)?
• Only some matrices are invertible
• This should be quite a rare problem
• Octave can invert matrices using
• pinv (pseudo-inverse)
• This gets the right value even if (X^T X) is non-invertible
• inv (inverse)
• What does it mean for (X^T X) to be non-invertible?
• Normally two common causes
• Redundant features in learning model
• e.g.
• x1 = size in square feet
• x2 = size in square metres (x1 and x2 are linearly dependent)
• Too many features
• e.g. m <= n (more features than training examples)
• m = 10
• n = 100
• Trying to fit 101 parameters from 10 training examples
• Sometimes works, but not always a good idea
• Not enough data
• Later look at why this may be too little data
• To solve this we
• Delete features
• Use regularization (lets you use lots of features with a small training set)
• If you find (XT X) to be non-invertible
• Look at the features --> are any features linearly dependent?
• If so, deleting one of them will usually solve the problem
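A small sketch of the redundant-feature case, assuming illustrative data: the second feature is just the first converted to square metres, so the columns of X are linearly dependent and X^T X is singular, yet the pseudo-inverse still returns a usable θ (as Octave's pinv does).

```python
import numpy as np

size_sqft = np.array([2104.0, 1416.0, 1534.0, 852.0])
size_sqm = size_sqft * 0.3048 ** 2          # exactly proportional to size_sqft

# Design matrix with a redundant feature, so X^T X is singular (non-invertible).
X = np.c_[np.ones(4), size_sqft, size_sqm]
y = np.array([460.0, 232.0, 315.0, 178.0])

# np.linalg.inv(X.T @ X) would fail or give numerically meaningless results here;
# the pseudo-inverse still produces a theta that minimizes the cost.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```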