## Multivariate Linear Regression

### Multiple Features

- Multiple variables = multiple features
- If in a new scheme we have more features to predict the price of a house
  - x_1, x_2, x_3, x_4 are the four features
    - x_1 - size (square feet)
    - x_2 - number of bedrooms
    - x_3 - number of floors
    - x_4 - age of home (years)
  - y is the output variable (price)

- Notation
  - n = number of features (n = 4 here)
  - m = number of training examples (i.e. the number of rows in the table)
  - x^{(i)} = the input vector of the i-th training example (so a vector of the four feature values for example i)
    - i is an index into the training set
    - So each x is an n-dimensional feature vector
    - e.g. x^{(3)} is the 3rd house, and contains the four features associated with that house
  - x_j^{(i)} = the value of feature j in the i-th training example
    - e.g. x_2^{(3)} is the number of bedrooms in the third house
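As a concrete sketch of this notation, assuming the examples live in a NumPy matrix with one row per example (made-up numbers; note NumPy is 0-indexed while the notes count examples and features from 1):

```python
import numpy as np

# m = 4 examples (rows), n = 4 features (columns: size, bedrooms, floors, age)
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [852,  2, 1, 36],
])

x_3 = X[2]       # x^(3): the feature vector of the 3rd house (0-indexed row 2)
x_2_3 = X[2, 1]  # x_2^(3): number of bedrooms in the 3rd house -> 3
```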
- The hypothesis is now
  - h_θ(x) = θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + θ_4x_4
- For convenience of notation, define x_0 = 1
  - For every example i you have an additional 0th feature
  - So now your feature vector is an (n + 1)-dimensional feature vector indexed from 0
    - This is a column vector called x
    - Each example has a column vector associated with it
  - Parameters are also in a 0-indexed (n + 1)-dimensional vector
    - This is also a column vector called θ
- Considering this, the hypothesis can be written as
  - h_θ(x) = θ_0x_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + θ_4x_4
- Which we compute as h_θ(x) = θ^T x
  - θ^T is a [1 x (n + 1)] matrix
    - In other words, θ is a column vector, and its transpose is a row vector
  - So h_θ(x) = θ^T x is a [1 x (n + 1)] * [(n + 1) x 1] multiplication, giving a scalar
- This is an example of multivariate linear regression
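A minimal NumPy sketch of the vectorized hypothesis (the θ values are made up for illustration):

```python
import numpy as np

# One example with x_0 = 1 prepended: [x_0, size, bedrooms, floors, age]
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])

# Illustrative parameter vector theta (n + 1 = 5 entries, indexed from 0)
theta = np.array([80.0, 0.1, 1.5, 2.0, -0.5])

# h_theta(x) = theta^T x - a [1 x 5] * [5 x 1] product giving a scalar
prediction = theta @ x
print(prediction)
```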

### Gradient Descent for Multiple Variables

Our cost function is

- J(θ_0, θ_1, ..., θ_n) = (1/2m) * Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})^2
- Equivalently, J(θ) - the cost is a function of the parameter vector θ

**Gradient descent**

Once again, repeat until convergence:

- θ_j := θ_j - α * ∂/∂θ_j J(θ)
  - i.e. each θ_j is reduced by the learning rate (α) times the partial derivative of J(θ) with respect to θ_j
  - Expanded: θ_j := θ_j - α * (1/m) * Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}
- We do this through a **simultaneous update** of every θ_j value (j = 0, ..., n)

When n = 1 this reduces to the univariate update rule from before (with x_0^{(i)} = 1).
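A minimal vectorized sketch in NumPy, assuming X is the [m x (n+1)] design matrix with a leading column of ones; the whole-vector update below is exactly the simultaneous update of every θ_j:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """Batch gradient descent for multivariate linear regression."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ error) / m       # all partial derivatives at once
        theta = theta - alpha * gradient   # simultaneous update of every theta_j
    return theta
```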

### Gradient Descent in Practice I - Feature Scaling

Some practical tricks for making gradient descent work well

Feature scaling

Make sure the features are on a similar scale

- Means gradient descent will converge more quickly

e.g.

- x_1 = size (0 - 2000 square feet)
- x_2 = number of bedrooms (1 - 5)
- The contours of J(θ) plotted over θ_1 vs. θ_2 then form a very tall, thin shape due to the huge range difference

Running gradient descent on this kind of cost function can take a long time to find the global minimum

Can do **mean normalization**

- Take a feature x_i
- Replace it by (x_i - μ_i) / s_i, where μ_i is the mean and s_i is the range (max - min) or the standard deviation
- So our values all have an average of about 0 and a similar scale
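A sketch of mean normalization in NumPy, using the range (max - min) as the scale; the standard deviation works just as well:

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column to mean ~0 with a similar range.

    X is an (m, n) matrix of raw features (before adding the x_0 column).
    Returns the scaled matrix plus mu and s, needed to scale future inputs.
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)  # range; X.std(axis=0) also works
    return (X - mu) / s, mu, s
```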

### Gradient Descent in Practice II - Learning Rate

#### Make sure gradient descent is working

Plot the value of J(θ) (which we're minimizing) vs. the number of iterations

If gradient descent is working then J(θ) should decrease after every iteration

Checking it's working

If you plot J(θ) vs. iterations and see the value increasing, you probably need a smaller α

- The cause: α is too large, so each update overshoots the minimum and J(θ) climbs instead of settling

So, reduce the learning rate so gradient descent actually descends to the minimum

However

- If α is small enough, J(θ) will decrease on every iteration - but if α is too small, convergence is very slow

Typically

- Try a range of α values
- Plot J(θ) vs. number of iterations for each value of α
- Go for roughly threefold increases
  - e.g. 0.001, 0.003, 0.01, 0.03, 0.1, 0.3
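A small sketch of that sweep on made-up data; in practice you would plot each cost history rather than just printing the final value:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum of squared prediction errors."""
    return ((X @ theta - y) ** 2).sum() / (2 * len(y))

# Made-up data: x_0 column of ones plus one mean-normalized feature
X = np.array([[1.0, -0.5], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    theta = np.zeros(X.shape[1])
    for _ in range(100):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
    print(f"alpha = {alpha}: J = {cost(X, y, theta):.4f}")
```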

### Features and Polynomial Regression

Polynomial regression lets us fit non-linear functions while reusing the linear regression machinery

Example

- House price prediction
  - Two features
    - Frontage - width of the plot of land along the road (x_1)
    - Depth - depth of the plot away from the road (x_2)
  - You don't have to use just the features you're given
    - Can create new features
    - Might decide that an important feature is the land area
      - So, create a new feature x_3 = frontage * depth
      - h_θ(x) = θ_0 + θ_1x_3
      - Area is a better indicator of price than frontage or depth alone
  - Often, by defining new features you may get a better model (a minimal sketch follows below)
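The sketch mentioned above, with made-up plot dimensions:

```python
import numpy as np

frontage = np.array([50.0, 60.0, 80.0])   # x_1 (feet)
depth = np.array([100.0, 90.0, 120.0])    # x_2 (feet)

# New feature x_3 = frontage * depth (land area)
area = frontage * depth

# Design matrix for h(x) = theta_0 + theta_1 * x_3
X = np.column_stack([np.ones(len(area)), area])
```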
Polynomial regression

- May fit the data better
- e.g. a quadratic: h_θ(x) = θ_0 + θ_1x + θ_2x^2
  - For housing data we could use a quadratic function of size
  - But it may not fit the data so well - a quadratic eventually turns downward, which would mean housing prices decrease when size gets really big
  - So instead we could use a cubic function: h_θ(x) = θ_0 + θ_1x + θ_2x^2 + θ_3x^3
- If you create features like x^2 and x^3, feature scaling becomes very important, because the ranges of x, x^2 and x^3 differ enormously
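A sketch of building cubic features from a single size feature, with mean normalization applied for the reason above (the sizes are made up):

```python
import numpy as np

def cubic_features(x):
    """Return a design matrix [1, x, x^2, x^3] with the non-constant
    columns mean-normalized so gradient descent converges quickly."""
    cols = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
    feats = cols[:, 1:]
    cols[:, 1:] = (feats - feats.mean(axis=0)) / (feats.max(axis=0) - feats.min(axis=0))
    return cols

sizes = np.array([850.0, 1200.0, 1600.0, 2100.0, 3000.0])
X = cubic_features(sizes)
```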

## Computing Parameters Analytically

### Normal Equation

#### Example of normal equation

In the lecture's example (a table of four houses, each with the four features above plus a price)

- m = 4 and n = 4

To implement the normal equation

- Take the training examples
- Add an extra column of 1s (the x_0 feature)
- Construct a matrix X (**the design matrix**) which contains all the training data features in an [m x (n + 1)] matrix
- Do something similar for y
  - Construct a column vector y - an [m x 1] matrix
- Use the equation θ = (X^T X)^{-1} X^T y
- If you compute this, you get the value of θ which minimizes the cost function
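A NumPy sketch of this procedure on a four-house table like the lecture's (the numbers are illustrative):

```python
import numpy as np

# Four examples: size, bedrooms, floors, age (m = 4, n = 4)
features = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [852,  2, 1, 36],
], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])  # prices

# Design matrix: prepend the x_0 = 1 column -> [m x (n + 1)]
X = np.column_stack([np.ones(len(y)), features])

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```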

In a previous lecture we discussed feature scaling

- If you're using the normal equation then there is no need for feature scaling

When to use gradient descent vs. the normal equation

- Gradient descent
  - Need to choose a learning rate
  - Needs many iterations - could make it slower
  - Works well even when n is massive (millions)
    - Better suited to big data
    - What is a big n though?
      - 100 or even 1000 is still (relatively) small
      - If n is 10,000+ then start to consider gradient descent
- Normal equation
  - No need to choose a learning rate
  - No need to iterate, check for convergence etc.
  - Needs to compute (X^T X)^{-1}
    - This is the inverse of an (n + 1) x (n + 1) matrix
    - With most implementations computing a matrix inverse grows as O(n^3)
    - So it can be much slower when n is large

### Normal Equation Noninvertibility

When computing θ = (X^T X)^{-1} X^T y

- What if (X^T X) is non-invertible (singular/degenerate)?
  - Only some matrices are invertible
  - This should be quite a rare problem
  - Octave can invert matrices using
    - pinv (pseudoinverse)
      - This gets the right value of θ even if (X^T X) is non-invertible
    - inv (inverse)

- What does it mean for (X^T X) to be non-invertible?
  - Normally there are two common causes
    - Redundant features in the learning model
      - e.g.
        - x_1 = size in square feet
        - x_2 = size in square meters
        - These are linearly dependent (one is a constant multiple of the other)
    - Too many features
      - e.g. m <= n (as many or more features than training examples)
        - m = 10
        - n = 100
      - Trying to fit 101 parameters from 10 training examples
        - Sometimes works, but not always a good idea
        - Not enough data
        - Later we look at *why* this may be too little data
      - To solve this we can
        - Delete some features
        - Use **regularization** (lets you use lots of features with a small training set)
- If you find (X^T X) to be non-invertible
  - Look at the features - are any linearly dependent?
    - If so, delete one; that will solve the problem