## Introduction

### What is machine learning?

- Two definitions:
- Arthur Samuel(1959)
- "The field of study that gives computers the ability to learn without being explicitly programmed."
- This is an older, informal definition.

- Tom Mitchell(1999)
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

The checkers example.- E = the experience of playing many games of checkers
- T = the task of playing checkers.
- P = the probability that the program will win the next game.

- Arthur Samuel(1959)
- Several types of learning algorithms
- Supervised learning
- Unsupervised learning
- Reinforcement learning
- Recommender systems

### Supervised learning - introduction

Brief Introduction

In supervised learning, we are given a data set and already know what our correct output should look like. We teach the computer learn to do something, then let it use it to predict the new output.

Categories

Regression

We are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.

No real discrete delineation.

Example

Housing price prediction

Problem : "Given this data, a people has a house 750 square feet - how much can they be expected to get?

Classification

We are trying to predict results in a discrete output, meaning that we are trying to classify data into one or two discrete classes(and always 0 or 1).

Example

Breast cancer prediction

Problem : "Can we define breast cancer as malignant or benign based on tumor size?"

### Unsupervised learning - introduction

Brief introduction

In unsupervised learning, we get unlabeled data, and just told computer "Here is a data set, can you structure it?"

One way to do this would be to cluster data into groups.

Example

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The "Cocktail Party Algorithm", allows you to find structure in a chaotic environment. (identifying individual voices and music from a mesh of sounds at a cocktail party).

## Model and Cost Function

### Model Representation

We use x^{(i)} to denote the “input” variables, and y^{(i)} to denote the “output” or target variable that we are trying to predict. A pair (x^{(i)},y^{(i)}) is called a training example, and the dataset that we’ll be using to learn—a list of m training examples (x^{(i)},y^{(i)}); i=1,...,m——is called a training set. We will also use X to denote the space of input values, and Y to denote the space of output values.

Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. The process is therefore like this:

When the target variable that we’re trying to predict is continuous, we call the learning problem a regression problem. When y can take on only a small number of discrete values, we call it a classification problem.

Example :

House price prediction —— Supervised learning regression problem

Training set (data set)

Notation

m = number of training examples

x's = input variables / features

y's = output variables / features

(x,y) - single training example

(x^{(i)},y^{(i)}) - specific example (i^{th} training example)

Pass into a learning algorithm

Algorithm outputs a function (denoted h) (h = hypothesis)

Take this function h_{θ}(x) = θ_{0} + θ_{1}x as a simple example. (shorthand: h(x))

In this function, Y is a linear function of x.

### Cost Function

We use cost function to measure the accuracy of our hypothesis function.

A cost function lets us figure out how to find the best function h(x) fit our data.

This function is otherwise called the "Squared error function", or "Mean squared error".

What we want to do is minimize the cost function J(θ_{0},θ_{1}).

Minimize squared different between predicted house price and actual house price

- 1/2m
- 1/m - means we determine the average.
- 1/2m - the 2 makes the math a bit easier, and doesn't change the constants we determine at all (i.e. half the smallest value is still the smallest value!).

- Minimizing θ0/θ1 means we get the values of θ0 and θ1 which find on average the minimal deviation of x from y when we use those parameters in our hypothesis function.

We should find the lowest point in the graph and θ at this time.

## Parameter Learning

### Gradient Descent

A algorithm used all over the machine learning for minimization.

The gradient descent algorithm is :

repeat following until convergence:

At each iteration j, one should simultaneously update the parameters θ_{1},θ_{2},...,θ_{n}.

Notation

":=" denotes assignment.

α (alpha)

Is a number called the **learning rate**

Controls how big a step you take

If α is big have an aggressive gradient descent

If α is small take tiny steps

Derivative term

We put θ_{0} on the x axis and θ_{1} on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The black arrows show the minimum points in the graph.