What is machine learning?
- Two definitions:
- Arthur Samuel(1959)
- "The field of study that gives computers the ability to learn without being explicitly programmed."
- This is an older, informal definition.
- Tom Mitchell(1999)
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."The checkers example.
- E = the experience of playing many games of checkers
- T = the task of playing checkers.
- P = the probability that the program will win the next game.
- Arthur Samuel(1959)
- Several types of learning algorithms
- Supervised learning
- Unsupervised learning
- Reinforcement learning
- Recommender systems
Supervised learning - introduction
In supervised learning, we are given a data set and already know what our correct output should look like. We teach the computer learn to do something, then let it use it to predict the new output.
We are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.
No real discrete delineation.
Housing price prediction
Problem : "Given this data, a people has a house 750 square feet - how much can they be expected to get?
We are trying to predict results in a discrete output, meaning that we are trying to classify data into one or two discrete classes(and always 0 or 1).
Breast cancer prediction
Problem : "Can we define breast cancer as malignant or benign based on tumor size?"
Unsupervised learning - introduction
In unsupervised learning, we get unlabeled data, and just told computer "Here is a data set, can you structure it?"
One way to do this would be to cluster data into groups.
Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.
Non-clustering: The "Cocktail Party Algorithm", allows you to find structure in a chaotic environment. (identifying individual voices and music from a mesh of sounds at a cocktail party).
Model and Cost Function
We use x(i) to denote the “input” variables, and y(i) to denote the “output” or target variable that we are trying to predict. A pair (x(i),y(i)) is called a training example, and the dataset that we’ll be using to learn—a list of m training examples (x(i),y(i)); i=1,...,m——is called a training set. We will also use X to denote the space of input values, and Y to denote the space of output values.
Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. The process is therefore like this:
When the target variable that we’re trying to predict is continuous, we call the learning problem a regression problem. When y can take on only a small number of discrete values, we call it a classification problem.
House price prediction —— Supervised learning regression problem
Training set (data set)
m = number of training examples
x's = input variables / features
y's = output variables / features
(x,y) - single training example
(x(i),y(i)) - specific example (ith training example)
Pass into a learning algorithm
Algorithm outputs a function (denoted h) (h = hypothesis)
Take this function hθ(x) = θ0 + θ1x as a simple example. (shorthand: h(x))
In this function, Y is a linear function of x.
We use cost function to measure the accuracy of our hypothesis function.
A cost function lets us figure out how to find the best function h(x) fit our data.
This function is otherwise called the "Squared error function", or "Mean squared error".
What we want to do is minimize the cost function J(θ0,θ1).
Minimize squared different between predicted house price and actual house price
- 1/m - means we determine the average.
- 1/2m - the 2 makes the math a bit easier, and doesn't change the constants we determine at all (i.e. half the smallest value is still the smallest value!).
- Minimizing θ0/θ1 means we get the values of θ0 and θ1 which find on average the minimal deviation of x from y when we use those parameters in our hypothesis function.
We should find the lowest point in the graph and θ at this time.
A algorithm used all over the machine learning for minimization.
The gradient descent algorithm is :
repeat following until convergence:
At each iteration j, one should simultaneously update the parameters θ1,θ2,...,θn.
":=" denotes assignment.
Is a number called the learning rate
Controls how big a step you take
If α is big have an aggressive gradient descent
If α is small take tiny steps
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.
We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The black arrows show the minimum points in the graph.
Gradient Descent For Linear Regression