# Machine Learning Week 8: Unsupervised Learning

## Clustering

### Unsupervised Learning Introduction

• What is clustering good for?
• Market segmentation - group customers into different market segments
• Social network analysis - Facebook "smartlists"
• Organizing computer clusters and data centers for network layout and location
• Astronomical data analysis - understanding galaxy formation

### K-Means Algorithm

#### Overview

• Take unlabeled data and group it into clusters (two, in this example)

• Algorithm overview

1. Randomly allocate two points as the cluster centroids

2. Cluster assignment step

• Go through each example and, depending on whether it's closer to the red or blue centroid, assign each point to one of the two clusters
• To demonstrate this, we've gone through the data and "coloured" each point red or blue

3. Move centroid step

• Take each centroid and move it to the average of the correspondingly assigned data points

4. Repeat 2) and 3) until convergence
• More formal definition

• Input:

• K (number of clusters in the data)
• Training set {x1, x2, x3, ..., xm}
• Algorithm:

• Randomly initialize K cluster centroids as {μ1, μ2, μ3 ... μK}
• Repeat until convergence:
• Cluster assignment: for i = 1 to m, set ci := index (from 1 to K) of the centroid closest to xi
• Move centroid: for k = 1 to K, set μk := mean of the points currently assigned to cluster k
• (see the Octave sketch below for these two steps in code)

• What if there's a centroid with no data assigned to it?

• Remove that centroid, so you end up with K-1 clusters
• Or, randomly reinitialize it (the option to use if you need exactly K clusters)
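A minimal Octave sketch of the loop above - a sketch under assumptions, not the course's official implementation; it assumes X is an [m x n] data matrix with one example per row and K is given:

```matlab
m = size(X, 1);
centroids = X(randperm(m, K), :);   % randomly pick K examples as initial centroids
c = zeros(m, 1);                    % c(i) = index of the cluster assigned to example i
for iter = 1:100                    % fixed iteration cap instead of a convergence test
  % Cluster assignment step: assign each example to its closest centroid
  for i = 1:m
    dists = sum(bsxfun(@minus, centroids, X(i, :)).^2, 2);   % squared distances [K x 1]
    [~, c(i)] = min(dists);
  end
  % Move centroid step: move each centroid to the mean of its assigned points
  for k = 1:K
    if any(c == k)
      centroids(k, :) = mean(X(c == k, :), 1);
    end   % empty clusters are left where they are in this sketch
  end
end
```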

### Optimization Objective

• While K-means is running we keep track of two sets of variables

• ci is the index of the cluster {1, 2, ..., K} to which example xi is currently assigned
• i.e. there are m ci values, as each example has a ci value, and that value is one of the K clusters (i.e. can only be one of K different values)
• μk is the location of cluster centroid k
• So there are K of them
• These centroids are points which exist in the same space as the training data
• μci is the centroid of the cluster to which example xi has been assigned
• This is more for convenience than anything else
• You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
• Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μj)
• But instead, for easy description, we have this variable which gets exactly the same value
• Let's say xi has been assigned to cluster 5
• Means that
• ci = 5
• μci = μ5
• Using this notation we can write the optimization objective;
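• J(c1, ..., cm, μ1, ..., μK) = (1/m) Σi=1..m ||xi - μci||²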

• This is sometimes called the distortion (or distortion cost function)

• We are finding the values of ci and μk which minimize this function

#### Random initialization

• Have the number of centroids set to less than the number of examples (K < m) (if K > m we have a problem)

• Randomly pick K training examples
• Set μ1 up to μK to these examples' values
• K means can converge to different solutions depending on the initialization setup

• Risk of local optimum

• These local optima are valid convergence points, but they are local rather than global optima

• Algorithm as follows (sketched in code after this list)

• Randomly initialize and run K-means many times - a typical number of initializations is 50-1000; say we use 100
• For each of the 100 random initializations, run K-means to convergence
• Then compute the distortion on the resulting set of cluster assignments and centroids
• End with 100 ways of clustering the data
• Pick the clustering which gave the lowest distortion
• If we're running K-means with a small number of clusters (2-10), multiple random initializations can help find a better global optimum

• If K is larger than 10, then multiple random initializations are less likely to be necessary
• The first solution is probably good enough (with many clusters, any solution already gives fine granularity)
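A sketch of the multiple-initialization loop, assuming a hypothetical runkMeans(X, K) helper (not a built-in) that randomly initializes, runs K-means to convergence, and returns the assignments, centroids and distortion J:

```matlab
bestJ = Inf;
for t = 1:100
  [c, centroids, J] = runkMeans(X, K);   % hypothetical helper, random init inside
  if J < bestJ                           % keep the run with the lowest distortion
    bestJ = J;
    bestC = c;
    bestCentroids = centroids;
  end
end
```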

### Choosing the Number of Clusters

Doing it automatically is hard; normally you choose K manually, using visualizations to guide the choice.

#### Elbow method

• Vary K and compute cost function at a range of K values

• As K increases the minimum value of J(...) should decrease (i.e. more centroids means finer granularity, so the centroids can fit the data better)

• Plot this (K vs J())

• Risks

• Normally you don't get a nice line -> no clear elbow on the curve (the sketch below shows the basic computation)
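A sketch of the elbow method, reusing the hypothetical runkMeans helper from the previous section:

```matlab
Ks = 1:10;
J = zeros(size(Ks));
for K = Ks
  [c, centroids, J(K)] = runkMeans(X, K);   % distortion at each K
end
plot(Ks, J, 'bo-');                         % look for an "elbow" in this curve
xlabel('K (number of clusters)');
ylabel('J (distortion)');
```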

# Dimensionality Reduction

## Motivation

### Motivation I: Data Compression

#### Compression

• What does dimensionality reduction mean?
• In our example we plot a line

• Take the exact same example and record its position on that line

• So before x1 was a 2D feature vector (X and Y dimensions)

• Now we can represent x1 as a 1D number (Z dimension)

• Another example: 3D -> 2D

• Here's our data

• In the diagram below, imagine all our data points are sitting "inside" the blue tray (has a dark blue exterior face and a light blue inside)

• Plot values along those projections

### Motivation II: Visualization

• Example
• Collect a large data set of many facts about countries around the world

• So
• x1 = GDP
• ...
• x6 = mean household income
• Say we have 50 features per country
• How can we understand this data better?
• Very hard to plot 50 dimensional data
• Using dimensionality reduction, instead of each country being represented by a 50-dimensional feature vector

• Come up with a different feature representation (z values, e.g. 2D) which summarizes these features, so the data can be plotted

## Principal Component Analysis

### Principal Component Analysis Problem Formulation

• For the problem of dimensionality reduction the most commonly used algorithm is PCA

• So

• We have a 2D data set which we wish to reduce to 1D

• In other words, find a single line onto which to project this data

• How do we determine this line?

• The distance between each point and the projected version should be small (blue lines below are short)

• PCA tries to find a lower dimensional surface such that the sum of squared projection distances onto that surface is minimized

• The blue lines are sometimes called the projection error

• PCA tries to find the surface (a straight line in this case) which has the minimum projection error

• As an aside, you should normally do mean normalization and feature scaling on your data before PCA

• A more formal description is

• For 2D->1D, we must find a single vector u(1) (a direction in the original space)

• Onto which you can project the data so as to minimize the projection error

• u(1) can be positive or negative (-u(1)), which makes no difference

• Each of the two vectors defines the same red line
• In the more general case

• To reduce from nD to kD we
• Find k vectors (u(1), u(2), ... u(k)) onto which to project the data to minimize the projection error
• So lots of vectors onto which we project the data
• i.e. we project the data onto the linear subspace spanned by that set of vectors
• A point in that subspace is defined by a linear combination of the k vectors

### Principal Component Analysis Algorithm

• Before applying PCA must do data preprocessing

• Given a set of m unlabeled examples we must do
• Mean normalization
• Replace each xji with xji - μj
• In other words, determine the mean of each feature, and then for each feature subtract the mean from the value, so we rescale the mean to be 0
• Feature scaling (depending on data)
• If features have very different scales then scale so they all have a comparable range of values
• e.g. xji is set to (xji - μj) / sj
• Where sj is some measure of the range of feature j, so could be
• Biggest - smallest
• Standard deviation (more commonly used)
• With preprocessing done, PCA finds the lower dimensional subspace which minimizes the sum of squared projection errors

• In summary, for 2D->1D we'd normalize the data like this, and then find the line which minimizes the projection error;
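A minimal Octave sketch of the preprocessing (mu, sigma_s and X_norm are illustrative names; X is assumed to be an [m x n] data matrix with one example per row):

```matlab
mu      = mean(X);                 % [1 x n] vector of feature means
sigma_s = std(X);                  % [1 x n] vector of feature standard deviations
X_norm  = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma_s);   % normalized data
```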

• Need to compute two things;

• Compute the u vectors
• The directions which define the new, lower dimensional surface
• Need to compute the z vectors
• z vectors are the new, lower dimensionality feature vectors
• A mathematical derivation for the u vectors is very complicated

• But once you've done it, the procedure to find each u vector is not that hard

#### Algorithm description

• Reducing data from n-dimensional to k-dimensional

• Compute the covariance matrix

• This is commonly denoted as Σ (Greek uppercase sigma) - NOT the summation symbol

• Σ = (1/m) Σi=1..m (xi)(xi)T

• This is an [n x n] matrix
• Remember that xi is an [n x 1] vector, so each (xi)(xi)T term is [n x n]
• In MATLAB or Octave we can implement this as follows
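A one-line vectorized version (assuming X is the [m x n] matrix of examples stored in rows, already mean-normalized):

```matlab
Sigma = (1 / m) * X' * X;   % [n x m] * [m x n] = [n x n] covariance matrix
```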

• Compute eigenvectors of matrix Σ

• [U,S,V] = svd(Sigma);
• svd = singular value decomposition
• More numerically stable than eig
• eig also gives the eigenvectors
• U, S and V are matrices

• U matrix is also an [n x n] matrix

• Turns out the columns of U are the u vectors we want!

• So to reduce a system from n-dimensions to k-dimensions

• Just take the first k-vectors from U (first k columns)

• Next we need to find some way to change x (which is n dimensional) to z (which is k dimensional)

• (reduce the dimensionality)
• Take the first k columns of the U matrix and stack them side by side
• This gives an [n x k] matrix - call this Ureduce
• We calculate z as follows
• z = (Ureduce)T * x
• So [k x n] * [n x 1]
• Generates a vector which is [k x 1]
• If that's not witchcraft I don't know what is!
• So in summary

• Preprocessing
• Calculate sigma (covariance matrix)
• Calculate eigenvectors with svd
• Take k vectors from U (Ureduce = U(:, 1:k);)
• Calculate z (z = Ureduce' * x;)
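Putting the summary together as one minimal Octave sketch (X_norm, sigma_s etc. are illustrative names; X is [m x n] with one example per row, and k is chosen in advance):

```matlab
m = size(X, 1);
mu      = mean(X);
sigma_s = std(X);
X_norm  = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma_s);   % preprocessing
Sigma   = (1 / m) * (X_norm' * X_norm);                       % covariance matrix
[U, S, V] = svd(Sigma);                                       % eigenvectors of Sigma
Ureduce = U(:, 1:k);                                          % first k columns of U
Z       = X_norm * Ureduce;                                   % [m x k] reduced data, one z per row
```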

## Applying PCA

### Reconstruction from Compressed Representation

• Reconstruction

• We have an example as follows

• We have our examples (x1, x2 etc.)

• Project onto z-surface

• Given a point z1, how can we go back to the 2D space?

• xapprox = Ureduce * z

• To consider dimensions (and prove this really works)
• Ureduce = [n x k]
• z = [k x 1]
• So
• xapprox = [n x 1]
• So this creates points back in the original n-dimensional space, lying on the projection surface (an approximation of the original data)
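In matrix form for all examples at once (continuing the illustrative names from the sketch above):

```matlab
X_approx = Z * Ureduce';   % [m x k] * [k x n] = [m x n]; rows approximate X_norm
```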

### Choosing the Number of Principal Components

• k = number of principal components

• PCA tries to minimize averaged squared projection error

• Total variation in the data can be defined as the average squared distance of the training examples from the origin: (1/m) Σi=1..m ||xi||²

• When we're choosing k it is typical to use something like the following

• Ratio between the averaged squared projection error and the total variation in the data:
• (1/m Σi=1..m ||xi - xapproxi||²) / (1/m Σi=1..m ||xi||²) ≤ 0.01
• Want the ratio to be small - e.g. ≤ 0.01 means we retain 99% of the variance

• If it's small (≈ 0) then this is because the numerator is small
• The numerator is small when xi ≈ xapproxi
• i.e. we lose very little information in the dimensionality reduction, so when we decompress we regenerate the same data
• So we choose the smallest k for which this ratio holds

• Often can significantly reduce data dimensionality while retaining the variance
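In practice you don't need to rerun PCA for each candidate k: the diagonal entries of the S matrix from svd(Sigma) give the variance retained directly. A sketch:

```matlab
s = diag(S);                              % singular values, largest first
varianceRetained = cumsum(s) / sum(s);    % variance retained for k = 1, 2, ..., n
k = find(varianceRetained >= 0.99, 1);    % smallest k retaining 99% of the variance
```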

• A common application is using PCA to speed up a supervised learning algorithm

• How:
1. Extract the x values (drop the y labels)
• So we now have an unlabeled training set
2. Apply PCA to x vectors
• So we now have a reduced dimensional feature vector z
3. This gives you a new training set
• Each vector can be re-associated with the label
4. Take the reduced dimensionality data set and feed to a learning algorithm
• Use y as labels and z as feature vector
5. If you have a new example, map it from the higher dimensionality vector to the lower dimensionality vector, then feed it into the learning algorithm
• PCA maps one vector to a lower dimensionality vector
• x -> z
• Defined by PCA only on the training set
• The mapping computes a set of parameters
• Feature scaling values
• Ureduce
• Parameter learned by PCA
• Should be obtained only by determining PCA on your training set
• So we use those learned parameters (as in the sketch below) for our
• Cross validation data
• Test set
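A sketch of reusing the training-set parameters on new data (continuing the illustrative names mu, sigma_s, Ureduce and the hypothetical matrix X_cv of cross validation examples):

```matlab
% mu, sigma_s and Ureduce were computed from the training set ONLY
Xcv_norm = bsxfun(@rdivide, bsxfun(@minus, X_cv, mu), sigma_s);
Z_cv     = Xcv_norm * Ureduce;   % same mapping applied to the cross validation set
```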