## Clustering

### Unsupervised Learning Introduction

- What is clustering good for
- Market segmentation - group customers into different market segments
- Social network analysis - Facebook "smartlists"
- Organizing computer clusters and data centers for network layout and location
- Astronomical data analysis - Understanding galaxy formation

### K-Means Algorithm

#### Overview

Take unlabeled data and group into two clusters

Algorithm overview

1) Randomly allocate two points as the cluster centroids

2) Cluster assignment step

- Go through each example and, depending on whether it's closer to the red or the blue centroid, assign it to one of the two clusters
- To demonstrate this, we go through the data and "colour" each point red or blue

3) Move centroid step

- Take each centroid and move it to the average of the correspondingly assigned data points

Repeat 2) and 3) until convergence

More formal definition

Input:

- K (number of clusters in the data)
- Training set {x^{1}, x^{2}, x^{3}, ..., x^{m}} (no labels)

Algorithm:

Randomly initialize K cluster centroids as {μ_{1}, μ_{2}, μ_{3}, ..., μ_{K}}

Then repeat until convergence (a minimal Octave sketch follows below)

- Cluster assignment step: assign each example x^{i} to its closest centroid
- Move centroid step: move each centroid μ_{k} to the mean of the points currently assigned to it

What if there's a centroid with no data points assigned to it?

- Remove that centroid, so you end up with K-1 clusters
- Or, randomly reinitialize it
  - Not sure when one option is better than the other though...
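A minimal Octave/MATLAB sketch of the algorithm above. Names like runKMeans are illustrative (not from the lectures), and X is assumed to be an [m x n] matrix with one example per row:

```matlab
function [c, mu] = runKMeans(X, K, maxIters)
  % K-means sketch: returns cluster indices c (m x 1) and centroids mu (K x n)
  m = size(X, 1);
  idx = randperm(m);
  mu = X(idx(1:K), :);                        % initialize centroids from K random examples
  c = zeros(m, 1);
  for iter = 1:maxIters
    % Cluster assignment step: assign each example to its closest centroid
    for i = 1:m
      dists = sum(bsxfun(@minus, mu, X(i, :)).^2, 2);
      [~, c(i)] = min(dists);
    end
    % Move centroid step: move each centroid to the mean of its assigned points
    for k = 1:K
      if any(c == k)                          % (if a cluster is empty, keep its old centroid)
        mu(k, :) = mean(X(c == k, :), 1);
      end
    end
  end
end
```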

### Optimization Objective

While K-means is running we keep track of two sets of variables

- c^{i} is the index of the cluster {1, 2, ..., K} to which example x^{i} is currently assigned
  - i.e. there are m c^{i} values, as each example has a c^{i} value, and that value is one of the clusters (i.e. it can only be one of K different values)
- μ_{k} is the location of cluster centroid k
  - So there are K of them
  - These centroids exist in the same space as the training data
- μ_{c^{i}} is the cluster centroid of the cluster to which example x^{i} has been assigned
  - This is more for convenience than anything else
  - You could look up that example i is assigned to cluster j (using the c vector), where j is between 1 and K
  - Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μ_{j})
  - But instead, for easy description, we have this variable which gets exactly the same value
  - Let's say x^{i} has been assigned to cluster 5
    - Means that c^{i} = 5 and μ_{c^{i}} = μ_{5}
Using this notation we can write the optimization objective;
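Written out, the cost function is

J(c^{1}, ..., c^{m}, μ_{1}, ..., μ_{K}) = (1/m) Σ_{i=1}^{m} ||x^{i} - μ_{c^{i}}||^{2}

i.e. the average squared distance between each example and the centroid of the cluster it is currently assigned to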

This is sometimes called the distortion (or distortion cost function)

We are finding the values of the c^{i} and μ_{k} variables which minimize this function

#### Random initialization

The number of centroids should be set to less than the number of examples (K < m) - if K > m we have a problem

- Randomly pick K training examples
- Set μ_{1} up to μ_{K} to those examples' values

K means can converge to different solutions depending on the initialization setup

Risk of local optimum

A local optimum is a valid convergence point, but it may not be the global optimum

Algorithm is as follows (a typical number of random initializations is 50-1000; say we use 100)

- For each of the 100 random initializations
  - Randomly initialize K-means
  - Run K-means to convergence
  - Compute the distortion on the resulting set of cluster assignments and centroids
- End with 100 different ways of clustering the data
- Pick the clustering which gave the lowest distortion (see the sketch below)
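A sketch of this in Octave, reusing the runKMeans function sketched earlier; the distortion is computed directly from the returned assignments and centroids:

```matlab
bestJ = Inf;
for trial = 1:100                          % 100 random initializations
  [c, mu] = runKMeans(X, K, 100);          % each call randomly re-initializes the centroids
  J = mean(sum((X - mu(c, :)).^2, 2));     % distortion for this solution
  if J < bestJ                             % keep the lowest-distortion clustering
    bestJ = J;  bestC = c;  bestMu = mu;
  end
end
```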

If we're running K-means with a small number of clusters (2-10), multiple random initializations can help find a better global optimum

- If K is larger than 10, then multiple random initializations are less likely to be necessary
- The first solution is probably good enough (the clustering already has fine granularity)

### Choosing the Number of Clusters

Doing it automatically is hard; normally it is done manually, e.g. by looking at visualizations of the data.

#### Elbow method

Vary K and compute cost function at a range of K values

As K increases the minimum value of J(...) should decrease (i.e. more centroids means finer granularity, so each example can be closer to its centroid)

Plot this (K vs J()) and pick the K at the "elbow" of the curve, where the rate of decrease flattens out (see the sketch below)

Risks

- Normally you don't get a nice curve -> no clear elbow
- Not really that helpful
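For example, in Octave (again reusing the runKMeans sketch from above):

```matlab
% Elbow method sketch: compute the distortion for K = 1..10 and plot it against K
Js = zeros(1, 10);
for K = 1:10
  [c, mu] = runKMeans(X, K, 100);
  Js(K) = mean(sum((X - mu(c, :)).^2, 2));
end
plot(1:10, Js, '-o'); xlabel('K'); ylabel('distortion J');
```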

# Dimensionality Reduction

## Motivation

### Motivation I: Data Compression

#### Compression

- What does dimensionality reduction mean?
  - In our example we plot a line through the data
  - Take each example and record its position along that line
  - So before, x^{1} was a 2D feature vector (X and Y dimensions)
  - Now we can represent x^{1} by a single number z^{1} (its position along the line, the Z dimension)

Another example: 3D -> 2D

- Here's our data
- In the diagram below, imagine all our data points are sitting "inside" the blue tray (which has a dark blue exterior face and a light blue inside)
- Project the points onto the plane of the tray and plot their values along those projections (the two new features z_{1} and z_{2})

### Motivation II: Visualization

- Example
  - Collect a large data set of many facts about countries around the world
  - So
    - x_{1} = GDP
    - ...
    - x_{6} = mean household income
  - Say we have 50 features per country
  - How can we understand this data better?
    - Very hard to plot 50-dimensional data

Using dimensionality reduction, instead of each country being represented by a 50-dimensional feature vector, we come up with a different feature representation (z values, typically 2D so it can be plotted) which summarizes these features

## Principal Component Analysis

### Principal Component Analysis Problem Formulation

For the problem of dimensionality reduction the most commonly used algorithm is PCA

So

We have a 2D data set which we wish to reduce to 1D

In other words, find a single line onto which to project this data

How do we determine this line?

The distance between each point and the projected version should be small (blue lines below are short)

PCA tries to find a lower-dimensional surface onto which to project the data so that the sum of squared projection distances is minimized

The blue lines are sometimes called the projection error

PCA tries to find the surface (a straight line in this case) which has the minimum projection error

As an aside, you should normally do mean normalization and feature scaling on your data before PCA

A more formal description is

For 2D -> 1D, we must find a vector u^{(1)} (a direction in the original space) onto which you can project the data so as to minimize the projection error

- u^{(1)} can be positive or negative (-u^{(1)}), which makes no difference
  - Either vector defines the same red line

In the more general case

- To reduce from nD to kD we
  - Find k vectors (u^{(1)}, u^{(2)}, ..., u^{(k)}) onto which to project the data so as to minimize the projection error
  - So we have several vectors onto which we project the data
  - i.e. we find a set of vectors and project the data onto the linear subspace spanned by that set of vectors
  - A point in that subspace (e.g. a plane when k = 2) can be specified by k numbers

### Principal Component Analysis Algorithm

Before applying PCA we must do data preprocessing

- Given a set of m unlabeled examples we must do
  - Mean normalization
    - Replace each x_{j}^{i} with x_{j}^{i} - μ_{j}
    - In other words, determine the mean of each feature, then for each feature subtract that mean from every value, so we re-scale each feature's mean to be 0
  - Feature scaling (depending on the data)
    - If features have very different scales then scale them so they all have a comparable range of values
    - e.g. x_{j}^{i} is set to (x_{j}^{i} - μ_{j}) / s_{j}
      - Where s_{j} is some measure of the range of feature j, which could be
        - Biggest minus smallest
        - Standard deviation (more common)
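A minimal Octave sketch of this preprocessing, assuming X is an [m x n] matrix with one example per row:

```matlab
mu = mean(X);                           % 1 x n vector of per-feature means
X_norm = bsxfun(@minus, X, mu);         % mean normalization: each feature now has mean 0
s = std(X_norm);                        % per-feature standard deviation
X_norm = bsxfun(@rdivide, X_norm, s);   % feature scaling: comparable ranges across features
```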
With preprocessing done, PCA finds the lower-dimensional sub-space which minimizes the sum of squared projection errors

In summary, for 2D->1D we'd be doing something like this;

Need to compute two things;

- The u vectors
  - The new directions (planes) onto which we project the data
- The z vectors
  - The z vectors are the new, lower-dimensionality feature vectors

A mathematical derivation of the u vectors is very complicated

- But once you've done it, the procedure to find each u vector is not that hard

#### Algorithm description

Reducing data from n-dimensional to k-dimensional

Compute the covariance matrix

- Σ = (1/m) Σ_{i=1}^{m} (x^{i}) (x^{i})^{T}
- This is commonly denoted as Σ (greek upper case sigma) - NOT the summation symbol
- In code, Σ = sigma
- This is an [n x n] matrix
  - Remember that x^{i} is an [n x 1] matrix, so (x^{i})(x^{i})^{T} is [n x n]
In MATLAB or octave we can implement this as follows
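(a minimal version, assuming X is the [m x n] matrix of preprocessed examples, one per row, so the sum over examples becomes a single matrix product)

```matlab
m = size(X, 1);                 % number of examples
Sigma = (1 / m) * (X' * X);     % [n x m] * [m x n] = [n x n] covariance matrix
```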

Compute eigenvectors of matrix Σ

- [U,S,V] = svd(Sigma)
  - svd = singular value decomposition
  - More numerically stable than eig
  - eig = also gives the eigenvectors

U, S and V are matrices

The U matrix is also an [n x n] matrix

Turns out the columns of U are the u vectors we want!

So to reduce a system from n dimensions to k dimensions

- Just take the first k vectors from U (the first k columns)

Next we need to find some way to change x (which is n-dimensional) to z (which is k-dimensional), i.e. reduce the dimensionality

- Take the first k columns of the U matrix and stack them in columns
  - An [n x k] matrix - call this U_{reduce}
- We calculate z as follows
  - z = (U_{reduce})^{T} * x
  - So [k x n] * [n x 1]
  - Generates a matrix which is [k x 1]
- If that's not witchcraft I don't know what is!

So in summary

- Preprocessing (mean normalization and, if needed, feature scaling)
- Calculate sigma (the covariance matrix)
- Calculate the eigenvectors with svd
- Take the first k vectors from U (Ureduce = U(:, 1:k);)
- Calculate z (z = Ureduce' * x;)
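Putting the whole pipeline together as a minimal Octave sketch (vectorized over all m examples, so z = Ureduce' * x per example becomes Z = X_norm * Ureduce; k is assumed to have been chosen already):

```matlab
mu = mean(X);                                          % preprocessing: per-feature means
X_norm = bsxfun(@minus, X, mu);                        % mean normalization
Sigma = (1 / size(X_norm, 1)) * (X_norm' * X_norm);    % [n x n] covariance matrix
[U, S, V] = svd(Sigma);                                % columns of U are the eigenvectors
Ureduce = U(:, 1:k);                                   % first k eigenvectors, [n x k]
Z = X_norm * Ureduce;                                  % projected data, [m x k]
```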

## Applying PCA

### Reconstruction from Compressed Representation

Reconstruction

We have an example as follows

- We have our examples (x^{1}, x^{2} etc.)
- Project them onto the z-surface

Given a point z^{1}, how can we go back to the original 2D space?

x_{approx} = U_{reduce} * z

- To consider the dimensions (and prove this really works)
  - U_{reduce} is [n x k]
  - z is [k x 1]
  - So x_{approx} is [n x 1]

So this creates the following representation - an approximation of each original x which lies on the projection line/surface
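A one-line Octave sketch of this, vectorized over all examples (Z and Ureduce as in the PCA sketch above):

```matlab
X_approx = Z * Ureduce';   % [m x k] * [k x n] = [m x n] approximation of the original data
```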

### Choosing the Number of Principal Components

k = number of principal components

PCA tries to minimize averaged squared projection error

Total variation in the data can be defined as the average, over the data set, of how far the training examples are from the origin

When we're choosing k it is typical to use something like this

- The ratio between the averaged squared projection error and the total variation in the data:
  - [(1/m) Σ_{i=1}^{m} ||x^{i} - x_{approx}^{i}||^{2}] / [(1/m) Σ_{i=1}^{m} ||x^{i}||^{2}]

Want the ratio to be small, e.g. ≤ 0.01 - this means we retain 99% of the variance

- If it's small (close to 0) then this is because the numerator is small
  - The numerator is small when x^{i} ≈ x_{approx}^{i}
    - i.e. we lose very little information in the dimensionality reduction, so when we decompress we regenerate (nearly) the same data
So we choose k in terms of this ratio

Often can significantly reduce data dimensionality while retaining the variance

How do you do this efficiently? Rather than re-running PCA for each candidate k, you can use the diagonal S matrix from [U,S,V] = svd(Sigma): pick the smallest k for which (Σ_{i=1}^{k} S_{ii}) / (Σ_{i=1}^{n} S_{ii}) ≥ 0.99 (i.e. 99% of the variance is retained)
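A sketch of that check in Octave, using the S matrix returned by svd:

```matlab
sv = diag(S);                       % the diagonal entries S_11, ..., S_nn
retained = cumsum(sv) / sum(sv);    % fraction of variance retained for k = 1, ..., n
k = find(retained >= 0.99, 1);      % smallest k retaining at least 99% of the variance
```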

### Advice for Applying PCA

- How to use PCA to speed up a supervised learning algorithm
  - Extract the xs from your labeled (x, y) examples
    - So we now have an unlabeled training set
  - Apply PCA to the x vectors
    - So we now have a reduced-dimensionality feature vector z for each example
  - This gives you a new training set
    - Each z vector can be re-associated with its label y
  - Take the reduced-dimensionality data set and feed it to a learning algorithm
    - Use y as the labels and z as the feature vectors
  - If you have a new example, map it from the higher-dimensionality x vector to the lower-dimensionality z vector, then feed that into the learning algorithm
- PCA maps one vector to a lower-dimensionality vector
  - x -> z
  - This mapping should be defined by running PCA only on the training set
  - The mapping computes a set of parameters
    - Feature scaling values
    - U_{reduce}
    - These are parameters learned by PCA and should be obtained only by running PCA on your training set
  - We then use those learned parameters for our
    - Cross validation data
    - Test set
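A short Octave sketch of this; Xtrain and Xtest are illustrative names, and the key point is that mu, s and Ureduce are computed from the training set only:

```matlab
mu = mean(Xtrain);  s = std(Xtrain);                     % scaling parameters from the training set
Xn = bsxfun(@rdivide, bsxfun(@minus, Xtrain, mu), s);    % preprocess the training set
[U, S, V] = svd((1 / size(Xn, 1)) * (Xn' * Xn));         % PCA on the training set only
Ureduce = U(:, 1:k);
Ztrain = Xn * Ureduce;                                   % features fed to the learning algorithm
% Re-use the SAME mu, s and Ureduce for cross-validation / test examples (no re-fitting)
Ztest = bsxfun(@rdivide, bsxfun(@minus, Xtest, mu), s) * Ureduce;
```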