Machine Learning Week 8 - Unsupervised Learning



Unsupervised Learning Introduction

  • What is clustering good for
    • Market segmentation - group customers into different market segments
    • Social network analysis - Facebook "smartlists"
    • Organizing computer clusters and data centers for network layout and location
    • Astronomical data analysis - Understanding galaxy formation

K-Means Algorithm


  • Take unlabeled data and group into two clusters

  • Algorithm overview

    1. Randomly allocate two points as the cluster centroids

    2. Cluster assignment step

      • Go through each example and, depending on whether it's closer to the red or blue centroid, assign each point to one of the two clusters
    • To demonstrate this, we've gone through the data and coloured each point red or blue

  3. Move centroid step

    • Take each centroid and move it to the average of the correspondingly assigned data points

  4. Repeat steps 2 and 3 until convergence
  • More formal definition

    • Input:

      • K (number of clusters in the data)
      • Training set {x1, x2, ..., xm}
    • Algorithm:

      • Randomly initialize K cluster centroids as {μ1, μ2, μ3 ... μK}

      • What if there's a centroid with no data

        • Remove that centroid, so you end up with K-1 clusters
        • Or, randomly reinitialize it
          • Not sure when though...
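The algorithm above can be sketched in Python with NumPy (a minimal illustration rather than the lecture's Octave code; the function name and the choice to leave empty centroids in place are mine):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is (m, n); returns centroids (K, n) and assignments (m,)."""
    rng = np.random.default_rng(seed)
    # Randomly pick K training examples as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Cluster assignment step: index of the closest centroid for each example
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step: average of the correspondingly assigned points
        for k in range(K):
            if np.any(c == k):          # an empty centroid is simply left in place
                centroids[k] = X[c == k].mean(axis=0)
    return centroids, c
```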

Optimization Objective

  • While K-means is running we keep track of two sets of variables

    • ci is the index of the cluster {1, 2, ..., K} to which example xi is currently assigned
      • i.e. there are m ci values, as each example has a ci value, and that value is one of the clusters (i.e. can only be one of K different values)
    • μk is the cluster centroid of cluster k
      • i.e. the location of cluster centroid k
      • So there are K of these
      • These centroids exist in the same space as the training data
    • μci is the cluster centroid of the cluster to which example xi has been assigned
      • This is more for convenience than anything else
        • You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
        • Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μj)
        • But instead, for easy description, we have this variable which gets exactly the same value
      • Lets say xi has been assigned to cluster 5
        • Means that
          • ci = 5
          • μci = μ5
  • Using this notation we can write the optimization objective:

    • J(c1, ..., cm, μ1, ..., μK) = (1/m) Σi ||xi - μci||^2

    • This is sometimes called the distortion (or distortion cost function)

    • We are finding the values of c and μ which minimize this function
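As a sketch, the distortion can be computed directly with NumPy (the function name is mine; `centroids[c]` implements the μci lookup described above):

```python
import numpy as np

def distortion(X, centroids, c):
    """J(c, mu) = (1/m) * sum over i of ||xi - mu_ci||^2."""
    diffs = X - centroids[c]       # centroids[c] looks up mu_ci for every example
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```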

Random initialization

  • Set the number of centroids to less than the number of examples (K < m) (if K > m we have a problem)

    • Randomly pick K training examples
    • Set μ1 up to μK to these examples' values
  • K means can converge to different solutions depending on the initialization setup

    • Risk of local optimum

    • The local optima are valid convergence points, but they are local optima, not global ones

  • Algorithm as follows

    • A typical number of random initializations is 50-1000
    • e.g. with 100 random initializations
      • For each of the 100 random initializations, run K-means to convergence
      • Then compute the distortion for the resulting set of cluster assignments and centroids
      • End up with 100 ways of clustering the data
      • Pick the clustering which gave the lowest distortion
  • If we're running K-means with 2-10 clusters, multiple random initializations can help find a better global optimum

    • If K is larger than 10, multiple random initializations are less likely to be necessary
    • The first solution is probably good enough (with finer granularity of clustering, local optima matter less)
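The multiple-random-initialization loop can be sketched as follows (a self-contained toy version; the fixed iteration counts and function name are my choices):

```python
import numpy as np

def best_of_n_inits(X, K, n_init=100, n_iters=50, seed=0):
    """Run K-means n_init times from random starts; keep the lowest-distortion run."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        # Random initialization: K distinct training examples as centroids
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        for _ in range(n_iters):
            c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
            for k in range(K):
                if np.any(c == k):
                    centroids[k] = X[c == k].mean(axis=0)
        # Distortion at convergence for this initialization
        J = float(np.mean(np.sum((X - centroids[c]) ** 2, axis=1)))
        if best is None or J < best[0]:
            best = (J, centroids, c)
    return best
```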

Choosing the Number of Clusters

Doing it automatically is hard; normally visualizations are used to choose K manually.

Elbow method

  • Vary K and compute cost function at a range of K values

  • As K increases, the minimum value of J(...) should decrease (i.e. finer granularity means the centroids can better fit the data)

    • Plot this (K vs J())

  • Risks

    • Normally you don't get a nice line -> no clear elbow on the curve
    • Not really that helpful
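A sketch of how the K-vs-J curve might be computed (single random initialization per K for simplicity; the function name is mine):

```python
import numpy as np

def elbow_curve(X, k_values, n_iters=50, seed=0):
    """Run K-means once per K and record the final distortion J, for a K-vs-J plot."""
    rng = np.random.default_rng(seed)
    Js = []
    for K in k_values:
        # Single random initialization per K, for simplicity
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        for _ in range(n_iters):
            c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
            for k in range(K):
                if np.any(c == k):
                    centroids[k] = X[c == k].mean(axis=0)
        Js.append(float(np.mean(np.sum((X - centroids[c]) ** 2, axis=1))))
    return Js
```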

Dimensionality Reduction


Motivation I: Data Compression


  • What does dimensionality reduction mean?
    • In our example we plot a line

    • Take exact example and record position on that line

    • So before x1 was a 2D feature vector (X and Y dimensions)

    • Now we can represent x1 as a 1D number (Z dimension)

  • Another example: 3D -> 2D

    • Here's our data

    • In the diagram below, imagine all our data points are sitting "inside" the blue tray (has a dark blue exterior face and a light blue inside)

    • Plot values along those projections

Motivation II: Visualization

  • Example
    • Collect a large data set of many facts about countries around the world

      • So
        • x1 = GDP
        • ...
        • x6 = mean household
      • Say we have 50 features per country
      • How can we understand this data better?
        • Very hard to plot 50 dimensional data
    • Using dimensionality reduction, instead of each country being represented by a 50-dimensional feature vector

      • Come up with a different feature representation (z values) which summarize these features

Principal Component Analysis

Principal Component Analysis Problem Formulation

  • For the problem of dimensionality reduction the most commonly used algorithm is PCA

  • So

    • We have a 2D data set which we wish to reduce to 1D

    • In other words, find a single line onto which to project this data

      • How do we determine this line?

        • The distance between each point and the projected version should be small (blue lines below are short)

        • PCA tries to find a lower dimensional surface so the sum of squares onto that surface is minimized

        • The blue lines are sometimes called the projection error

          • PCA tries to find the surface (a straight line in this case) which has the minimum projection error

        • As an aside, you should normally do mean normalization and feature scaling on your data before PCA

  • A more formal description is

    • For 2D-1D, we must find a vector u(1), which is of some dimensionality

    • Onto which you can project the data so as to minimize the projection error

    • u(1) can be positive or negative (-u(1)) which makes no difference

      • Each of the vectors define the same red line
  • In the more general case

    • To reduce from nD to kD we
      • Find k vectors (u(1), u(2), ... u(k)) onto which to project the data to minimize the projection error
      • So lots of vectors onto which we project the data
      • Find a set of vectors which we project the data onto the linear subspace spanned by that set of vectors
        • We can define a point in a plane with k vectors

Principal Component Analysis Algorithm

  • Before applying PCA must do data preprocessing

    • Given a set of m unlabeled examples we must do
      • Mean normalization
        • Replace each xji with xji - μj
          • In other words, determine the mean of each feature set, and then for each feature subtract the mean from the value, so we re-scale the mean to be 0
      • Feature scaling (depending on data)
        • If features have very different scales then scale so they all have a comparable range of values
          • e.g. xji is set to (xji - μj) / sj
            • Where sj is some measure of the range, so could be
              • Biggest - smallest
              • Standard deviation (more commonly)
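The preprocessing steps can be sketched as follows (assuming standard deviation as the scaling measure; the function name is mine):

```python
import numpy as np

def normalize(X):
    """Mean-normalize and scale each feature: (xj - mu_j) / sj, with sj = std dev."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    s[s == 0] = 1.0          # guard against dividing by zero for constant features
    return (X - mu) / s, mu, s
```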
  • With preprocessing done, PCA finds the lower dimensional sub-space which minimizes the sum of the square

    • In summary, for 2D->1D we'd be doing something like this;

    • Need to compute two things;

      • Compute the u vectors
        • The new planes
      • Need to compute the z vectors
        • z vectors are the new, lower dimensionality feature vectors
    • A mathematical derivation for the u vectors is very complicated

      • But once you've done it, the procedure to find each u vector is not that hard

Algorithm description

  • Reducing data from n-dimensional to k-dimensional

    • Compute the covariance matrix

      • This is commonly denoted as Σ (greek upper case sigma) - NOT the summation symbol

      • Σ = (1/m) Σi (xi)(xi)T

        • This is an [n x n] matrix
          • Remember that xi is an [n x 1] vector, so (xi)(xi)T is [n x n]
      • In MATLAB or Octave we can implement this as Sigma = (1/m) * X' * X; where X is the [m x n] matrix with one example per row

    • Compute eigenvectors of matrix Σ

      • [U,S,V] = svd(Sigma)
        • svd = singular value decomposition
          • More numerically stable than eig
        • eig also gives the eigenvectors
    • U,S and V are matrices

      • U matrix is also an [n x n] matrix

      • Turns out the columns of U are the u vectors we want!

      • So to reduce a system from n-dimensions to k-dimensions

        • Just take the first k-vectors from U (first k columns)

  • Next we need to find some way to change x (which is n dimensional) to z (which is k dimensional)

    • (reduce the dimensionality)
    • Take first k columns of the u matrix and stack in columns
      • n x k matrix - call this Ureduce
    • We calculate z as follows
      • z = (Ureduce)T * x
        • So [k x n] * [n x 1]
        • Generates a matrix which is
          • k * 1
        • If that's not witchcraft I don't know what is!
  • So in summary

    • Preprocessing
    • Calculate sigma (covariance matrix)
    • Calculate eigenvectors with svd
    • Take k vectors from U (Ureduce = U(:,1:k);)
    • Calculate z (z = Ureduce' * x;)
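The summary steps translate to NumPy roughly as follows (a sketch assuming X is already mean-normalized, one example per row; the function name is mine):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the covariance matrix, mirroring the Octave steps above."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m              # covariance matrix; assumes X is mean-normalized
    U, S, Vt = np.linalg.svd(Sigma)    # columns of U are the principal directions
    U_reduce = U[:, :k]                # first k columns  (Octave: U(:,1:k))
    Z = X @ U_reduce                   # z = Ureduce' * x for every example at once
    return Z, U_reduce, S
```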

Applying PCA

Reconstruction from Compressed Representation

  • Reconstruction

    • We have an example as follows

    • We have our examples (x1, x2 etc.)

    • Project onto z-surface

    • Given a point z1, how can we go back to the 2D space?

  • xapprox = Ureduce * z

    • To consider dimensions (and prove this really works)
      • Ureduce = [n x k]
      • z = [k x 1]
    • So
      • xapprox = [n x 1]
  • So this creates the following representation
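A tiny numeric sketch of the reconstruction (all values are made up for illustration):

```python
import numpy as np

# Hypothetical example values: n = 2, k = 1
U_reduce = np.array([[0.8], [0.6]])   # [n x k], unit-length column
z = np.array([5.0])                   # [k x 1] compressed representation
x_approx = U_reduce @ z               # [n x 1] reconstruction -> [4.0, 3.0]
```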

Choosing the Number of Principal Components

  • k = number of principal components

  • PCA tries to minimize averaged squared projection error

  • Total variation in data can be defined as the average over data saying how far are the training examples from the origin

  • When we're choosing k it is typical to use something like this

    • Ratio of the averaged squared projection error to the total variation in the data
      • [ (1/m) Σi ||xi - xapproxi||^2 ] / [ (1/m) Σi ||xi||^2 ]
  • Want this ratio to be small, e.g. ≤ 0.01 - means we retain 99% of the variance

    • If it's small (close to 0) then this is because the numerator is small
      • The numerator is small when xi = xapproxi
        • i.e. we lose very little information in the dimensionality reduction, so when we decompress we regenerate the same data
  • So we choose k in terms of this ratio

  • Often can significantly reduce data dimensionality while retaining the variance

  • How do you do this

    • Rather than re-running PCA for each candidate k, use the diagonal matrix S returned by [U,S,V] = svd(Sigma)
    • For a given k, the variance retained is (Σi=1..k Sii) / (Σi=1..n Sii)
    • Pick the smallest k for which this is ≥ 0.99
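Assuming S holds the singular values of the covariance matrix (as returned by svd, sorted in decreasing order), the variance-retained rule can be sketched as (the function name is mine):

```python
import numpy as np

def choose_k(S, variance_retained=0.99):
    """Smallest k with (sum of first k singular values) / (sum of all) >= threshold."""
    ratios = np.cumsum(S) / np.sum(S)
    # min() guards against floating-point round-off at the top end
    return int(min(np.searchsorted(ratios, variance_retained) + 1, len(S)))
```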

Advice for Applying PCA

  • How
    1. Extract the xs (set the labels y aside for now)
      • So we now have an unlabeled training set
    2. Apply PCA to x vectors
      • So we now have a reduced dimensional feature vector z
    3. This gives you a new training set
      • Each vector can be re-associated with the label
    4. Take the reduced dimensionality data set and feed to a learning algorithm
      • Use y as labels and z as feature vector
    5. If you have a new example map from higher dimensionality vector to lower dimensionality vector, then feed into learning algorithm
  • PCA maps one vector to a lower dimensionality vector
    • x -> z
    • Defined by PCA only on the training set
    • The mapping computes a set of parameters
      • Feature scaling values
      • Ureduce
        • Parameter learned by PCA
        • Should be obtained only by determining PCA on your training set
    • So we use those learned parameters for our
      • Cross validation data
      • Test set
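Keeping the mapping parameters from the training set only might look like this (a sketch; the function names are mine):

```python
import numpy as np

def fit_pca_mapping(X_train, k):
    """Learn mean, scale, and Ureduce on the TRAINING set only."""
    mu = X_train.mean(axis=0)
    s = X_train.std(axis=0)
    s[s == 0] = 1.0                      # guard against constant features
    Xn = (X_train - mu) / s
    Sigma = (Xn.T @ Xn) / len(Xn)        # covariance of the normalized data
    U, _, _ = np.linalg.svd(Sigma)
    return mu, s, U[:, :k]

def apply_pca_mapping(X, mu, s, U_reduce):
    """Apply the learned mapping to new (e.g. CV or test) data - no refitting."""
    return ((X - mu) / s) @ U_reduce
```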

