Density Estimation
Problem Motivation
 We have a dataset which contains normal (non-anomalous) data
 How we ensure they're normal is up to us
 In reality it's OK if there are a few which aren't actually normal
 Using that dataset as a reference point we can see if other examples are anomalous
First, using our training dataset we build a model
 We can access this model using p(x)
 This asks, "What is the probability that example x is normal"
Having built a model
 if p(x_{test}) < ε -> flag this as an anomaly
 if p(x_{test}) >= ε -> this is OK
 ε is some threshold probability value which we define, depending on how sure we need/want to be
We expect our model to (graphically) look something like this
Gaussian Distribution
Also called the normal distribution
Example
Say x (data set) is made up of real numbers
 Mean is μ
 Variance is σ^{2}
 σ is also called the standard deviation; it specifies the width of the Gaussian curve
 The data has a Gaussian distribution
Then we can write this as x ~ N(μ, σ^{2})
 ~ means "is distributed as"
 N (should really be a "script" N - even curlier!) means the normal distribution
 μ, σ^{2} represent the mean and variance, respectively
 These are the two parameters a Gaussian needs
Looks like this;
Gaussian equation is
 p(x; μ, σ^{2}) = (1 / (√(2π) σ)) * exp(-(x - μ)^{2} / (2σ^{2})) (probability of x, parameterized by the mean and variance)
Some examples of Gaussians below
 Area is always the same (must = 1)
 But width changes as standard deviation changes
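As a quick sketch (not from the lecture), we can check both properties numerically with numpy: the curve gets lower and wider as σ^{2} grows, but the area stays 1:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# A wider sigma^2 gives a lower, wider curve, but the area under each is always 1
xs = np.linspace(-20, 20, 200_001)
dx = xs[1] - xs[0]
narrow = gaussian_pdf(xs, 0.0, 0.5)
wide = gaussian_pdf(xs, 0.0, 4.0)
area_narrow = narrow.sum() * dx   # numeric approximation of the integral
area_wide = wide.sum() * dx
```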
Estimating μ and σ^{2}
μ = (1/m) Σ_{i=1}^{m} x^{i} (the average of the examples)
σ^{2} = (1/m) Σ_{i=1}^{m} (x^{i} - μ)^{2} (the average squared deviation from the mean)
As a side comment
These parameters are the maximum likelihood estimation values for μ and σ^{2}
You can use either 1/m or 1/(m-1) - it doesn't make too much difference
 Slightly different mathematical problems, but in practice it makes little difference
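A minimal numpy sketch of the two estimators (the true μ = 3, σ^{2} = 4 here are illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=100_000)   # true mu = 3, sigma^2 = 4

mu = data.mean()                                      # (1/m) * sum of x^(i)
sigma2_ml = ((data - mu) ** 2).mean()                 # 1/m (maximum likelihood) version
sigma2_unbiased = ((data - mu) ** 2).sum() / (len(data) - 1)   # 1/(m-1) version
```

With a large m the two variance estimates are essentially identical, which is why the choice makes little practical difference.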
Algorithm
 Unlabeled training set of m examples
Data = {x_{1}, x_{2}, ..., x_{m} }
 Each example is an n-dimensional vector (i.e. a feature vector)
Model P(x) from the data set
 What are high probability features and low probability features
 x is a vector
 So model p(x) as
 = p(x_{1}; μ_{1}, σ_{1}^{2}) * p(x_{2}; μ_{2}, σ_{2}^{2}) * ... * p(x_{n}; μ_{n}, σ_{n}^{2})
 Multiply together the probabilities of each feature
 We model each of the features by assuming each feature is distributed according to a Gaussian distribution
 p(x_{i}; μ_{i} , σ_{i}^{2})
 The probability of feature x_{i} given μ_{i} and σ_{i}^{2}, using a Gaussian distribution
As a side comment
 Turns out this equation makes an independence assumption for the features, although algorithm works if features are independent or not
 Don't worry too much about this, although if your features are tightly linked you should be able to do some dimensionality reduction anyway!
We can write this chain of multiplication more compactly using product notation: p(x) = Π_{j=1}^{n} p(x_{j}; μ_{j}, σ_{j}^{2})
The problem of estimating this distribution is sometimes called the problem of density estimation
Algorithm
 1 - Choose features
 Try to come up with features which might help identify something anomalous - features which may take unusually large or small values for anomalies
 More generally, choose features which describe the general properties of the examples
 This is nothing unique to anomaly detection  it's just the idea of building a sensible feature vector
 2 - Fit parameters
 Determine the parameters μ_{i} and σ_{i}^{2} for each of your features
 Fit is a bit misleading, really should just be "Calculate parameters for 1 to n"
 So you're calculating standard deviation and mean for each feature
 You should of course use a vectorized implementation rather than a loop
 3 - Compute p(x)
 You compute the formula shown (i.e. the formula for the Gaussian probability)
 If the number is very small, there's a very low chance of the example being "normal"
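The three steps above can be sketched in numpy as follows (the two features and the threshold are illustrative assumptions, not from the lecture):

```python
import numpy as np

def fit(X):
    """Step 2: per-feature mean and variance from an (m, n) matrix of normal examples."""
    return X.mean(axis=0), X.var(axis=0)   # X.var uses the 1/m convention

def p(x, mu, sigma2):
    """Step 3: p(x) = product over features j of p(x_j; mu_j, sigma2_j)."""
    d = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return d.prod(axis=-1)

# usage: two hypothetical features (e.g. heat and vibration of a machine)
rng = np.random.default_rng(1)
X_train = rng.normal([0.0, 5.0], [1.0, 2.0], size=(5000, 2))
mu, sigma2 = fit(X_train)
eps = 1e-4
anomalous = p(np.array([8.0, -4.0]), mu, sigma2) < eps   # far from both feature means
```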
Building an Anomaly Detection System
Developing and Evaluating an Anomaly Detection System
 Algorithm evaluation
 Take training set { x^{1}, x^{2}, ..., x^{m} }
 Fit model p(x)
 On cross validation and test set, test the example x
 y = 1 if p(x) < epsilon (anomalous)
 y = 0 if p(x) >= epsilon (normal)
 Think of the algorithm as trying to predict if something is anomalous
 But you have a label so can check!
 Makes it look like a supervised learning algorithm
 What's a good metric to use for evaluation
 y = 0 is very common
 So classification accuracy would be a bad metric (the classes are very skewed)
 Compute fraction of true positives/false positive/false negative/true negative
 Compute precision/recall
 Compute F1score
 Earlier, also had epsilon (the threshold value)
 Threshold to show when something is anomalous
 If you have a CV set you can see how varying epsilon affects various evaluation metrics
 Then pick the value of epsilon which maximizes the score on your CV set
 Evaluate algorithm using cross validation
 Do final algorithm evaluation on the test set
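A sketch of picking epsilon by maximizing F1 on the CV set (the threshold grid and scoring are assumptions, not from the lecture):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds on the CV set and keep the one maximizing F1.
    p_cv: density p(x) of each CV example; y_cv: 1 = anomaly, 0 = normal."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)       # flag as anomaly when p(x) < eps
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```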
Anomaly Detection vs. Supervised Learning
Anomaly detection
 Very small number of positive examples
 Save positive examples just for CV and test set
 Consider using an anomaly detection algorithm
 Not enough data to "learn" positive examples
 Have a very large number of negative examples
 Use these negative examples for p(x) fitting
 Only need negative examples for this
 Many "types" of anomalies
 Hard for an algorithm to learn from positive examples when anomalies may look nothing like one another
 So anomaly detection doesn't know what they look like, but knows what they don't look like
 When we looked at SPAM email,
 Many types of SPAM
 For the spam problem, usually enough positive examples
 So this is why we usually think of SPAM as supervised learning
 Applications and why they're anomaly detection
 Fraud detection
 Many different ways fraud may be committed
 If you're a major online retailer/very subject to attacks, you might sometimes shift to supervised learning
 Manufacturing
 If you make HUGE volumes maybe you have enough positive data -> make it supervised
 Means you make an assumption about the kinds of errors you're going to see
 It's the unknown unknowns we don't like!
 Monitoring machines in a data center
Supervised learning
 Reasonably large number of positive and negative examples
 Have enough positive examples to give your algorithm the opportunity to see what they look like
 If you expect anomalies to look anomalous in the same way
 Application
 Email/SPAM classification
 Weather prediction
 Cancer classification
Choosing What Features to Use
Non-Gaussian features
Plot a histogram of the data to check it has a Gaussian description - a nice sanity check
 Often still works if data is nonGaussian
Non-Gaussian data might look like this
Might take a log transformation of the data
i.e. if you have some feature x_{1}, replace it with log(x_{1})
Or do log(x_{1}+c)
 Play with c to make it look as Gaussian as possible
Or do x_{1}^{1/2} (i.e. take the square root)
Or do x_{1}^{1/3}
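A numpy sketch of these transforms; the skewed input data and the skewness check are my own illustrative assumptions (the lecture's advice is to eyeball histograms):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.exponential(scale=2.0, size=50_000)   # strongly right-skewed (non-Gaussian)

# Candidate transforms - plot a histogram of each and keep whichever looks most Gaussian
t_log = np.log(x1 + 1.0)      # log(x1 + c); tune c by eye
t_sqrt = x1 ** 0.5            # x1^(1/2)
t_cbrt = x1 ** (1.0 / 3.0)    # x1^(1/3)

def skewness(v):
    """Rough numeric proxy for symmetry; close to 0 for Gaussian-looking data."""
    z = (v - v.mean()) / v.std()
    return float((z ** 3).mean())
```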
Error analysis for anomaly detection
 Like supervised learning error analysis procedure
 Run algorithm on CV set
 See which ones it got wrong
 Develop new features based on trying to understand why the algorithm got those examples wrong
Example
We want p(x) to be large for normal examples and small for abnormal examples
Multivariate Gaussian Distribution(Optional)
Multivariate Gaussian Distribution
Model p(x) all in one go, instead of each feature separately
 What are the parameters for this new model?
 μ  which is an n dimensional vector (where n is number of features)
 Σ  which is an [n x n] matrix  the covariance matrix
For the sake of completeness, the formula for the multivariate Gaussian distribution is as follows
 p(x; μ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) * exp(-(1/2) (x - μ)^{T} Σ^{-1} (x - μ))
 |Σ| = absolute value of Σ (determinant of Σ)

Anomaly detection algorithm with multivariate Gaussian distribution
Fit model - take the data set and calculate μ = (1/m) Σ_{i} x^{i} and Σ = (1/m) Σ_{i} (x^{i} - μ)(x^{i} - μ)^{T}
We're next given a new example (x_{test})  see below
 For it compute p(x) using the following formula for multivariate distribution
 Compare the value with ε (threshold probability value)
 if p(x_{test}) < ε -> flag this as an anomaly
 if p(x_{test}) >= ε -> this is OK
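A numpy sketch of the whole procedure; the correlated two-feature data is an illustrative assumption:

```python
import numpy as np

def fit_multivariate(X):
    """Estimate mu (n-vector) and Sigma (n x n covariance matrix) from (m, n) data."""
    mu = X.mean(axis=0)
    D = X - mu
    return mu, D.T @ D / len(X)

def p_multivariate(x, mu, Sigma):
    """p(x; mu, Sigma) = (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))."""
    n = len(mu)
    d = x - mu
    norm = ((2 * np.pi) ** (n / 2)) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / norm

# usage: two positively correlated features
rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=20_000)
mu, Sigma = fit_multivariate(X)

along = p_multivariate(np.array([2.0, 2.0]), mu, Sigma)     # follows the correlation
against = p_multivariate(np.array([2.0, -2.0]), mu, Sigma)  # breaks the correlation
```

The point that breaks the feature correlation gets a much lower density, which is exactly what the per-feature model misses.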
If you fit a multivariate Gaussian model to our data we build something like this
 Which means it's likely to identify the green value as anomalous
Finally, we should mention how multivariate Gaussian relates to our original simple Gaussian model (where each feature is looked at individually)
Original model corresponds to multivariate Gaussian where the Gaussians' contours are axis aligned
i.e. the normal Gaussian model is a special case of multivariate Gaussian distribution
This can be shown mathematically
Has the constraint that the covariance matrix Σ has ZEROs on the non-diagonal values
If you plug your variance values into the covariance matrix the models are actually identical
Original model vs. Multivariate Gaussian
Original Gaussian model
Probably used more often
There is a need to manually create features to capture anomalies where x_{1} and x_{2} take unusual combinations of values
 So need to make extra features
 Might not be obvious what they should be
 This is always a risk  where you're using your own expectation of a problem to "predict" future anomalies
 Typically, the things that catch you out aren't going to be the things you thought of
 If you thought of them they'd probably be avoided in the first place
 Obviously this is a bigger issue, and one which may or may not be relevant depending on your problem space
Much cheaper computationally
Scales much better to very large feature vectors
 Even if n = 100,000 the original model works fine
Works well even with a small training set
 e.g. 50, 100
Because of these factors it's used more often - it's an optimized but axis-aligned specialization of the general model
Multivariate Gaussian model
 Used less frequently
 Can capture feature correlation
 So no need to create extra values
 Less computationally efficient
 Must compute inverse of matrix which is [n x n]
 So lots of features is bad  makes this calculation very expensive
 So if n = 100,000 it's not very good
 Needs m > n
 i.e. number of examples must be greater than number of features
 If this is not true then we have a singular matrix (noninvertible)
 So should be used only if m >> n
 If you find the matrix is noninvertible, could be for one of two main reasons
 m < n
 So use original simple model
 Redundant features (i.e. linearly dependent)
 i.e. two features that are the same
 If this is the case you could use PCA or sanity check your data
Recommender Systems
Predicting Movie Ratings
Problem Formulation
 You're a company that sells movies
 You let users rate movies using a 1-5 star rating
 To make the example nicer, allow 0-5 (makes the math easier)
 To introduce some notation
 n_{u} - Number of users (sometimes written "nu" as we can't nest a subscript inside a superscript)
 n_{m} - Number of movies
 r(i, j)  1 if user j has rated movie i (i.e. bitmap)
 y^{(i,j)} - rating given by user j to movie i (defined only if r(i,j) = 1)
So for this example
 n_{u} = 4
n_{m} = 5
 Summary of scoring
 Alice and Bob gave good ratings to rom coms, but low scores to action films
 Carol and Dave gave good ratings for action films but low ratings for rom coms
 We have the data given above
 The problem is as follows
 Given r(i,j) and y^{(i,j)}  go through and try and predict missing values (?s)
 Come up with a learning algorithm that can fill in these missing values
Content Based Recommendations
For each movie we have a feature which measures the degree to which each film is a
Romance (x_{1})
Action (x_{2})
If we have features like these, each film can be represented by a feature vector
Add an extra feature which is x_{0} = 1 for each film
So for each film we have a [3 x 1] vector, which for film number 1 ("Love at Last") would be
i.e. for our dataset we have
 {x_{1}, x_{2}, x_{3}, x_{4}, x_{5}}
 Where each of these is a [3x1] vector with an x_{0} = 1 and then a romance and an action score
To be consistent with our notation, n is going to be the number of features NOT counting the x_{0} term, so n = 2
We could treat each rating for each user as a separate linear regression problem
For each user j we could learn a parameter vector
Then predict that user j will rate movie i with
 (θ^{j})^{T} x^{i} = stars
 inner product of parameter vector and features
So, let's take user 1 (Alice) and see what she makes of the modern classic Cute Puppies of Love (CPOL)
We have some parameter vector (θ^{1}) associated with Alice
We'll explain later how we derived these values, but for now just take it that we have a vector
CPOL has a feature vector (x^{3}) associated with it
Our prediction will be equal to
 (θ^{1})^{T} x^{3} = (0 * 1) + (5 * 0.99) + (0 * 0) = 4.95
 Which may seem like a reasonable value
All we're doing here is applying a linear regression method for each user
 So we determine a future rating based on their interest in romance and action based on previous films
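The Alice/CPOL prediction above as a one-line numpy sketch (the θ and x values are the lecture's illustrative numbers):

```python
import numpy as np

# Alice's learned preferences and the feature vector for "Cute Puppies of Love"
# (x0 = 1 bias term comes first in both vectors)
theta_alice = np.array([0.0, 5.0, 0.0])   # [bias, romance weight, action weight]
x_cpol = np.array([1.0, 0.99, 0.0])       # [x0, romance score, action score]

predicted_stars = theta_alice @ x_cpol    # inner product (theta^1)^T x^3
```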
We should also add one final piece of notation
 m^{j} - Number of movies rated by user j
How do we learn (θ^{j})
Create some parameters which, when applied, give values as close as possible to those seen in the data
Sum over all values of i where r(i,j) = 1 (i.e. all the films that user j has rated)
This is just like linear regression with least-squared error: minimize over θ^{j} the quantity (1/(2m^{j})) Σ_{i:r(i,j)=1} ((θ^{j})^{T} x^{i} - y^{(i,j)})^{2}
We can also add a regularization term to make our equation look as follows
 The regularization term goes from k=1 through to n, so (θ^{j}) ends up being an n+1 dimensional vector
 Don't regularize over the bias term (θ_{0})
 If you do this you get a reasonable value
We're rushing through this a bit, but it's just a linear regression problem
To make this a little bit clearer you can get rid of the 1/m^{j} term (it's just a constant so shouldn't make any difference to minimization)
So to learn (θ^{j})
But for our recommender system we want to learn parameters for all users, so we add an extra summation term to this which means we determine the minimum (θ^{j}) value for every user
When you do this as a function of each (θ^{j}) parameter vector you get the parameters for each user
 So this is our optimization objective > J(θ^{1}, ..., θ^{nu})
In order to do the minimization we have the following gradient descent
Slightly different to our previous gradient descent implementations
k = 0 and k != 0 versions
We can define the middle term above as
Difference from linear regression
 No 1/m terms (got rid of the 1/m term)
 Otherwise very similar
This approach is called contentbased approach because we assume we have features regarding the content which will help us identify things that make them appealing to a user
 However, often such features are not available - next we discuss a non-content based approach!
Collaborative Filtering
Collaborative Filtering
The collaborative filtering algorithm has a very interesting property  does feature learning
 i.e. it can learn for itself what features it needs to learn
So  let's change the problem and pretend we have a data set where we don't know any of the features associated with the films
Now let's make a different assumption
We've polled each user and found out how much each user likes
 Romantic films
 Action films
Which has generated the following parameter set
Alice and Bob like romance but hate action
Carol and Dave like action but hate romance
This is a bit of a simplification in terms of the maths, but what we're really asking is
 What feature vector should x^{1} be so that
 (θ^{1})^{T} x^{1} is about 5
 (θ^{2})^{T} x^{1} is about 5
 (θ^{3})^{T} x^{1} is about 0
 (θ^{4})^{T} x^{1} is about 0
From this we can guess that x^{1} may be
Using that same approach we should then be able to determine the remaining feature vectors for the other films
Formalizing the collaborative filtering problem
We can more formally describe the approach as follows
Given (θ^{1}, ..., θ^{nu}) (i.e. given the parameter vectors for each user's preferences)
We must minimize an optimization function which tries to identify the best parameter vector associated with a film
 So we're summing over all the indices j for where we have data for movie i
 We're minimizing this squared error
Like before, the above equation gives us a way to learn the features for one film
 We want to learn all the features for all the films  so we need an additional summation term
How does this work with the previous recommendation system
 Content based recommendation systems
 Saw that if we have a set of features for movie rating you can learn a user's preferences
 Now
 If you have your users preferences you can therefore determine a film's features
 This is a bit of a chicken & egg problem
 What you can do is
 Randomly guess values for θ
 Then use collaborative filtering to generate x
 Then use content based recommendation to improve θ
 Use that to improve x
 And so on
 This actually works
 Causes your algorithm to converge on a reasonable set of parameters
 This is collaborative filtering
 We call it collaborative filtering because in this example the users are collaborating together to help the algorithm learn better features, which helps the system and the other users
Collaborative filtering Algorithm
Here we combine the ideas from before to build a collaborative filtering algorithm
Our starting point is as follows
If we're given the film's features we can use that to work out the users' preference
If we're given the users' preferences we can use them to work out the film's features
One thing you could do is
 Randomly initialize the parameters
 Go back and forth between solving for θ and x
But there's a more efficient algorithm which can solve θ and x simultaneously
Define a new optimization objective which is a function of x and θ
Understanding this optimization objective
The squared error term is the same as the squared error term in the two individual objectives above
 So it's summing over every movie rated by every user
 Note the ":" means, "for which"
 Sum over all pairs (i,j) for which r(i,j) is equal to 1
The regularization terms
 Are simply added to the end from the original two optimization functions
This newly defined function has the property that
 If you held x constant and only solved θ then you'd solve the "Given x, solve θ" objective above
 Similarly, if you held θ constant you could solve x
In order to come up with just one optimization function we treat this function as a function of both film features x and user parameters θ
 Only difference between this and the back-and-forth approach is that we minimize with respect to both x and θ simultaneously
When we're learning the features this way
 Previously had a convention that we have an x_{0} = 1 term
 When we're using this kind of approach we have no x_{0},
 So now our vectors (both x and θ) are ndimensional (not n+1)
 We do this because we are now learning all the features so if the system needs a feature always = 1 then the algorithm can learn one
Algorithm Structure
 Initialize θ^{1}, ..., θ^{nu} and x^{1}, ..., x^{nm} to small random values
 A bit like neural networks  initialize all parameters to small random numbers
Minimize the cost function J(x^{1}, ..., x^{nm}, θ^{1}, ..., θ^{nu}) using gradient descent
 We find that the update rules look like this
 Where the top term is the partial derivative of the cost function with respect to x_{k}^{i} while the bottom is the partial derivative of the cost function with respect to θ_{k}^{j}
 So here we regularize EVERY parameter (there is no x_{0} = 1 term any more) so there's no special case update rule
Having minimized the values, given a user (user j) with parameters θ and movie (movie i) with learned features x, we predict a star rating of (θ^{j})^{T} x^{i}
 This is the collaborative filtering algorithm, which should give pretty good predictions for how users like new movies
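The whole algorithm can be sketched in numpy as below. The tiny synthetic ratings matrix, learning rate, and λ are my own illustrative assumptions; the update rules match the gradients described above:

```python
import numpy as np

def cofi_step(X, Theta, Y, R, lam, alpha):
    """One gradient descent step on the joint objective J(x^1..x^nm, theta^1..theta^nu).
    X: (n_m, n) movie features, Theta: (n_u, n) user parameters,
    Y: (n_m, n_u) ratings, R: (n_m, n_u) with 1 where a rating exists."""
    E = (X @ Theta.T - Y) * R            # errors, zeroed where r(i,j) = 0
    grad_X = E @ Theta + lam * X         # partial derivatives w.r.t. x_k^(i)
    grad_Theta = E.T @ X + lam * Theta   # partial derivatives w.r.t. theta_k^(j)
    return X - alpha * grad_X, Theta - alpha * grad_Theta

# usage: tiny synthetic example with one missing rating
rng = np.random.default_rng(3)
n_m, n_u, n = 5, 4, 2
Y = rng.uniform(0, 5, (n_m, 1)) @ rng.uniform(0.2, 1, (1, n_u))   # synthetic ratings
R = np.ones((n_m, n_u)); R[2, 1] = 0                              # one unrated pair

X = rng.normal(0, 0.1, (n_m, n))         # small random initialization
Theta = rng.normal(0, 0.1, (n_u, n))
before = float((((X @ Theta.T - Y) * R) ** 2).sum())
for _ in range(20_000):
    X, Theta = cofi_step(X, Theta, Y, R, lam=0.01, alpha=0.01)
after = float((((X @ Theta.T - Y) * R) ** 2).sum())
guess = (X @ Theta.T)[2, 1]              # prediction for the unrated (movie, user) pair
```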
Low Rank Matrix Factorization
Vectorization: Low Rank Matrix Factorization
We start by working out another way of writing out our predictions
So take all ratings by all users in our example above and group into a matrix Y
 5 movies and 4 users, so we get a [5 x 4] matrix
Given [Y] there's another way of writing out all the predicted ratings
 With this matrix of predictive ratings
 We determine the (i,j) entry for EVERY movie
We can define another matrix X
Just like the matrix we had for linear regression
Take all the features for each movie and stack them in rows
 Think of each movie as one example
Also define a matrix Θ
 Take each per user parameter vector and stack in rows
Given our new matrices X and θ
 We have a vectorized way of computing the predicted ratings matrix: X * Θ^{T}
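A sketch of this vectorized prediction (the feature and parameter values are illustrative numbers in the spirit of the running example, not from the lecture):

```python
import numpy as np

X = np.array([[0.9, 0.0],      # one row per movie: [romance, action] features
              [1.0, 0.01],
              [0.99, 0.0],
              [0.1, 1.0],
              [0.0, 0.9]])
Theta = np.array([[5.0, 0.0],  # one row per user: preference weights
                  [5.0, 0.0],
                  [0.0, 5.0],
                  [0.0, 5.0]])

predictions = X @ Theta.T      # [5 x 4] matrix; entry (i, j) = (theta^j)^T x^i
```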
We can given this algorithm another name  low rank matrix factorization
 This comes from the property in linear algebra that the X * Θ^{T} calculation creates a low rank matrix
 Don't worry about what a low rank matrix is
Recommending new movies to a user
 Finally, having run the collaborative filtering algorithm, we can use the learned features to find related films
 When you learn a set of features you don't know in advance what the features will be - learning lets you identify the features which define a film
 Say we learn the following features
 x_{1}  romance
 x_{2}  action
 x_{3} - comedy
 x_{4}  ...
 So we have n features all together
 After you've learned features it's often very hard to come in and apply a human understandable metric to what those features are
 Usually we learn features which are very meaningful for understanding what users like
 Say you have movie i
 Find movies j which is similar to i, which you can recommend
 Our features allow a good way to measure movie similarity
 If we have two movies x^{i} and x^{j}
 We want to minimize ||x^{i} - x^{j}||
 i.e. the distance between those two movies' feature vectors
 Provides a good indicator of how similar two films are in the sense of user perception
 NB  Maybe ONLY in terms of user perception
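A small sketch of finding the nearest movies by that distance (the feature matrix is an illustrative assumption):

```python
import numpy as np

def most_similar(i, X, k=1):
    """Indices of the k movies whose learned feature vectors are nearest movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^i - x^j|| for every movie j
    dists[i] = np.inf                          # exclude the movie itself
    return np.argsort(dists)[:k]

# usage with hypothetical learned features (rows = movies)
X = np.array([[0.9, 0.0],
              [1.0, 0.01],
              [0.99, 0.0],
              [0.1, 1.0],
              [0.0, 0.9]])
```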
Implementational Detail: Mean Normalization
To show why we might need mean normalization let's consider an example where there's a user (Eve) who hasn't rated any movies
 Lets see what the algorithm does for this user
 Say n = 2
 We now have to learn θ^{5} (which is an ndimensional vector)
 Looking in the first term of the optimization objective
 There are no films for which r(i,j) = 1
 So this term plays no role in determining θ^{5}
 So we're just minimizing the final regularization term
Of course, if the goal is to minimize this term then θ^{5} = [0, 0]
 Why - if there's no data to pull the values away from 0, all-zeros minimizes the regularization term
So this means we predict ANY movie rating for this user to be zero
 Presumably Eve doesn't hate all movies...
 So if we're doing this we can't recommend any movies to her either
Mean normalization should let us fix this problem
How does mean normalization work?
Group all our ratings into matrix Y as before
We now have a column of ?s which corresponds to Eve's ratings
Now we compute the average rating each movie obtained and store it in an n_{m}-dimensional column vector μ
If we look at all the movie ratings in [Y] we can subtract off the mean rating
 Means we normalize each film to have an average rating of 0
Now, we take the new set of ratings and use it with the collaborative filtering algorithm
 Learn θ^{j} and x^{i} from the mean normalized ratings
For our prediction of user j on movie i, predict
(θ^{j})^{T} x^{i} + μ_{i}
 Where these vectors are the mean normalized values
 We have to add μ_{i} back because we subtracted it from the ratings when normalizing
So for user 5 the same argument applies, so
So on any movie i we're going to predict
 (θ^{5})^{T} x^{i} + μ_{i}
 Where (θ^{5})^{T} x^{i} = 0 (still)
 But we then add the mean (μ_{i}) which means Eve has the average rating assigned to each movie for her
This makes sense
 If Eve hasn't rated any films, predict the average rating of the films based on everyone
 This is the best we can do
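The mean normalization preprocessing can be sketched as below; the small ratings matrix follows the spirit of the running example (NaN marks unrated entries, which is my own encoding choice):

```python
import numpy as np

# Rows = movies, columns = Alice, Bob, Carol, Dave, Eve
# NaN marks a missing rating; Eve (last column) has rated nothing
Y = np.array([[5.0,    5.0,    0.0,    0.0,    np.nan],
              [5.0,    np.nan, np.nan, 0.0,    np.nan],
              [np.nan, 4.0,    0.0,    np.nan, np.nan],
              [0.0,    0.0,    5.0,    4.0,    np.nan],
              [0.0,    0.0,    5.0,    np.nan, np.nan]])

mu = np.nanmean(Y, axis=1)        # per-movie mean over the ratings that exist
Y_norm = Y - mu[:, None]          # train collaborative filtering on this instead

# Eve's theta is driven to zero by regularization, so (theta^5)^T x^i = 0 and her
# prediction for movie i is 0 + mu_i: the movie's average rating
eve_predictions = 0.0 + mu
```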
As an aside  we spoke here about mean normalization for users with no ratings
 If you have some movies with no ratings you can also play with versions of the algorithm where you normalize the columns
 BUT this is probably less relevant  probably shouldn't recommend an unrated movie
To summarize, this shows how you do mean normalization preprocessing to allow your system to deal with users who have not yet made any ratings
 Means we recommend the best average-rated products to a user we know little about