# Machine Learning Week 9: Anomaly Detection

## Density Estimation

### Problem Motivation

• We have a dataset which contains normal data
• How we ensure they're normal is up to us
• In reality it's OK if there are a few which aren't actually normal
• Using that dataset as a reference point we can see if other examples are anomalous
• First, using our training dataset we build a model

• We can access this model using p(x)
• This asks, "What is the probability that example x is normal?"
• Having built a model

• if p(xtest) < ε --> flag this as an anomaly
• if p(xtest) >= ε --> this is OK
• ε is some threshold probability value which we define, depending on how sure we need/want to be
• We expect our model to (graphically) look something like this

### Gaussian Distribution

• Also called the normal distribution

• Example

• Say x (data set) is made up of real numbers

• Mean is μ
• Variance is σ²
• σ is also called the standard deviation - it specifies the width of the Gaussian probability curve
• The data has a Gaussian distribution
• Then we can write this as x ~ N(μ, σ²)

• ~ means "is distributed as"
• N (should really be a "script" N (even curlier!)) -> means normal distribution
• μ, σ² represent the mean and variance, respectively
• These are the two parameters a Gaussian needs
• Looks like this;

• Gaussian equation is

• p(x; μ, σ²) = (1/(√(2π) σ)) * exp(-(x - μ)²/(2σ²)) (probability of x, parameterized by the mean and variance)

• Some examples of Gaussians below

• Area is always the same (must = 1)
• But width changes as standard deviation changes

• Estimating μ and σ²

• μ = average of the examples

• μ = (1/m) Σ_{i=1..m} x(i)

• σ² = standard deviation squared

• σ² = (1/m) Σ_{i=1..m} (x(i) - μ)²

• As a side comment

• These parameters are the maximum likelihood estimates for μ and σ²

• You can use either 1/m or 1/(m-1) - it doesn't make too much difference

• Slightly different mathematical problems, but in practice it makes little difference
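These estimates and the Gaussian equation above are easy to sketch in code. A minimal NumPy version (NumPy is assumed; this is not part of the original notes):

```python
import numpy as np

def estimate_gaussian(x):
    # Maximum likelihood estimates: mu = (1/m) sum(x), sigma2 = (1/m) sum((x - mu)^2)
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # the 1/m version; ddof=1 would give 1/(m-1)
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    # p(x; mu, sigma^2) = (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2*sigma^2))
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```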

### Algorithm

• Unlabeled training set of m examples
• Data = {x1, x2, ..., xm }

• Each example is an n-dimensional vector (i.e. a feature vector)
• Model P(x) from the data set

• What are high probability features and low probability features
• x is a vector
• So model p(x) as
• p(x) = p(x1; μ1, σ1²) * p(x2; μ2, σ2²) * ... * p(xn; μn, σn²)
• Multiply together the probabilities of each individual feature
• We model each of the features by assuming each feature is distributed according to a Gaussian distribution
• p(xi; μi, σi²)
• The probability of feature xi given μi and σi², using a Gaussian distribution
• As a side comment

• Turns out this equation makes an independence assumption for the features, although the algorithm works whether or not the features are independent
• Don't worry too much about this, although if your features are tightly linked you should be able to do some dimensionality reduction anyway!
• We can write this chain of multiplication more compactly as follows: p(x) = Π_{j=1..n} p(xj; μj, σj²)

• The problem of estimating this distribution is sometimes called the problem of density estimation

• Algorithm

• 1 - Choose features
• Try to come up with features which might help identify something anomalous - may be unusually large or small values
• More generally, choose features which describe the general properties
• This is nothing unique to anomaly detection - it's just the idea of building a sensible feature vector
• 2 - Fit parameters
• Determine the parameters μj and σj² for each of your features (j = 1 to n)
• Fit is a bit misleading, really should just be "calculate parameters for 1 to n"
• So you're calculating standard deviation and mean for each feature
• You should of course use a vectorized implementation rather than a loop
• 3 - compute p(x)
• You compute the formula shown (i.e. the formula for the Gaussian probability)
• If the number is very small, very low chance of it being "normal"
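The three steps above can be sketched as follows (NumPy assumed; `epsilon` is a hypothetical threshold you would pick on a CV set, as described in the next section):

```python
import numpy as np

def fit_params(X):
    # Step 2: mean and variance for each of the n features (columns of the m x n matrix X)
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # 1/m version
    return mu, sigma2

def p(X, mu, sigma2):
    # Step 3: p(x) = product over all n features of the univariate Gaussian pdf
    densities = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

def is_anomaly(X, mu, sigma2, epsilon):
    # Flag an example as anomalous when p(x) < epsilon
    return p(X, mu, sigma2) < epsilon
```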

## Building an Anomaly Detection System

### Developing and Evaluating an Anomaly Detection System

• Algorithm evaluation
• Take training set { x1, x2, ..., xm }
• Fit model p(x)
• On cross validation and test set, test the example x
• y = 1 if p(x) < epsilon (anomalous)
• y = 0 if p(x) >= epsilon (normal)
• Think of the algorithm as trying to predict if something is anomalous
• But you have a label so can check!
• Makes it look like a supervised learning algorithm
• What's a good metric to use for evaluation
• y = 0 is very common
• So classification accuracy would be a bad metric (always predicting y = 0 scores highly)
• Compute the numbers of true positives/false positives/false negatives/true negatives
• Compute precision/recall
• Compute F1-score
• Earlier, also had epsilon (the threshold value)
• Threshold to show when something is anomalous
• If you have a CV set you can see how varying epsilon affects various evaluation metrics
• Then pick the value of epsilon which maximizes the score on your CV set
• Evaluate algorithm using cross validation
• Do final algorithm evaluation on the test set
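Picking epsilon on the CV set can be sketched as a scan over candidate thresholds, keeping the one that maximizes F1 (NumPy assumed; the 1000-point grid is an arbitrary illustrative choice):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    # y_cv uses the convention above: 1 = anomalous, 0 = normal
    # p_cv holds p(x) for each CV example
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)      # predict anomaly when p(x) < epsilon
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue                          # precision/recall degenerate here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

Having chosen epsilon this way, do the final evaluation on the untouched test set.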

### Anomaly Detection vs. Supervised Learning

#### Anomaly detection

• Very small number of positive examples
• Save positive examples just for CV and test set
• Consider using an anomaly detection algorithm
• Not enough data to "learn" positive examples
• Have a very large number of negative examples
• Use these negative examples for p(x) fitting
• Only need negative examples for this
• Many "types" of anomalies
• Hard for an algorithm to learn from positive examples when anomalies may look nothing like one another
• So anomaly detection doesn't know what they look like, but knows what they don't look like
• When we looked at SPAM email,
• Many types of SPAM
• For the spam problem, usually enough positive examples
• So this is why we usually think of SPAM as supervised learning
• Application and why they're anomaly detection
• Fraud detection
• Many ways you may do fraud
• If you're a major online retailer/very subject to attacks, sometimes might shift to supervised learning
• Manufacturing
• If you make HUGE volumes maybe have enough positive data -> make supervised
• Means you make an assumption about the kinds of errors you're going to see
• It's the unknown unknowns we don't like!
• Monitoring machines in a data center

#### Supervised learning

• Reasonably large number of positive and negative examples
• Have enough positive examples to give your algorithm the opportunity to see what they look like
• If you expect anomalies to look anomalous in the same way
• Application
• Email/SPAM classification
• Weather prediction
• Cancer classification

### Choosing What Features to Use

• Non-Gaussian features

• Plot a histogram of the data to check it has a Gaussian distribution - nice sanity check

• Often still works if data is non-Gaussian
• Non-Gaussian data might look like this

• Might take a log transformation of the data

• i.e. if you have some feature x1, replace it with log(x1)

• Or do log(x1+c)

• Play with c to make it look as Gaussian as possible
• Or do x1^(1/2) (i.e. the square root)

• Or do x1^(1/3) (i.e. the cube root)
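These candidate transformations can be compared quickly in code. A sketch on a synthetic right-skewed feature (the exponential data and c = 1 are purely illustrative choices, not from the notes):

```python
import numpy as np

def skewness(v):
    # Simple sample skewness; near 0 for roughly Gaussian-looking data
    return np.mean((v - v.mean()) ** 3) / v.std() ** 3

rng = np.random.default_rng(0)
x1 = rng.exponential(scale=2.0, size=5000)   # a heavily right-skewed feature

# Candidate transformations from the notes; c in log(x1 + c) is a knob to tune
transformed = {
    "log(x1 + 1)": np.log(x1 + 1.0),
    "x1^(1/2)": x1 ** 0.5,
    "x1^(1/3)": x1 ** (1.0 / 3.0),
}
```

Plotting a histogram of each transformed version (or comparing skewness) shows which one looks most Gaussian.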

#### Error analysis for anomaly detection

• Like supervised learning error analysis procedure
• Run algorithm on CV set
• See which examples it got wrong
• Develop new features based on trying to understand why the algorithm got those examples wrong
• Example

• We want p(x) to be large for normal examples and small for anomalous ones

## Multivariate Gaussian Distribution (Optional)

### Multivariate Gaussian Distribution

• Model p(x) all in one go, instead of each feature separately

• What are the parameters for this new model?
• μ - which is an n dimensional vector (where n is number of features)
• Σ - which is an [n x n] matrix - the covariance matrix
• For the sake of completeness, the formula for the multivariate Gaussian distribution is as follows

• p(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) * exp(-(1/2) (x - μ)ᵀ Σ⁻¹ (x - μ))

• |Σ| = absolute value of Σ (determinant of sigma)

### Anomaly detection algorithm with multivariate Gaussian distribution

1. Fit model - take data set and calculate μ and Σ using the formula above

2. We're next given a new example (xtest) - see below

• For it compute p(x) using the following formula for multivariate distribution

3. Compare the value with ε (threshold probability value)
• if p(xtest) < ε --> flag this as an anomaly
• if p(xtest) >= ε --> this is OK
• If we fit a multivariate Gaussian model to our data we build something like this

• Which means it's likely to identify the green value as anomalous
• Finally, we should mention how multivariate Gaussian relates to our original simple Gaussian model (where each feature is looked at individually)

• Original model corresponds to multivariate Gaussian where the Gaussians' contours are axis aligned

• i.e. the normal Gaussian model is a special case of multivariate Gaussian distribution

• This can be shown mathematically

• Has the constraint that the covariance matrix Σ has ZEROs on the non-diagonal values

• If you plug your variance values into the covariance matrix the models are actually identical
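This identity is easy to verify numerically: with a diagonal covariance matrix, the multivariate density equals the product of the per-feature univariate densities. A sketch (NumPy assumed; the μ, Σ values are made up for illustration):

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    # p(x; mu, Sigma) = exp(-0.5 (x - mu)^T Sigma^-1 (x - mu)) / ((2 pi)^(n/2) |Sigma|^(1/2))
    n = mu.shape[0]
    diff = X - mu
    exponent = -0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    normalizer = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(exponent) / normalizer

# Special case: diagonal Sigma (zeros off the diagonal) reproduces the original model
mu = np.array([0.0, 1.0])
Sigma = np.diag([1.0, 4.0])   # per-feature variances plugged onto the diagonal
x = np.array([[0.5, 2.0]])
var = np.diag(Sigma)
p_original = np.prod(np.exp(-((x[0] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var))
p_multi = multivariate_gaussian(x, mu, Sigma)[0]   # identical to p_original
```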

#### Original model vs. Multivariate Gaussian

Original Gaussian model

• Probably used more often

• There is a need to manually create features to capture anomalies where x1 and x2 take unusual combinations of values

• So need to make extra features
• Might not be obvious what they should be
• This is always a risk - where you're using your own expectation of a problem to "predict" future anomalies
• Typically, the things that catch you out aren't going to be the things you thought of
• If you thought of them they'd probably be avoided in the first place
• Obviously this is a bigger issue, and one which may or may not be relevant depending on your problem space
• Much cheaper computationally

• Scales much better to very large feature vectors

• Even if n = 100 000 the original model works fine
• Works well even with a small training set

• e.g. 50, 100
• Because of these factors it's used more often, because it really represents an optimized but axis-aligned specialization of the general model

Multivariate Gaussian model

• Used less frequently
• Can capture feature correlation
• So no need to create extra values
• Less computationally efficient
• Must compute inverse of matrix which is [n x n]
• So lots of features is bad - makes this calculation very expensive
• So if n = 100 000 not very good
• Needs m > n
• i.e. number of examples must be greater than number of features
• If this is not true then we have a singular matrix (non-invertible)
• So should only be used when m >> n
• If you find the matrix is non-invertible, could be for one of two main reasons
• m < n
• So use original simple model
• Redundant features (i.e. linearly dependent)
• i.e. two features that are the same
• If this is the case you could use PCA or sanity check your data

# Recommender Systems

## Predicting Movie Ratings

### Problem Formulation

• You're a company who sells movies
• You let users rate movies using a 1-5 star rating
• To make the example nicer, allow 0-5 (makes math easier)

• To introduce some notation
• nu - number of users
• nm - number of movies
• r(i, j) - 1 if user j has rated movie i (i.e. a bitmap)
• y(i,j) - rating given by user j to movie i (defined only if r(i,j) = 1)
• So for this example

• nu = 4
• nm = 5

• Summary of scoring
• Alice and Bob gave good ratings to rom coms, but low scores to action films
• Carol and Dave gave good ratings for action films but low ratings for rom coms
• We have the data given above
• The problem is as follows
• Given r(i,j) and y(i,j) - go through and try and predict missing values (?s)
• Come up with a learning algorithm that can fill in these missing values

### Content Based Recommendations

• For each movie we have features which measure the degree to which each film is a

• Romance (x1)

• Action (x2)

• If we have features like these, each film can be represented by a feature vector

• Add an extra feature which is x0 = 1 for each film

• So for each film we have a [3 x 1] vector, which for film number 1 ("Love at Last") would be

• i.e. for our dataset we have

• {x1, x2, x3, x4, x5}
• Where each of these is a [3x1] vector with an x0 = 1 and then a romance and an action score
• To be consistent with our notation, n is going to be the number of features NOT counting the x0 term, so n = 2

• We could treat each rating for each user as a separate linear regression problem

• For each user j we could learn a parameter vector

• Then predict that user j will rate movie i with

• (θj)T xi = stars
• inner product of parameter vector and features
• So, let's take user 1 (Alice) and see what she makes of the modern classic Cute Puppies of Love (CPOL)

• We have some parameter vector (θ1) associated with Alice

• We'll explain later how we derived these values, but for now just take it that we have a vector

• CPOL has a parameter vector (x3) associated with it

• Our prediction will be equal to

• (θ1)T x3 = (0 * 1) + (5 * 0.99) + (0 * 0) = 4.95
• Which may seem like a reasonable value
• All we're doing here is applying a linear regression method for each user

• So we determine a future rating based on their interest in romance and action based on previous films
• We should also add one final piece of notation

• mj - number of movies rated by user j
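As a quick sanity check, the prediction in the worked example above can be reproduced in a few lines (NumPy assumed; the θ1 and x3 values are recovered from the arithmetic shown above):

```python
import numpy as np

# Values from the worked example (x0 = 1 bias term included)
theta_1 = np.array([0.0, 5.0, 0.0])   # Alice's parameter vector
x_3 = np.array([1.0, 0.99, 0.0])      # CPOL's feature vector (x0, romance, action)

prediction = theta_1 @ x_3            # (theta^1)^T x^3 = 0*1 + 5*0.99 + 0*0 = 4.95
```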

#### How do we learn (θj)

• Create some parameters which give values as close as those seen in the data when applied

• Sum over all values of i where r(i,j) = 1 (i.e. all the films that the user has rated)

• This is just like linear regression with least-squared error

• We can also add a regularization term to make our equation look as follows

• The regularization term sums from k=1 through to n, so (θj) ends up being an n+1 dimensional vector
• Don't regularize over the bias term (θ0)
• If you do this you get a reasonable value
• We're rushing through this a bit, but it's just a linear regression problem

• To make this a little bit clearer you can get rid of the mj term (it's just a constant so shouldn't make any difference to minimization)

• So to learn (θj)

• But for our recommender system we want to learn parameters for all users, so we add an extra summation term to this which means we determine the minimum (θj) value for every user

• When you do this as a function of each (θj) parameter vector you get the parameters for each user

• So this is our optimization objective -> J(θ1, ..., θnu)
• In order to do the minimization we have the following gradient descent

• Slightly different to our previous gradient descent implementations

• k = 0 and k != 0 versions

• We can define the middle term above as

• Difference from linear regression

• No 1/m terms (got rid of the 1/m term)
• Otherwise very similar
• This approach is called content-based approach because we assume we have features regarding the content which will help us identify things that make them appealing to a user

• However, often such features are not available - next we discuss a non-contents based approach!
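The content-based objective and its gradient above can be sketched in vectorized form (NumPy assumed; Y is a hypothetical nm x nu ratings matrix, R the r(i,j) indicator, and the 1/m term is dropped as described):

```python
import numpy as np

def cost(Theta, X, Y, R, lam):
    # J(theta^1, ..., theta^nu): half the squared error over rated entries,
    # plus regularization over every theta_k except the bias term (k = 0)
    err = (X @ Theta.T - Y) * R                      # zero out unrated entries
    return 0.5 * np.sum(err ** 2) + lam / 2 * np.sum(Theta[:, 1:] ** 2)

def gradient(Theta, X, Y, R, lam):
    # The k = 0 column gets no regularization, matching the two update rules above
    err = (X @ Theta.T - Y) * R                      # shape (nm, nu)
    grad = err.T @ X                                 # shape (nu, n+1)
    grad[:, 1:] += lam * Theta[:, 1:]
    return grad
```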

## Collaborative Filtering

### Collaborative Filtering

• The collaborative filtering algorithm has a very interesting property - does feature learning

• i.e. it can learn for itself what features it needs to learn
• So - let's change the problem and pretend we have a data set where we don't know any of the features associated with the films

• Now let's make a different assumption

• We've polled each user and found out how much each user likes

• Romantic films
• Action films
• Which has generated the following parameter set

• Alice and Bob like romance but hate action

• Carol and Dave like action but hate romance

• This is a bit of a simplification in terms of the maths, but what we're really asking is

• What feature vector should x1 be so that
• (θ1)T x1 is about 5
• (θ2)T x1 is about 5
• (θ3)T x1 is about 0
• (θ4)T x1 is about 0
• From this we can guess that x1 may be

• Using that same approach we should then be able to determine the remaining feature vectors for the other films

#### Formalizing the collaborative filtering problem

• We can more formally describe the approach as follows

• Given (θ1, ..., θnu) (i.e. given the parameter vectors for each user's preferences)

• We must minimize an optimization function which tries to identify the best parameter vector associated with a film

• So we're summing over all the indices j for where we have data for movie i
• We're minimizing this squared error
• Like before, the above equation gives us a way to learn the features for one film

• We want to learn all the features for all the films - so we need an additional summation term

#### How does this work with the previous recommendation system

• Content based recommendation systems
• Saw that if we have a set of features for movie rating you can learn a user's preferences
• Now
• If you have your users' preferences you can therefore determine a film's features
• This is a bit of a chicken & egg problem
• What you can do is
• Randomly guess values for θ
• Then use collaborative filtering to generate x
• Then use content based recommendation to improve θ
• Use that to improve x
• And so on
• This actually works
• Causes your algorithm to converge on a reasonable set of parameters
• This is collaborative filtering
• We call it collaborative filtering because in this example the users are collaborating together to help the algorithm learn better features and help the system and the other users

### Collaborative filtering Algorithm

• Here we combine the ideas from before to build a collaborative filtering algorithm

• Our starting point is as follows

• If we're given the film's features we can use that to work out the users' preference

• If we're given the users' preferences we can use them to work out the film's features

• One thing you could do is

• Randomly initialize parameter
• Go back and forward
• But there's a more efficient algorithm which can solve θ and x simultaneously

• Define a new optimization objective which is a function of x and θ

• Understanding this optimization objective

• The squared error term is the same as the squared error term in the two individual objectives above

• So it's summing over every movie rated by every user
• Note the ":" means, "for which"
• Sum over all pairs (i,j) for which r(i,j) is equal to 1
• The regularization terms

• Are simply added to the end from the original two optimization functions
• This newly defined function has the property that

• If you held x constant and only solved θ then you solve the, "Given x, solve θ" objective above
• Similarly, if you held θ constant you could solve x
• In order to come up with just one optimization function we treat this function as a function of both film features x and user parameters θ

• Only difference between this in the back-and-forward approach is that we minimize with respect to both x and θ simultaneously
• When we're learning the features this way

• Previously had a convention that we have an x0 = 1 term
• When we're using this kind of approach we have no x0,
• So now our vectors (both x and θ) are n-dimensional (not n+1)
• We do this because we are now learning all the features so if the system needs a feature always = 1 then the algorithm can learn one

#### Algorithm Structure

1. Initialize θ1, ..., θnu and x1, ..., xnm to small random values
• A bit like neural networks - initialize all parameters to small random numbers
2. Minimize the cost function J(x1, ..., xnm, θ1, ..., θnu) using gradient descent

• We find that the update rules look like this

• Where the top term is the partial derivative of the cost function with respect to xk(i), while the bottom is the partial derivative of the cost function with respect to θk(j)
• So here we regularize EVERY parameter (there's no longer an x0 parameter) so there's no special case update rule
3. Having minimized the values, given a user (user j) with parameters θ and a movie (movie i) with learned features x, we predict a star rating of (θj)T xi

• This is the collaborative filtering algorithm, which should give pretty good predictions for how users like new movies
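The three steps above can be sketched end-to-end (NumPy assumed; the toy Y matrix, learning rate, λ, and iteration count are all illustrative choices):

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam):
    # Joint objective J(x^1..x^nm, theta^1..theta^nu); no x0 term,
    # so every parameter is regularized (no special-case update rule)
    err = (X @ Theta.T - Y) * R
    J = 0.5 * np.sum(err ** 2) + lam / 2 * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = err @ Theta + lam * X
    Theta_grad = err.T @ X + lam * Theta
    return J, X_grad, Theta_grad

# 1. Initialize to small random values, a bit like neural networks
rng = np.random.default_rng(1)
nm, nu, n = 5, 4, 2
X = 0.1 * rng.standard_normal((nm, n))
Theta = 0.1 * rng.standard_normal((nu, n))

# Toy ratings: first two users like the first three movies, and vice versa
Y = np.array([[5, 5, 0, 0], [5, 4, 0, 0], [4, 5, 0, 0],
              [0, 0, 5, 4], [0, 0, 4, 5]], dtype=float)
R = np.ones_like(Y)

# 2. Gradient descent, updating x and theta simultaneously
alpha = 0.01
for _ in range(500):
    J, X_grad, Theta_grad = cofi_cost_and_grads(X, Theta, Y, R, 0.1)
    X -= alpha * X_grad
    Theta -= alpha * Theta_grad

# 3. Predict (theta^j)^T x^i for every movie-user pair at once
predictions = X @ Theta.T
```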

## Low Rank Matrix Factorization

### Vectorization: Low Rank Matrix Factorization

• We start by working out another way of writing out our predictions

• So take all ratings by all users in our example above and group into a matrix Y

• 5 movies and 4 users gives a [5 x 4] matrix
• Given [Y] there's another way of writing out all the predicted ratings

• With this matrix of predictive ratings
• We determine the (i,j) entry (the predicted rating of movie i by user j) for EVERY movie-user pair
• We can define another matrix X

• Just like matrix we had for linear regression

• Take all the features for each movie and stack them in rows

• Think of each movie as one example
• Also define a matrix Θ

• Take each per user parameter vector and stack in rows
• Given our new matrices X and θ

• We can have a vectorized way of computing the prediction range matrix by doing X * θT
• We can given this algorithm another name - low rank matrix factorization

• This comes from the property that the X * θT calculation has a property in linear algebra that we create a low rank matrix
• Don't worry about what a low rank matrix is

#### Recommending new movies to a user

• Finally, having run the collaborative filtering algorithm, we can use the learned features to find related films
• When you learn a set of features you don't know what the features will be - lets you identify the features which define a film
• Say we learn the following features
• x1 - romance
• x2 - action
• x3- comedy
• x4 - ...
• So we have n features all together
• After you've learned features it's often very hard to come in and apply a human understandable metric to what those features are
• Usually you learn features which are very meaningful for understanding what users like
• Say you have movie i
• Find movies j which is similar to i, which you can recommend
• Our features allow a good way to measure movie similarity
• If we have two movies xi and xj
• We want to minimize ||xi - xj||
• i.e. the distance between those two movies
• Provides a good indicator of how similar two films are in the sense of user perception
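The similarity search above is just a nearest-neighbor lookup over the learned feature vectors. A sketch (NumPy assumed; the feature values are hypothetical):

```python
import numpy as np

def most_similar(X, i, k=1):
    # Rank movies by ||x_i - x_j||; small distance = similar in user-perception terms
    distances = np.linalg.norm(X - X[i], axis=1)
    distances[i] = np.inf                 # don't return the movie itself
    return np.argsort(distances)[:k]
```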

### Implementational Detail: Mean Normalization

• To show why we might need mean normalization let's consider an example where there's a user who hasn't rated any movies

• Let's see what the algorithm does for this user
• Say n = 2
• We now have to learn θ5 (which is an n-dimensional vector)
• Looking in the first term of the optimization objective
• There are no films for which r(i,j) = 1
• So this term plays no role in determining θ5
• So we're just minimizing the final regularization term

• Of course, if the goal is to minimize this term then θ5 = [0; 0]

• Why - if there's no data to pull the values away from 0, all zeros gives the minimum value
• So this means we predict a rating of zero for ANY movie

• Presumably Eve doesn't hate all movies...
• So if we're doing this we can't recommend any movies to her either
• Mean normalization should let us fix this problem

#### How does mean normalization work?

• Group all our ratings into matrix Y as before

• We now have a column of ?s which corresponds to Eve's ratings

• Now we compute the average rating each movie obtained and store it in an nm-dimensional column vector μ

• If we look at all the movie ratings in [Y] we can subtract off the mean rating

• Means we normalize each film to have an average rating of 0
• Now, we take the new set of ratings and use it with the collaborative filtering algorithm

• Learn θj and xi from the mean normalized ratings
• For our prediction of user j on movie i, predict

• (θj)T xi + μi

• Where these vectors are the mean normalized values
• We have to add back μ because we subtracted it from our Y values
• So for user 5 the same argument applies, so

• So on any movie i we're going to predict

• (θ5)T xi + μi
• Where (θ5)T xi = 0 (still)
• But we then add the mean (μi), which means Eve has the average rating assigned to each movie for her
• This makes sense

• If Eve hasn't rated any films, predict the average rating of the films based on everyone
• This is the best we can do
• As an aside - we spoke here about mean normalization for users with no ratings

• If you have some movies with no ratings you can also play with versions of the algorithm where you normalize the columns
• BUT this is probably less relevant - probably shouldn't recommend an unrated movie
• To summarize, this shows how you do mean normalization preprocessing to allow your system to deal with users who have not yet made any ratings

• Means we recommend the best average-rated products to a user we know little about
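The mean-normalization preprocessing above can be sketched as follows (NumPy assumed; averages are taken over rated entries only, using the r(i,j) indicator):

```python
import numpy as np

def mean_normalize(Y, R):
    # Per-movie mean over *rated* entries only; subtract it from the rated entries
    mu = np.sum(Y * R, axis=1) / np.sum(R, axis=1)
    Y_norm = (Y - mu[:, None]) * R
    return Y_norm, mu
```

After training on Y_norm, predict (θj)T xi + μi; for a user with no ratings, (θj)T xi collapses to 0 and the prediction falls back to the movie's average rating μi.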