Anomaly Detection with Gaussian Distribution

What is anomaly detection

In data mining, anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. - Wiki

Anomaly detection is, at its core, the task of finding anomalous data points in a dataset. This article is mainly about anomaly detection using the Gaussian distribution.

First, we build a model using normal data, then use this model to calculate the probability density (e.g. under a normal distribution) of the data points that we want to test. If this density is less than a certain threshold $\varepsilon$, we judge the point anomalous. See the following picture:

Besides the examples in the above quote, anomaly detection is also used in fields such as fraud detection, quality assurance (manufacturing), and data center monitoring.

For fraud detection, the features can be how often a user logs in, the number of web pages visited, the number of posts the user makes on a forum, the user's typing speed, and so on.

How to do anomaly detection

Gaussian(Normal) probability density (distribution)

We can say that the curve is the probability distribution $p(x;\mu ,{ \sigma }^{ 2 })$, parameterized by the mean $\mu$ and the variance ${ \sigma }^{ 2 }$.
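For reference, the density behind that curve is the standard univariate Gaussian:

$$p(x;\mu ,{ \sigma }^{ 2 })=\frac { 1 }{ \sqrt { 2\pi } \,\sigma } \exp\left( -\frac { { (x-\mu ) }^{ 2 } }{ 2{ \sigma }^{ 2 } } \right)$$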

And given a dataset, we can estimate these two parameters.
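Given $m$ examples ${ x }^{ (1) },\dots ,{ x }^{ (m) }$, the maximum likelihood estimates are:

$$\mu =\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { x }^{ (i) } } ,\qquad { \sigma }^{ 2 }=\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { ({ x }^{ (i) }-\mu ) }^{ 2 } }$$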

Anomaly detection based on Gaussian distribution

We need to model $p(x)$ from a given dataset. Estimating the distribution $p(x)$ is also called density estimation.
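Concretely, the model assumes the density factorizes over the $n$ features:

$$p(x)=\prod _{ j=1 }^{ n }{ p({ x }_{ j };{ \mu }_{ j },{ \sigma }_{ j }^{ 2 }) } =\prod _{ j=1 }^{ n }{ \frac { 1 }{ \sqrt { 2\pi } \,{ \sigma }_{ j } } \exp\left( -\frac { { ({ x }_{ j }-{ \mu }_{ j }) }^{ 2 } }{ 2{ \sigma }_{ j }^{ 2 } } \right) }$$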

The above equation is based on the assumption that the features ${x}_{1}$ through ${x}_{n}$ are independent. But in practice, the algorithm works well even when the independence assumption does not hold.

The whole algorithm can be summarized as follows:
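As a minimal sketch (not the article's original code; the function names are my own), the algorithm can be written in NumPy as follows: fit the per-feature mean and variance on normal data, compute the product density for new points, and flag points whose density falls below $\varepsilon$:

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from an (m, n) matrix of normal examples."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # population variance (divide by m), matching the formulas above
    return mu, sigma2

def p(X, mu, sigma2):
    """Density of each example under the independent-features Gaussian model."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    probs = coef * np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return probs.prod(axis=1)  # product over the n features

def is_anomaly(X, mu, sigma2, epsilon):
    """Flag examples whose density falls below the threshold epsilon."""
    return p(X, mu, sigma2) < epsilon
```

The threshold $\varepsilon$ is typically tuned on a labeled cross-validation set rather than chosen by hand.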

To vectorize $\mu$:

$$\mu =\begin{bmatrix} { \mu }_{ 1 } \\ { \mu }_{ 2 } \\ \vdots \\ { \mu }_{ n } \end{bmatrix}=\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { x }^{ (i) } }$$


The training set should consist of all, or almost all, normal examples.

The above way of building and evaluating a model looks a lot like supervised learning. So why don't we just use a supervised learning algorithm? See the following picture:

Choosing what features to use

One thing that usually needs to be done is to plot the data or the histogram of the data to make sure that the data looks vaguely Gaussian before feeding it to the anomaly detection algorithm.

If you plot a histogram of the data and find it pretty non-Gaussian, you can transform the original data into new data using the logarithm, square root, cube root, etc.

Even if you skip the transformation, the algorithm usually still works reasonably well, but applying it often improves performance.
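As a sketch of such a transformation (the skewed feature here is synthetic, made up for illustration), `np.log1p` pulls in the long right tail and makes the histogram look much more Gaussian:

```python
import numpy as np

# Hypothetical right-skewed feature (e.g. session lengths, request sizes).
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

def skewness(v):
    """Simple sample skewness: the third standardized moment (0 for a symmetric bell)."""
    centered = v - v.mean()
    return (centered ** 3).mean() / v.std() ** 3

# log1p = log(1 + x); alternatives include np.sqrt(x) or np.cbrt(x).
x_transformed = np.log1p(x)
```

Plotting `plt.hist(x)` against `plt.hist(x_transformed)` is the usual eyeball check; the skewness number is just a quick numerical proxy for it.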

If the model with the present features cannot distinguish normal data from anomalies, we need to do error analysis and find new features that make it work well.

The way is to choose features that might take on unusually large or small values in the event of an anomaly (e.g. CPU load, network traffic, or even the quotient of the two).
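A toy illustration of this idea, with hypothetical monitoring values: neither raw feature is wildly extreme on its own, but their quotient is, which gives the anomaly detector something to latch onto.

```python
import numpy as np

# Hypothetical data-center measurements; the last machine is stuck in a busy loop.
cpu_load = np.array([0.5, 0.6, 0.55, 0.9])
network_traffic = np.array([1.0, 1.2, 1.1, 0.05])  # ...while serving almost no traffic

# New derived feature: high CPU with low traffic yields an unusually large value.
cpu_per_traffic = cpu_load / network_traffic
```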

Multivariate Gaussian (Normal) distribution

In the following situation, both $p({x}_{1})$ and $p({x}_{2})$ are high, so their product is also high, and we would conclude that the data point $x$ is normal. But actually it is not; the independence assumption is violated here.

So we need to use the multivariate Gaussian distribution. It is an extension of the earlier algorithm that lets us model correlations between features.

The covariance matrix measures the variance, or variability, of the different features.

In the middle graph below, the variance of feature 1 is reduced while the variance of feature 2 is kept the same.

It can show the correlations between different features.

Besides varying the parameter $\Sigma$, we can also vary the parameter $\mu$.

So that's all the probability distributions that the Multivariate Gaussian Distribution allows you to capture.

The estimation of the two parameters:
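In analogy with the univariate case, the maximum likelihood estimates are:

$$\mu =\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { x }^{ (i) } } ,\qquad \Sigma =\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ ({ x }^{ (i) }-\mu ){ ({ x }^{ (i) }-\mu ) }^{ T } }$$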

Is there a relationship between this algorithm and the earlier one? Yes. If $\Sigma$ is constrained to be diagonal, as in the following case, the two algorithms are the same.
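A quick numerical check of this equivalence (a sketch; the function names are mine): the multivariate density with a diagonal $\Sigma$ matches the product of univariate densities exactly.

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    """Density of each row of X under the multivariate normal N(mu, Sigma)."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / (np.power(2.0 * np.pi, n / 2.0) * np.sqrt(np.linalg.det(Sigma)))
    # Quadratic form diff^T Sigma^{-1} diff for every row at once.
    exponent = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    return norm * np.exp(exponent)

def product_model(X, mu, sigma2):
    """Original model: product of independent univariate Gaussians."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    return (coef * np.exp(-((X - mu) ** 2) / (2.0 * sigma2))).prod(axis=1)
```

With `Sigma = np.diag(sigma2)`, the two functions return identical values for any input, which is exactly the sense in which the original model is a special case.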

The comparison of the original model and the multivariate Gaussian:

For the multivariate Gaussian, the condition $m>n$ must be fulfilled because otherwise the covariance matrix $\Sigma$ is not invertible. As a rule of thumb, $m$ should be at least 10 times $n$. To satisfy the condition (i.e. to reduce $n$), we can usually eliminate redundant features (linearly dependent features, in the linear-algebra sense).