Anomaly detection algorithms (also known as outlier detection algorithms) are gaining popularity in the data mining world. Why? Simply because they catch the data points that are unusual for a given dataset.
Many techniques (machine learning methods, time series analysis, neural networks, and supervised and unsupervised outlier detection algorithms, among others) play a vital role in big data management and data science for detecting fraud and other abnormal events.
On this page:
- What is anomaly detection? Definition and types of anomalies.
- 5 top anomaly detection algorithms.
- List of other outlier detection techniques.
- Comparison chart – infographic in PDF
What Is Anomaly Detection?
Anomaly detection is a method used to detect something that doesn’t fit the normal behavior of a dataset. In other words, anomaly detection finds data points in a dataset that deviate from the rest of the data.
Those unusual things are called outliers, peculiarities, exceptions, surprises, and so on.
Let’s say you have a savings account and you usually withdraw about $5,000. However, one day $20,000 is withdrawn from the account.
This is very unusual activity, since withdrawals are usually around $5,000. The transaction is abnormal for the bank. It is an outlier.
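As a rough illustration of the bank example, the sketch below flags withdrawals that lie far from the typical amount, using the median and the median absolute deviation (MAD). The function name `flag_point_anomalies`, the threshold of 3.5, and the sample amounts are assumptions made for this illustration only.

```python
from statistics import median

def flag_point_anomalies(values, threshold=3.5):
    """Return the values whose robust z-score exceeds the threshold."""
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half of the values equal the median; flag anything different.
        return [v for v in values if v != center]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical withdrawal history: mostly around 5000, with one large amount.
withdrawals = [5000, 4800, 5200, 5100, 4900, 5000, 20000]
print(flag_point_anomalies(withdrawals))  # [20000]
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value inflates the latter and can hide itself.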
Different Types of Anomalies:
- Point anomalies – If a data point is too far from the rest, it falls into the category of point anomalies. The bank transaction example above illustrates a point anomaly.
- Contextual anomalies – If an event is anomalous only in specific circumstances (a context), it is a contextual anomaly. As data becomes more and more complex, it is vital to use anomaly detection methods that take context into account. This anomaly type is common in time-series data. Example: spending $10 on ice cream every day during the hot months is normal, but odd for the rest of the year.
- Collective anomalies – A collective anomaly is a group of data points that is anomalous with respect to the whole dataset, even though the individual points are not. Example: a broken rhythm in an ECG (electrocardiogram).
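The ice-cream example of a contextual anomaly can be sketched as follows. The idea is to compare a value only against past values from the same context (here, the season). The function name `is_contextual_anomaly`, the cutoff `k=3.0`, and the spending figures are hypothetical and chosen for illustration.

```python
from statistics import mean, stdev

def is_contextual_anomaly(history, context, value, k=3.0):
    """Check a value against past values seen in the same context only."""
    past = history[context]
    mu, sigma = mean(past), stdev(past)
    # Guard against a zero spread when all past values are equal.
    return abs(value - mu) > k * max(sigma, 1e-9)

# Hypothetical daily ice-cream spending in dollars, grouped by season.
history = {
    "summer": [9, 10, 11, 10, 10, 9, 11],
    "winter": [0, 1, 0, 0, 1, 0, 1],
}

print(is_contextual_anomaly(history, "summer", 10))  # False: normal in summer
print(is_contextual_anomaly(history, "winter", 10))  # True: unusual in winter
```

The same $10 value is judged differently depending on the context it appears in, which is what separates a contextual anomaly from a point anomaly.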
Anomaly Detection Algorithms
Outliers and irregularities in data can usually be detected by different data mining algorithms, for example algorithms for clustering, classification, or association rule learning.
Generally, algorithms fall into two key categories: supervised and unsupervised learning. Supervised learning is the more common type. It includes algorithms such as logistic and linear regression, support vector machines, and multi-class classification.
It is called supervised learning because the data scientist acts as a teacher who shows the algorithm what conclusions it should reach. The data scientist supervises the learning process.
Supervised methods (also called classification methods) require a training set that includes both normal and anomalous examples to construct a predictive model.
Unsupervised learning, on the other hand, rests on the idea that a computer can learn to discover complicated patterns and outliers without a human providing guidance.
Let’s look at some of the most popular anomaly detection algorithms.
1. K-nearest neighbor: k-NN
k-NN is one of the simplest supervised learning algorithms in machine learning. It stores all of the available examples and then classifies new ones based on their similarity to the stored ones, measured with a distance metric.
k-NN is a famous classification algorithm and a lazy learner. What does lazy learner mean?
k-NN simply stores the labeled training data. It doesn’t do anything else during the training process. That’s why it is called lazy.
When new unlabeled data arrives, k-NN works in 2 main steps:
- Looks at the k closest training data points (the k-nearest neighbors).
- Then it uses those k nearest neighbors, typically by majority vote, to decide how the new data should be classified.
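The two steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `knn_classify`, the use of Euclidean distance with majority voting, and the sample points and labels are all assumptions.

```python
from collections import Counter
from math import dist

def knn_classify(training, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors.

    `training` is a list of (point, label) pairs; "training" k-NN is
    nothing more than keeping this list around.
    """
    # Step 1: find the k stored points closest to the query.
    neighbors = sorted(training, key=lambda pair: dist(pair[0], query))[:k]
    # Step 2: let those neighbors vote on the label.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples with two numeric features each.
training = [
    ((1.0, 1.0), "normal"), ((1.2, 0.9), "normal"), ((0.8, 1.1), "normal"),
    ((5.0, 5.0), "fraud"), ((5.2, 4.8), "fraud"), ((4.9, 5.1), "fraud"),
]
print(knn_classify(training, (1.1, 1.0)))  # normal
print(knn_classify(training, (5.1, 5.0)))  # fraud
```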
How does k-NN know what’s closer?
It uses a distance metric. For continuous data (see continuous vs discrete data), the most common distance measure is the Euclidean distance. For discrete data, the Hamming distance is a popular choice; for example, it measures the “closeness” of 2 text strings of equal length as the number of positions at which they differ.
The choice of distance metric depends on the data.
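For reference, both distance measures mentioned above fit in a few lines. The function names `euclidean` and `hamming` are just illustrative choices.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two numeric vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(hamming("karolin", "kathrin"))  # 3
```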
The k-NN algorithm works very well in dynamic environments where frequent updates are needed, because adding new examples requires no retraining. In addition, distance-based measures are a good way to identify unusual conditions and gradual trends: a point whose nearest neighbors are all far away is a candidate outlier. This makes k-NN useful for outlier detection and for flagging suspicious events.
k-NN is also a very good technique for building models that involve non-standard data types like text.
k-NN is one of the proven anomaly detection algorithms that increase the fraud detection rate. It is also one of the best-known text mining algorithms.
It has many applications in business and finance. For example, k-NN helps detect and prevent fraudulent credit card transactions.
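One simple way to turn the nearest-neighbor idea into an outlier score, shown here as a sketch under stated assumptions, is to use each point's distance to its k-th nearest neighbor: the larger the distance, the more isolated the point. The function name `knn_outlier_scores` and the transaction features are hypothetical.

```python
from math import dist

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical transaction features, assumed to be scaled to similar ranges.
transactions = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 0.8), (6.0, 5.5)]
scores = knn_outlier_scores(transactions)
most_unusual = transactions[scores.index(max(scores))]
print(most_unusual)  # (6.0, 5.5)
```

Unlike the classification use of k-NN, this scoring needs no labels, so it can be applied when no confirmed fraud examples are available.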
2. Local Outlier Factor (LOF)
LOF is a key anomaly detection algorithm based on the concept of local density. It uses the distances to an item’s k nearest neighbors to estimate that density.
LOF compares the local density of an item to the local densities of its neighbors. Thus one can determine areas of similar density and items that have a significantly lower density than their neighbors. These are the outliers.
In other words, the density around an outlier is considerably lower than the density around its neighbors.
That is why LOF is called a density-based outlier detection algorithm. In addition, as you can see, LOF is a nearest-neighbor technique, like k-NN.
The LOF score of an item is the average ratio of the local reachability density of each of its k nearest neighbors to the item’s own local reachability density. A score close to 1 means the item is about as dense as its neighbors; a score well above 1 indicates an outlier.
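The LOF computation can be sketched in plain Python as below. This is a simplified illustration: it takes exactly k neighbors per point (ties at the k-th distance are cut off), it assumes there are no groups of duplicate points that would make a density infinite, and the function name `local_outlier_factors` and the sample points are made up for the example.

```python
from math import dist

def local_outlier_factors(points, k=3):
    """Compute a LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))[:k]
        neighbors.append(order)
        k_distance.append(dist(points[i], points[order[-1]]))

    def reach_dist(i, j):
        # Reachability distance of point i with respect to its neighbor j.
        return max(k_distance[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: average ratio of the neighbors' densities to the point's own.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Five points in a tight group and one far away from it.
points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1), (0.9, 1.0), (5.0, 5.0)]
lof = local_outlier_factors(points)
print(points[lof.index(max(lof))])  # (5.0, 5.0)
```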
3. K-means
K-means is a very popular clustering algorithm in data mining. It creates k groups from a set of items so that the items within a group are more similar to each other than to items in other groups.
Recall that clustering algorithms are designed to form groups whose members are similar. In this context, clusters and groups are synonymous.
In the k-means technique, data items are clustered according to feature similarity.
One of the greatest benefits of k-means is that it is very easy to implement. Implementations are available in most of the programming languages commonly used in data science.
If you are going to use k-means for anomaly detection, you should take a few things into account:
- The user has to define the number of clusters in advance.
- k-means assumes that each cluster has a roughly equal number of observations.
- k-means only works with numerical data.
Is k-means supervised or unsupervised? Most data science specialists classify it as unsupervised. The reason is that, apart from being given the number of clusters, k-means “learns” the clusters on its own. Semi-supervised variants of k-means also exist.
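One common way to use k-means for anomaly detection (an assumption here, since the text does not spell out the procedure) is to cluster the data and then score each point by its distance to the centroid of its own cluster. The sketch below uses a plain k-means loop; the function name `kmeans`, the choice of the first k points as initial centroids, and the sample data are all illustrative.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain k-means; returns (centroids, cluster index per point)."""
    # Assumption for this sketch: the first k points are the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels

# Two tight groups plus one point that belongs to neither.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.2),
          (8.2, 7.9), (7.9, 8.1), (8.1, 8.2), (4.5, 12.0)]
centroids, labels = kmeans(points, k=2)

# Anomaly score: distance from each point to its own cluster centroid.
distances = [dist(p, centroids[label]) for p, label in zip(points, labels)]
suspect = points[distances.index(max(distances))]
print(suspect)  # (4.5, 12.0)
```

Note that the outlier also pulls its cluster's centroid toward itself, so in practice the scores depend on the initialization and on how many outliers there are.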
4. Support Vector Machine (SVM)
A support vector machine (SVM) is also one of the most effective anomaly detection algorithms. SVM is a supervised machine learning technique mostly used in classification problems.
It uses a hyperplane to classify data into 2 different groups.
Recall that a hyperplane is a flat boundary described by a linear equation; in two dimensions it is simply a line (e.g. y = nx + b).
SVM determines the best hyperplane that separates the data into 2 classes.
In other words, given labeled training data, the algorithm produces an optimal hyperplane that categorizes new examples.
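Once a hyperplane is known, classifying a new example only requires checking which side of it the example falls on. The sketch below shows that prediction step alone; it does not train an SVM. The weights and bias are hypothetical values standing in for the result of training, and the function names are illustrative.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(weights, bias, x):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1

# Hypothetical hyperplane x1 + x2 - 5 = 0, as if already found by training.
weights, bias = (1.0, 1.0), -5.0
print(classify(weights, bias, (4.0, 3.0)))  # 1
print(classify(weights, bias, (1.0, 2.0)))  # -1
```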
When it comes to anomaly detection, a variant known as one-class SVM learns a region that encloses the normal data. Then, for each test example, it flags as abnormal anything that falls outside the learned region.
5. Neural Networks Based Anomaly Detection
When it comes to modern anomaly detection algorithms, we should start with neural networks.
Artificial neural networks are popular algorithms originally designed to mimic biological neurons.
The primary goal of creating a system of artificial neurons is to get systems that can be trained to learn data patterns and perform tasks like classification, regression, and prediction.
Building a recurrent neural network that discovers anomalies in time series data is a hot topic in the data mining world today.
What makes recurrent networks very helpful for anomaly detection in time series is their ability to capture dependencies across multiple time steps.
There are many different types of neural networks, and they include both supervised and unsupervised learning algorithms. An example of how neural networks can be used for anomaly detection can be seen here.
The above 5 anomaly detection algorithms are the key ones. However, there are other techniques. Here is a more comprehensive list of techniques and algorithms.
Nearest-neighbor based algorithms:
- k-NN
- Local Outlier Factor (LOF)
- Connectivity-based Outlier Factor (COF)
- Local Outlier Probability (LoOP)
- Influenced Outlierness (INFLO)
- Local Correlation Integral (LOCI)
Clustering based algorithms:
- Cluster based Local Outlier Factor (CBLOF)
- Local Density Cluster based Outlier Factor (LDCOF)
Statistics based techniques:
- Parametric techniques
- Non-parametric techniques
Classification based techniques:
- Decision Tree
- Neural Networks
- Bayesian Networks
- Rule-based
The following comparison chart represents the advantages and disadvantages of the top anomaly detection algorithms. Download it here in PDF format.
Conclusion
Although there is a rising interest in anomaly detection algorithms, applications of outlier detection are still limited to areas like bank fraud, finance, health and medical diagnosis, and errors in text.
However, in our growing data mining world, anomaly detection is likely to play a crucial role in monitoring and predictive maintenance.
Data scientists and machine learning engineers all over the world put a lot of effort into analyzing data and using various kinds of techniques that make data less vulnerable and more secure.
Silvia Vylcheva has more than 10 years of experience in the digital marketing world – which gave her a wide business acumen and the ability to identify and understand different customer needs.
Silvia has a passion and knowledge in different business and marketing areas such as inbound methodology, data intelligence, competition research and more.