Outlier detection has been used for many decades to identify points that are considered “abnormal,” or that do not fit a particular pattern. Because of its highly practical nature, outlier detection is applied in many domains. The most famous examples include the detection of (financial) fraud and the detection of ‘malicious’ chatter by intelligence agencies. Because outlier detection algorithms are useful to almost any organization, this article explores common outlier detection techniques and their application to Big Data environments.
Outlier detection is an umbrella term for a broad spectrum of techniques. Over the years, much similar terminology has arisen, such as novelty detection, anomaly detection, noise detection, deviation detection and exception mining. Although the focus of each term might differ slightly, many of the underlying techniques are fundamentally the same. In line with the terminology used in the Enterprise Big Data Framework, we will use Grubbs’ definition of an outlier, which was first formulated in 1969:
An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.
As can be seen, there is an important (and at the same time confusing) element in this definition: an outlier “appears” to be different from other members of the data set. This means that outlier detection techniques can detect potential outliers, but that the final interpretation of the results remains a human exercise. This is as true today as it was in 1969.
There is, however, a significant difference between Grubbs’ time and today. The volume and velocity with which data is created and needs to be processed is larger than ever before. At any moment in time, millions of social media posts, messages, transactions and videos are created. As a result, outlier detection techniques need the capability to process data in near real-time and to surface potential outliers immediately, because in many cases the information that outlier detection in Big Data provides is time-sensitive. The examples discussed earlier – credit card fraud detection and malicious chatter – are strong examples of the time-sensitive nature of outlier detection results.
It is therefore no exaggeration to claim that Big Data has changed the game of outlier detection. At the same time, Big Data has opened up a world of opportunities to get (more) value from outlier detection techniques, because the larger the data set becomes, the more value outlier detection might bring. As an analogy, consider the (famous) task of finding a needle in a haystack: the larger the haystack becomes, the more valuable an algorithm that finds the needle becomes.
As the value of outlier detection keeps increasing with the amount of data that is created, more and more use cases of outlier detection in Big Data have started to arise. Some interesting (and money-making) examples are provided in the overview below:
- Fraud detection – the detection of fraudulent credit card transactions, insurance claims, expense reports or financial information.
- Intrusion detection – the detection of unauthorized access attempts to computer networks or systems.
- Activity monitoring – the detection of (malicious) phone calls, messages and other forms of chatter that provide intelligence about people with bad intentions.
- Quality control – the detection of production defects or product characteristics that do not meet the same standards as the other products that a company manufactures.
- Image analysis – the detection of changed imagery. This can be applied to medical scans to detect certain types of diseases, or to satellite imagery to detect abnormal or changed patterns.
- Pharmaceutical research – the detection of medical or patient outliers that can advance pharmaceutical or medical research.
The list above is by no means intended as an exhaustive list of use cases of outlier detection, but merely as an illustration of how Big Data has significantly changed the spectrum. For all the examples above, the data sources have grown significantly in recent years. As stated before, the number of transactions, messages and images created every day is larger than ever before.
So what are the best outlier detection techniques to use on Big Data? And are there any recommended algorithms that are best suited to particular outlier detection use cases? Unfortunately, there is no straightforward answer to these questions. The speed and accuracy of each algorithm depend on the source data that needs to be analysed, as well as on the homogeneity of that data.
However, this does not mean that there is no answer at all. Previous research has found that outlier detection algorithms can broadly be divided into four key categories. Each category can correspondingly be matched to the source data of the underlying problem.
1. Statistical Models
Statistical outlier detection algorithms form one of the earliest, and still one of the most widely used, families of outlier detection techniques. In most cases, statistical probability (or, better formulated, statistical improbability) forms the basis for detecting an outlier. In its simplest form, a statistical model works on the premise that a certain observation is so unlikely that it must be an outlier. This is, for example, the technique that is applied to flag outliers in box plots.
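The box-plot rule mentioned above can be sketched in a few lines. This is a minimal illustration, assuming one-dimensional numeric data: points beyond 1.5 times the interquartile range (IQR) outside the quartiles are flagged as potential outliers.

```python
def iqr_outliers(values, k=1.5):
    """Return the values falling outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # simple quartile estimate via linear interpolation
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 95, 10, 13, 12]
print(iqr_outliers(data))  # prints [95]
```

Note that the flagged value is only a *potential* outlier – as the Grubbs definition implies, deciding whether 95 is an error or a genuinely interesting observation remains a human exercise.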
2. Proximity Based Techniques
Proximity-based techniques are popular for bivariate and multivariate data. They are relatively simple to implement and work by measuring the distance between data points. Within the proximity-based techniques, k-Nearest Neighbours (k-NN) is by far the most widely used because of the simplicity of the underlying calculation. There is, however, an important drawback to using proximity-based techniques for outlier detection: the runtime complexity grows rapidly (quadratically, for a naive implementation) with the size of the data. As a result, proximity-based techniques will require (too) long processing times for large quantities of Big Data.
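A minimal sketch of the k-NN idea, under the common convention that a point's outlier score is its distance to its k-th nearest neighbour (the higher the score, the more isolated the point). The nested loop makes the quadratic runtime drawback visible: every point is compared against every other point.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour.

    The pairwise comparison below is O(n^2) in the number of points,
    which is exactly why naive proximity-based methods struggle on Big Data.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbour
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
# the isolated point (10, 10) receives the largest score
print(max(range(len(points)), key=lambda i: scores[i]))  # prints 4
```

Production systems mitigate the quadratic cost with spatial index structures (e.g. k-d trees) or approximate nearest-neighbour search, but the scoring principle stays the same.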
3. Parametric Models
Parametric models for outlier detection overcome the runtime complexity problems that proximity-based models impose. For that reason, they are extremely suitable for dealing with Big Data. Popular parametric models are Minimum Volume Ellipsoid (MVE) estimation and Convex Peeling. In a parametric model, the runtime complexity grows only with the model size, not with the size of the data. The only difficulty is that – by the definition of parametric models – the parameters need to be estimated. The user of the model should therefore know beforehand that the data will fit the model. As a result, parametric outlier detection models are highly suitable and accurate for specific problems, but they cannot be used for generic purposes where the properties of the input data set are unknown.
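To make the parametric idea concrete, the sketch below fits a Gaussian model (mean and covariance) to 2-D data and scores each point by its squared Mahalanobis distance. This is deliberately *not* MVE itself – MVE uses robust estimates of location and scatter – but it illustrates the key property claimed above: once the handful of parameters is estimated, scoring any single point costs constant time, independent of how large the data set is.

```python
def fit_gaussian(points):
    """Estimate mean and covariance of 2-D points (the model parameters)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of one point: O(1), regardless of data size."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    # apply the inverse of the 2x2 covariance to the deviation vector
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

points = [(1, 1), (2, 1), (1, 2), (2, 2), (1.5, 1.5), (9, 9)]
mean, cov = fit_gaussian(points)
scores = [mahalanobis_sq(p, mean, cov) for p in points]
print(max(range(len(points)), key=lambda i: scores[i]))  # prints 5
```

The caveat from the text also shows up here: the Gaussian assumption must actually fit the data, otherwise the distances are meaningless.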
4. Neural Networks
Neural network techniques for outlier detection introduce more modern technology to the identification and selection of outliers. Neural networks are non-parametric and model-based, and for that reason they are well suited to being applied to “unseen” data patterns. In that sense, they can be used to address more generic cases in which the structure of the input data is unknown. Within the neural network outlier detection techniques, it is possible to use supervised techniques (classification) as well as unsupervised techniques (clustering). More information on Neural Networks is discussed in this article.
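One widely used unsupervised variant of this idea is the autoencoder: a network is trained to reconstruct its input, and points it reconstructs poorly are flagged as potential outliers. The sketch below is a toy illustration only – a tiny linear autoencoder (2 inputs, 1 hidden unit, 2 outputs) trained with plain gradient descent, with fixed small initial weights and no biases for brevity; real systems use deeper networks and random initialization.

```python
def train_autoencoder(points, epochs=500, lr=0.01):
    """Train a 2 -> 1 -> 2 linear autoencoder with stochastic gradient descent."""
    w = [0.3, 0.3]  # encoder weights (fixed small init, for brevity)
    v = [0.3, 0.3]  # decoder weights
    for _ in range(epochs):
        for x in points:
            h = w[0] * x[0] + w[1] * x[1]            # encode to 1 hidden unit
            e = (v[0] * h - x[0], v[1] * h - x[1])   # reconstruction error
            gh = e[0] * v[0] + e[1] * v[1]           # backpropagate to hidden unit
            v = [v[0] - lr * e[0] * h, v[1] - lr * e[1] * h]
            w = [w[0] - lr * gh * x[0], w[1] - lr * gh * x[1]]
    return w, v

def reconstruction_error(x, w, v):
    """Squared reconstruction error of one point: the outlier score."""
    h = w[0] * x[0] + w[1] * x[1]
    return (v[0] * h - x[0]) ** 2 + (v[1] * h - x[1]) ** 2

# most points follow the pattern y = x; one point breaks it
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (3, -3)]
w, v = train_autoencoder(points)
errors = [reconstruction_error(p, w, v) for p in points]
print(max(range(len(points)), key=lambda i: errors[i]))  # prints 5: the point (3, -3)
```

Because the network has never been told what an outlier looks like, this is an unsupervised technique: it learns the dominant pattern from the data itself and flags whatever does not conform to it.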
The overview above gives an indication of the outlier detection techniques most used in Big Data. It is fair to say that most outlier detection techniques still in use today are based on the statistical, distance-based and parametric approaches. However, as volume and velocity keep increasing, neural network approaches, or hybrid forms that include neural networks, will form the basis for future outlier detection solutions.
Hodge, V. and Austin, J., 2004. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), pp.85-126.
Grubbs, F.E., 1969. Procedures for detecting outlying observations in samples. Technometrics, 11(1), pp.1-21.
Van Aelst, S. and Rousseeuw, P., 2009. Minimum volume ellipsoid. Wiley Interdisciplinary Reviews: Computational Statistics, 1(1), pp.71-82.