Source Code, Datasets
Automated Anomaly Detection in Large Sequences
Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas
Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, the approaches proposed so far in the literature have severe limitations: they either require prior domain knowledge to design the anomaly discovery algorithms, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this work, we address these problems and propose NorM, a novel approach suitable for domain-agnostic anomaly detection that is essentially parameter-free. NorM is based on a new data series primitive, which makes it possible to detect anomalies based on their (dis)similarity to a model that represents normal behavior. Even though this is a simple idea, it involves challenges in both the identification of normal behavior and scalability. Our work is the first to use this idea and address the above challenges in the context of subsequence anomaly detection. Experimental results on several real datasets demonstrate that the proposed approach correctly identifies all single and recurrent anomalies of various types, without any prior knowledge of the characteristics of these anomalies. Moreover, it outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy, while scaling gracefully to large datasets and being orders of magnitude faster.
Source Code
You may freely use this code for research purposes, provided that you properly acknowledge the authors using the following reference:
Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas. Automated Anomaly Detection in Large Sequences. IEEE International Conference on Data Engineering (ICDE), Dallas, TX, USA, April 2020.
- A zip file with the source code for all the algorithms used in the paper will be made available after the acceptance of the paper (email the authors for the password).
Synthetic Datasets
We use several synthetic datasets that contain sinusoid patterns at fixed frequency following a random walk trend (Figure 7). We then inject different numbers of anomalies, in the form of sinusoid waveforms with different phases and higher than normal frequencies, and add various levels of Gaussian noise on top. We refer to those datasets using the label SRW-[# of anomalies]-[% of noise]-[length of anomaly], and use them in order to test the performance of the algorithms under different, controlled conditions.
The zip file also includes the synthetic data generator.
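The construction described above can be illustrated with a minimal sketch. This is not the generator shipped in the zip file: the function name `generate_srw`, its parameters (`period`, `anomaly_freq_factor`, `trend_step`, `seed`), the interpretation of the noise percentage as a standard deviation relative to unit amplitude, the way anomaly positions are chosen, and the exact formatting of the label string are all assumptions made for illustration.

```python
import math
import random


def generate_srw(n_points, n_anomalies, noise_pct, anomaly_len,
                 period=50, anomaly_freq_factor=3.0, trend_step=0.05, seed=0):
    """Hypothetical SRW-style generator: returns (series, anomaly_starts, label)."""
    rng = random.Random(seed)

    # Random-walk trend: cumulative sum of small Gaussian steps.
    trend = [0.0]
    for _ in range(n_points - 1):
        trend.append(trend[-1] + rng.gauss(0.0, trend_step))

    # Normal behavior: a sinusoid at a fixed frequency.
    base_freq = 2.0 * math.pi / period
    wave = [math.sin(base_freq * t) for t in range(n_points)]

    # Anomalies: sinusoids with a random phase and a higher frequency.
    # Assumption: one anomaly per equal-sized slot, so anomalies never overlap.
    starts = []
    if n_anomalies > 0:
        slot = n_points // n_anomalies
        if slot < anomaly_len:
            raise ValueError("series too short for the requested anomalies")
        for k in range(n_anomalies):
            start = k * slot + rng.randrange(slot - anomaly_len + 1)
            phase = rng.uniform(0.0, 2.0 * math.pi)
            for i in range(anomaly_len):
                wave[start + i] = math.sin(anomaly_freq_factor * base_freq * i + phase)
            starts.append(start)

    # Gaussian noise on top (assumption: noise_pct % of the unit amplitude).
    noise_sd = noise_pct / 100.0
    series = [trend[t] + wave[t] + rng.gauss(0.0, noise_sd)
              for t in range(n_points)]

    # Label following the SRW-[# of anomalies]-[% of noise]-[length of anomaly] scheme.
    label = f"SRW-{n_anomalies}-{noise_pct}-{anomaly_len}"
    return series, starts, label


series, starts, label = generate_srw(n_points=2000, n_anomalies=4,
                                     noise_pct=5, anomaly_len=100)
print(label, starts)
```

Fixing the seed makes a given configuration reproducible, which is convenient when comparing algorithms under the same controlled conditions.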
Real Datasets
The real datasets we use in our study are the following:
- Simulated engine disk data (SED), collected by the Rotary Dynamics Laboratory at NASA. This data series represents disk revolutions recorded over several runs at a speed of 3K rpm.
- MIT-BIH Supraventricular Arrhythmia Database (MBA), which contains electrocardiogram recordings from 5 different patients, with multiple instances of two different kinds of anomalies.
- Five additional real datasets from various domains that have been studied in earlier works, and whose anomalies are simple discords (usually only 1 per dataset):
  - aerospace engineering: Space Shuttle Marotta Valve
  - gesture recognition: Ann’s Gun dataset
  - medicine: a patient’s respiration (measured by thorax extension) and ECG recordings qtb/sel102
  - electrical consumption study: Dutch Power Consumption data
Two non-annotated datasets:
- The NASA Bearing dataset, which consists of individual files, each a 1-second vibration signal snapshot of bearings installed on a shaft.
- The New York City Taxi and Limousine Commission dataset (NTC), which corresponds to the number of New York City taxi passengers, recorded every 30 minutes from July 2014 to January 2015.