Abstract

Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, the approaches proposed so far in the literature have severe limitations: they either require prior domain knowledge, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this work, we address these problems and propose NormA, a novel approach suitable for domain-agnostic anomaly detection. NormA is based on a new data series primitive, which makes it possible to detect anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach correctly identifies all single and recurrent anomalies of various types, with no prior knowledge of the characteristics of these anomalies (except for their length). Moreover, it outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy, while being orders of magnitude faster.

P. Boniol, M. Linardi, F. Roncallo, T. Palpanas, M. Meftah, E. Remy, Unsupervised and Scalable Subsequence Anomaly Detection in Large Data Series, VLDBJ (2021)

P. Boniol, M. Linardi, F. Roncallo, T. Palpanas, Automated Anomaly Detection in Large Sequences, IEEE ICDE (2020)

P. Boniol, M. Linardi, F. Roncallo, T. Palpanas, SAD: An Unsupervised System for Subsequence Anomaly Detection, IEEE ICDE (2020)

Real Datasets

The real datasets we use in our study are the following:

Simulated engine disk data (SED), collected by the Rotary Dynamics Laboratory at NASA. This data series represents disk revolutions recorded over several runs (at a speed of 3K rpm).

MIT-BIH Supraventricular Arrhythmia Database (MBA), which consists of electrocardiogram recordings from 5 different patients and contains multiple instances of two different kinds of anomalies.

Five additional real datasets from various domains that have been studied in earlier works:

aerospace engineering: Space Shuttle Marotta Valve
gesture recognition: Ann’s Gun dataset
medicine: Patient’s respiration (measured by thorax extension) and ECG recordings qtb/sel102
electrical consumption study: Dutch Power Consumption data

Synthetic Datasets

We use several synthetic datasets that contain sinusoid patterns at a fixed frequency following a random walk trend (Figure 7). We then inject different numbers of anomalies, in the form of sinusoid waveforms with different phases and higher-than-normal frequencies, and add various levels of Gaussian noise on top. We refer to these datasets using the label SRW-[# of anomalies]-[% of noise]-[length of anomaly], and use them to test the performance of the algorithms under different, controlled conditions.
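The construction described above can be illustrated with a minimal sketch. This is not the generator used in the study: the function names (make_srw, srw_label), the default parameter values, the random walk step size, and the reading of "% of noise" as a Gaussian standard deviation relative to the unit sinusoid amplitude are all assumptions made for illustration.

```python
import math
import random


def srw_label(n_anomalies, noise_pct, anomaly_len):
    """Build a label of the form SRW-[# of anomalies]-[% of noise]-[length of anomaly]."""
    return f"SRW-{n_anomalies}-{noise_pct:g}-{anomaly_len}"


def make_srw(length=10000, period=100, n_anomalies=5, anomaly_len=200,
             noise_pct=5.0, freq_factor=3.0, trend_step=0.01, seed=0):
    """Return (series, anomaly_starts) for one hypothetical SRW-style series.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)

    # Normal behavior: a sinusoid at a fixed frequency (amplitude 1).
    series = [math.sin(2 * math.pi * t / period) for t in range(length)]

    # Inject anomalies: sinusoid segments with a random phase and a
    # higher-than-normal frequency, one per equal-sized slot so that
    # injected segments never overlap.
    starts = []
    if n_anomalies > 0:
        slot = length // n_anomalies
        if slot <= anomaly_len:
            raise ValueError("series too short for the requested anomalies")
        for k in range(n_anomalies):
            start = k * slot + rng.randrange(0, slot - anomaly_len)
            phase = rng.uniform(0, 2 * math.pi)
            for i in range(anomaly_len):
                angle = 2 * math.pi * freq_factor * i / period + phase
                series[start + i] = math.sin(angle)
            starts.append(start)

    # Add a random walk trend and Gaussian noise on top.
    # Assumption: noise_pct is the noise standard deviation as a
    # percentage of the sinusoid amplitude.
    sigma = noise_pct / 100.0
    trend = 0.0
    for t in range(length):
        trend += rng.gauss(0, trend_step)
        series[t] += trend + rng.gauss(0, sigma)

    return series, starts
```

For example, make_srw(n_anomalies=5, noise_pct=5, anomaly_len=200) would produce a series that this sketch labels srw_label(5, 5, 200). The returned start positions can serve as ground truth when evaluating a detector on the generated series.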

Source Code

The algorithms are protected by patent. The code is protected by copyright and provided as is. Email the authors for the password of the ZIP file. Academic users may use this code only for academic research purposes, provided that the authors are properly acknowledged by referencing the papers mentioned above. Industry users may test and evaluate this code by requesting a license.