The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art
Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim
Data series are one of the most common types of data. The growing omnipresence of this type of data, thanks in part to the burgeoning IoT market, is now creating a data series deluge, which poses serious challenges and exciting opportunities. The main challenge is that conventional techniques, in particular data mining techniques, used by domain experts to exploit data series simply fail to scale up. Addressing this challenge can spur new business opportunities and lay the ground for new research breakthroughs. The conventional data mining techniques for data series rely heavily on similarity search. Several approaches for similarity search have been proposed in the literature, but none provides a detailed evaluation against competitors. The lack of clear comparative results is further exacerbated by the non-standardized usage of terminology (with different terms being used for the same meaning and a single term being used with different meanings), which has led to confusion and misconceptions. In this paper, we explore the area of data series similarity search, provide definitions for the different flavors that have been studied in the literature, and present a comprehensive taxonomy of the corresponding methods. Moreover, we present the first systematic unbiased experimental evaluation of data series similarity search techniques. We conclude that none of the methods is the overall winner. Based on the experimental results, we describe the strengths and weaknesses of each approach and give recommendations for the best approach to use under typical use cases. Our findings lay the ground for solid further developments in the field of data series similarity search.
You may also be interested in our study on approximate similarity search.
You may use this code as is for research purposes, provided that you properly acknowledge the authors using the following reference: Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim. The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art. PVLDB 2019.
- Repository with the source code for all the algorithms used in the paper.
We produced a set of synthetic datasets with sizes from 50 million to 250 million data series, each a random walk of length 256. Each data point in a series is produced as x_{i+1} = N(x_i, 1), that is, the previous point plus a step drawn from the standard normal distribution N(0, 1) (we used seed 1184 for the datasets, and seed 14784 for the queries). The synthetic data generator code is included in the source code we are making available.
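The random-walk construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the generator shipped with the source code: the function names `random_walk` and `generate_dataset` are hypothetical, the starting point is assumed to be drawn from N(0, 1), and Python's random number generator will not reproduce the exact series of the original datasets even with the same seeds.

```python
import random
from itertools import accumulate


def random_walk(length, rng):
    """Return one series where x_{i+1} = x_i + step, step ~ N(0, 1).

    Assumption: the first point is itself a draw from N(0, 1).
    """
    steps = (rng.gauss(0.0, 1.0) for _ in range(length))
    # The running sum of the steps is the random walk.
    return list(accumulate(steps))


def generate_dataset(num_series, length=256, seed=1184):
    """Generate num_series random walks from a single seeded generator."""
    rng = random.Random(seed)
    return [random_walk(length, rng) for _ in range(num_series)]


# Tiny example: 5 data series and 2 queries, using the seeds quoted above.
dataset = generate_dataset(num_series=5, seed=1184)
queries = generate_dataset(num_series=2, seed=14784)
```

Using distinct seeds for data and queries, as in the text, keeps the query workload independent of the indexed series while leaving both reproducible.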
We used the following four real datasets.
1. Astro: represents the long-term variability of celestial objects at hard X-rays (astrophysics). The dataset comprised 100 million data series of size 256. The complete dataset size was 100 GB.
2. Seismic: we used the IRIS Seismic Data Access repository to gather data series representing seismic waves from various locations (seismology). We obtained 100 million data series of size 256. The complete dataset size was 100 GB.
3. SALD: includes brain MRI data (neuroscience). The dataset comprised 200 million data series of size 128. The complete dataset size was 100 GB.
4. Deep1B: contains vectors extracted from the last layers of a convolutional neural network (computer vision). The dataset comprised 267 million vectors of size 96. The complete dataset size was 100 GB.
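As a quick sanity check on the sizes quoted above, the following sketch recomputes each dataset's raw size from its series count and length. It assumes each value is stored as a 4-byte single-precision float, which the text does not state; the helper name `size_gb` is hypothetical.

```python
# Assumption: each value is a 4-byte float; the text does not state this.
BYTES_PER_VALUE = 4

# (number of series, series length) as quoted in the dataset descriptions.
datasets = {
    "Astro": (100_000_000, 256),
    "Seismic": (100_000_000, 256),
    "SALD": (200_000_000, 128),
    "Deep1B": (267_000_000, 96),
}


def size_gb(num_series, length, bytes_per_value=BYTES_PER_VALUE):
    """Raw size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_series * length * bytes_per_value / 1e9


sizes = {name: size_gb(n, length) for name, (n, length) in datasets.items()}
for name, gb in sizes.items():
    print(f"{name}: {gb:.1f} GB")
```

Under that assumption, all four datasets come out at roughly 100 GB, which matches the stated sizes and explains why the series counts differ across datasets with different series lengths.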
We provide the full set of experimental results.
All rights reserved, 2018.