Data Series Progressive Similarity Search with Probabilistic Quality Guarantees

Anna Gogolou, Theophanis Tsandilas, Karima Echihabi, Anastasia Bezerianos, Themis Palpanas

Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results before the final, exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering. We provide both initial and progressive estimates of the final answer, which improve as the similarity search proceeds, as well as suitable stopping criteria for the progressive queries. Experiments with synthetic and diverse real datasets demonstrate that our prediction methods constitute the first practical solution to the problem, significantly outperforming competing approaches.

Source Code

You may freely use this code for research purposes, provided that you properly acknowledge the authors using the following reference:

Anna Gogolou, Theophanis Tsandilas, Karima Echihabi, Anastasia Bezerianos, Themis Palpanas. Data Series Progressive Similarity Search with Probabilistic Quality Guarantees. ACM SIG International conference on Management of Data / Principles of Database Systems (SIGMOD/PODS), 2020.

Repository with source code for all the algorithms used in the paper.

Synthetic Datasets

We produced a set of synthetic datasets with sizes from 50 million to 200 million data series, each composed of a random walk of length 256. Each data point in a series is produced as x_{i+1} = N(x_i, 1), that is, the previous point plus a step drawn from the standard normal distribution N(0, 1).
The synthetic data generator code is included in the source code we are making available.
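
The random-walk rule above can be illustrated with a minimal Python sketch. This is not the generator shipped with the source code. The function names (random_walk, generate_dataset), the seed parameter, and the choice to start each walk from 0 are assumptions made for this illustration only.

```python
import random


def random_walk(length=256, rng=None):
    """Return one random-walk series where x_{i+1} is drawn from N(x_i, 1)."""
    rng = rng or random.Random()
    series = []
    x = 0.0  # assumption: the walk starts from 0; the page does not state the start value
    for _ in range(length):
        # Drawing from N(x, 1) equals adding a standard normal step to x.
        x = rng.gauss(x, 1.0)
        series.append(x)
    return series


def generate_dataset(n_series, length=256, seed=0):
    """Return n_series independent random walks, reproducible for a given seed."""
    rng = random.Random(seed)
    return [random_walk(length, rng) for _ in range(n_series)]
```

For example, generate_dataset(1000) returns 1000 series of 256 points each. Producing tens of millions of series this way would call for a vectorized or compiled generator that streams to disk rather than this pure-Python loop.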

Real Datasets

Our method was tested on the following real datasets.

Seismic: we used the IRIS Seismic Data Access repository to gather data series representing seismic waves from various locations. We obtained 100 million data series of size 256. The complete dataset size was 100 GB.
SALD: includes neuroscience MRI data. The dataset comprised 200 million data series of size 128. The complete dataset size was 100 GB.
Deep1B: contains vectors extracted from the last layers of a convolutional neural network (computer vision). The dataset comprised 267 million vectors of size 96. The complete dataset size was 100 GB.
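
As a rough sanity check on the sizes listed above, the raw footprint of each dataset can be estimated from the number of series and their length. The sketch below assumes each value is stored as a 4-byte (32-bit) float, which this page does not state. The names BYTES_PER_VALUE and raw_size_gb are hypothetical.

```python
BYTES_PER_VALUE = 4  # assumption: 32-bit floats; the storage format is not stated on this page

# (number of series, series length) as listed for each real dataset
DATASETS = {
    "Seismic": (100_000_000, 256),
    "SALD": (200_000_000, 128),
    "Deep1B": (267_000_000, 96),
}


def raw_size_gb(n_series, length, bytes_per_value=BYTES_PER_VALUE):
    """Estimate the raw size in gigabytes (10^9 bytes) of a dataset."""
    return n_series * length * bytes_per_value / 1e9


if __name__ == "__main__":
    for name, (n_series, length) in DATASETS.items():
        print(f"{name}: about {raw_size_gb(n_series, length):.1f} GB")
```

Under that assumption, all three datasets come out slightly above 100 GB, which agrees with the rounded sizes reported above.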