Source Code, Datasets
ParIS+: Data Series Indexing on Multi-Core Architectures
Botao Peng, Panagiota Fatourou, Themis Palpanas
Data series similarity search is a core operation for several data series analysis applications across many different domains. Nevertheless, even state-of-the-art techniques cannot provide the time performance required for large data series collections. We propose ParIS and ParIS+, the first disk-based data series indices carefully designed to inherently take advantage of multi-core architectures, in order to accelerate similarity search processing times. Our experiments demonstrate that ParIS+ completely removes the CPU latency during index construction for disk-resident data, and for exact query answering is up to one order of magnitude faster than the current state-of-the-art index scan method, and up to three orders of magnitude faster than the optimized serial scan method. ParIS+ (which is an evolution of the ADS+ index) owes its efficiency to the effective use of multi-core and multi-socket architectures, in order to distribute and execute in parallel both index construction and query answering, and to the exploitation of the Single Instruction Multiple Data (SIMD) capabilities of modern CPUs, in order to further parallelize the execution of instructions inside each core.
You may freely use this code for research purposes, provided that you properly acknowledge the authors using the following reference:
Botao Peng, Panagiota Fatourou, Themis Palpanas. ParIS+: Data Series Indexing on Multi-Core Architectures. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2020.
- A zip file with the source code for all the algorithms used in the paper will be made available after the acceptance of the paper (email the authors for the password).
We produced a set of synthetic datasets with sizes from 50 million to 250 million data series, each a random walk of length 256. Each data point in a series is produced as x_{i+1} = N(x_i, 1), that is, it is drawn from a normal distribution with mean x_i and variance 1; N(0,1) denotes the standard normal distribution.
The synthetic data generator code is included in the source code we are making available.
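For readers who want to see the random-walk recurrence before obtaining the archive, the following is a minimal sketch of such a generator. It is not the authors' generator. The function names (`random_walk`, `generate_dataset`, `write_dataset`), the starting value of 0, the seed handling, and the raw 32-bit float output format are all assumptions made for illustration.

```python
import random
from array import array


def random_walk(length, rng):
    """Return one series where each point is the previous point plus an N(0, 1) step."""
    series = array("f")  # 32-bit floats; the storage format is an assumption
    x = 0.0              # starting value is an assumption; the page does not state it
    for _ in range(length):
        x += rng.gauss(0.0, 1.0)  # equivalent to drawing x_{i+1} from N(x_i, 1)
        series.append(x)
    return series


def generate_dataset(num_series, length=256, seed=0):
    """Generate num_series random walks in memory (small collections only)."""
    rng = random.Random(seed)
    return [random_walk(length, rng) for _ in range(num_series)]


def write_dataset(path, num_series, length=256, seed=0):
    """Stream random walks to a raw binary file, one series after another."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        for _ in range(num_series):
            random_walk(length, rng).tofile(f)
```

For example, `write_dataset("synthetic.bin", 1000)` would write 1,000 series of length 256. Streaming one series at a time keeps memory use constant, which matters at the collection sizes described above; a pure-Python loop like this would be slow at that scale and is meant only to make the recurrence concrete.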
Our method was tested on two real datasets.
- For our first real dataset, Seismic, we used the IRIS Seismic Data Access repository to gather data series representing seismic waves from various locations.
We obtained 100 million data series of size 256.
The complete dataset size was 110 GB.
- The second real dataset, SALD, includes neuroscience MRI data.
The dataset comprises 200 million data series of size 128.
The complete dataset size was 100 GB.