Source Code, Datasets
ParIS: The Next Destination for Fast Data Series Indexing and Query Answering
Botao Peng, Panagiota Fatourou, Themis Palpanas
We propose ParIS, the first disk-based data series index that inherently takes advantage of
modern hardware parallelization, in order to accelerate processing times.
Our experimental results demonstrate that ParIS
completely removes the CPU latency during index construction
for disk-resident data. In terms of exact query answering,
ParIS is more than 2 orders of magnitude faster
than the current state-of-the-art index scan method,
and more than 3 orders of magnitude faster
than the optimized serial scan method.
ParIS owes its efficiency not only to the effective use of multi-core
and multi-socket architectures,
in order to distribute and execute in parallel both index construction and query answering,
but also to the exploitation of the Single Instruction Multiple Data (SIMD) capabilities
of modern CPUs, in order to further parallelize the execution of individual instructions inside each core.
You may freely use this code for research purposes, provided that you properly acknowledge the authors using the following reference:
Botao Peng, Panagiota Fatourou, Themis Palpanas. ParIS: The Next Destination for Fast Data Series Indexing and Query Answering. IEEE Big Data 2018.
- Zip file with source code for all the algorithms used in the paper will be made available after the acceptance of the paper (email the authors for the password).
We produced a set of synthetic datasets with sizes from 50 million to 250 million data series, each a random walk of length 256. Each data point in a series is drawn as x_{i+1} = N(x_i, 1), that is, the previous point plus a step drawn from the standard normal distribution N(0,1).
The synthetic data generator code is included in the source code we are making available.
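For readers who want to experiment before obtaining the source code, the random-walk recurrence above can be sketched in a few lines of Python. This is only an illustration, not the generator shipped with the paper: the function names `random_walk` and `write_dataset`, the starting value of 0, the fixed seed, and the little-endian 32-bit float output layout are all assumptions made for this example.

```python
import random
import struct


def random_walk(length, rng):
    """Return one series where each point is drawn as x_{i+1} = N(x_i, 1).

    Assumption: the walk starts from 0, so the first point is drawn from N(0, 1).
    """
    series = []
    x = 0.0
    for _ in range(length):
        # Next point: normal distribution centred on the previous point, std 1.
        x = rng.gauss(x, 1.0)
        series.append(x)
    return series


def write_dataset(path, num_series, length=256, seed=0):
    """Write num_series random walks back to back as 32-bit floats.

    Assumption: a flat binary file of little-endian float32 values; the file
    format expected by the released code may differ.
    """
    rng = random.Random(seed)
    fmt = "<%df" % length
    with open(path, "wb") as f:
        for _ in range(num_series):
            f.write(struct.pack(fmt, *random_walk(length, rng)))
```

Calling `write_dataset("synthetic.bin", num_series=1000)` would produce 1000 series of length 256. Under the float32 assumption each series occupies 1024 bytes, so dataset size grows linearly with the number of series.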
Our method was tested on two real datasets.
- For our first real dataset, Seismic, we used the IRIS Seismic Data Access repository to gather data series representing seismic waves from various locations.
We obtained 100 million data series of size 256.
The complete dataset size was 110 GB.
- The second real dataset, SALD, includes neuroscience MRI data.
The dataset comprises 200 million data series of size 128.
The complete dataset size was 100 GB.