Source Code, Datasets, and Comparative Results
Beyond One Billion Time Series:
Indexing and Mining Very Large Time Series Collections with iSAX2+
Alessandro Camerra, Jin Shieh, Themis Palpanas, Thanawin Rakthanmanon, Eamonn Keogh
There is an increasingly pressing need, by several applications in diverse domains, for techniques able to index and mine very large collections of time series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve numbers of time series on the order of hundreds of millions to billions. However, none of the relevant techniques proposed in the literature so far has considered data collections much larger than one million time series.
In this paper, we describe iSAX 2.0 and its improvements, iSAX 2.0 Clustered and iSAX2+, three methods designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of its kind specifically tailored to a time series index.
We show how our methods allow mining of datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, as well as experiments in mining massive data from domains as diverse as entomology, DNA, and web-scale image collections.
You may freely use this code for research purposes, provided that you properly acknowledge the authors using the following reference:
Alessandro Camerra, Jin Shieh, Themis Palpanas, Thanawin Rakthanmanon, Eamonn Keogh. Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+. Knowledge and Information Systems (KAIS) Journal 39(1):123-151, 2014
- Zip file with source code for all the algorithms used in the paper. Note that you have to unzip the "isax-References.zip" file so that the resulting "References" directory is at the same level as the "SaxIndex" directory (for all three versions of the algorithm).
We produced a set of synthetic datasets, ranging in size from 100 million to 1 billion time series, composed of random walks of length 256. Each data point in a time series is produced as xi+1 = xi + N(0,1), where N(0,1) denotes a draw from the standard normal distribution.
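The random-walk generation above can be sketched as follows. This is an illustrative snippet, not taken from the paper's code; the function name and seeding interface are our own.

```python
import random

def random_walk(length=256, seed=None):
    """Generate one random-walk time series of the given length.

    Each step follows x_{i+1} = x_i + N(0, 1), i.e. a standard
    normal increment is added to the previous value.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    x = 0.0
    series = []
    for _ in range(length):
        x += rng.gauss(0.0, 1.0)  # draw from N(0, 1) and accumulate
        series.append(x)
    return series

# One series of length 256, as used for the synthetic datasets.
series = random_walk(256, seed=42)
```

Generating a full dataset then amounts to repeating this with different seeds (or a single stream) until the desired number of series is reached.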
We used three real datasets from diverse domains. The entomology and DNA datasets are contained in this zip file.
The web images dataset is available from:
A. Torralba, R. Fergus, W. T. Freeman. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 30(11):1958-1970, 2008.
The detailed results for all the experiments we report in the paper are listed in this spreadsheet.