Source Code, Datasets, and Comparative
Experimental Results
Identifying Streaming Frequent Items in Ad-Hoc Time Windows
(doi:10.1016/j.datak.2013.05.007)
Michele Dallachiesa and Themis Palpanas
The problem of frequent item discovery in streaming data has attracted a lot of attention, mainly because of its numerous applications in diverse domains, such as network traffic monitoring and e-business transactions analysis.
While the above problem has been studied extensively, and several techniques have been proposed for its solution, these approaches are geared towards the recent values in the stream. Nevertheless, in several situations the users would like to be able to query about the item frequencies in ad hoc windows in the stream history, and compare these values among themselves.
In this paper, we address the problem of finding frequent items in ad hoc windows in a data stream given a small bounded memory, and present novel algorithms to this direction. We propose basic sketch- and count-based algorithms that extend the functionality of existing approaches by monitoring item frequencies in the stream. Subsequently, we present an improved version of the algorithm with significantly better performance (in terms of accuracy, at no extra memory cost). Moreover, we propose an efficient non-linear model to better estimate the frequencies within the query windows.
Finally, we conduct an extensive experimental evaluation with synthetic and real datasets, which demonstrates the merits of the proposed solutions and provides guidelines for the practitioners in the field.
Journal Publication
Source Code
You may freely use this code for research purposes, provided that you
acknowledge the authors with the following reference:
Michele Dallachiesa, Themis Palpanas. Identifying Streaming Frequent Items in Ad-Hoc Time Windows. Data and Knowledge Engineering (DKE) 87, 2013: 66-90.
@ARTICLE{AdHocFrequentItemsDKE,
AUTHOR={Michele Dallachiesa and Themis Palpanas},
TITLE={{Identifying Streaming Frequent Items in Ad-Hoc Time Windows}},
JOURNAL={Data Knowl. Eng. (DKE)},
VOLUME={87},
PAGES={66-90},
YEAR=2013}
- Zip file with source code for all
the
algorithms used in the paper.
Synthetic Datasets
The synthetic datasets were generated according to a Zipfian
distribution. We generated datasets with the size, N, ranging between
10,000-100,000,000 items, item domain cardinality, M, 65,000-1,000,000,
and Zipf parameter, Z, 0.6-3.5. The parameters used in each run are
explicitly mentioned in the discussion of each experiment. We should
note that we generated several independent datasets for each particular
choice of the data parameters mentioned above, and repeated each
experiment for all these datasets.
Real Datasets
If you also use these datasets, include the proper acknowledgements
and references, as those appearing in our technical report.
- Zip file with dataset Kosarak.
- Zip file with dataset Retail.
- Zip file with dataset Q148.
- Zip file with dataset Nasa.