Source Code, Datasets, and Comparative
Experimental Results
Frequent Items in Streaming Data:
An Experimental Evaluation of the State-of-the-Art
(doi:10.1016/j.datak.2008.11.001)
Nishad Manerikar and Themis Palpanas
The problem of detecting frequent items in streaming data is relevant to many different applications across
many domains. Several algorithms, diverse in nature, have been proposed in the literature for the solution
of the above problem. In this paper, we review these algorithms, and we present the results of the first
extensive comparative experimental study of the most prominent algorithms in the literature. The algorithms
were comprehensively tested using a common test framework on several real and synthetic datasets. Their
performance with respect to the different parameters (i.e., parameters intrinsic to the algorithms, and data
related parameters) was studied. We report the results, and insights gained through these experiments.
Journal Publication
Source Code
You may freely use this code for research purposes, provided that you
acknowledge the authors with the following reference:
Nishad Manerikar, Themis Palpanas. Frequent Items in Streaming Data:
An Experimental Evaluation of the State-of-the-Art.
Data and Knowledge Engineering (DKE) 68(4), 2009: 415-430.
@ARTICLE{dbTrentoFrequentItemsDKE,
AUTHOR={Nishad Manerikar and Themis Palpanas},
TITLE={{Frequent Items in Streaming Data: An
Experimental Evaluation of the State-of-the-Art}},
JOURNAL={Data Knowl. Eng. (DKE)},
VOLUME={68},
NUMBER={4},
PAGES={415-430},
YEAR=2009}
- Zip file with source code for all the algorithms used in the paper.
Synthetic Datasets
The synthetic datasets were generated according to a Zipfian
distribution. Dataset size, N, ranged from 10,000 to 100,000,000 items;
item domain cardinality, M, from 65,000 to 1,000,000; and the Zipf
parameter, Z, from 0.6 to 3.5. The parameters used in each run are
stated explicitly in the discussion of each experiment. For each
particular choice of these data parameters we generated several
independent datasets, and we repeated each experiment on all of them.
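As a rough illustration of what such a dataset looks like, the sketch below draws N items from a domain of M items, where the item of rank i has probability proportional to 1/i^Z. This is not the generator used for the paper, whose details are not given here. It is a minimal sketch under stated assumptions: the names zipf_stream and top_items are hypothetical, items are represented as the integers 1..M, and the whole dataset is held in memory, which is only practical toward the lower end of the N range.

```python
import random
from collections import Counter
from itertools import accumulate


def zipf_stream(n, m, z, seed=None):
    """Return a list of n items drawn from {1, ..., m}.

    Item i is drawn with probability proportional to 1 / i**z, i.e. a
    Zipfian distribution truncated to a finite domain (an assumption of
    this sketch). Truncation keeps the distribution well defined for
    z <= 1, which the stated range 0.6-3.5 requires.
    """
    if n < 0 or m < 1 or z < 0:
        raise ValueError("need n >= 0, m >= 1, z >= 0")
    rng = random.Random(seed)  # a fixed seed makes a dataset reproducible
    # Cumulative unnormalised weights; random.choices normalises internally.
    cum_weights = list(accumulate(1.0 / rank ** z for rank in range(1, m + 1)))
    return rng.choices(range(1, m + 1), cum_weights=cum_weights, k=n)


def top_items(stream, k):
    """Return the k most frequent items with their exact counts."""
    return Counter(stream).most_common(k)


if __name__ == "__main__":
    # Several independent datasets for one parameter choice, using a
    # different (hypothetical) seed for each.
    for seed in (1, 2, 3):
        data = zipf_stream(n=10_000, m=65_000, z=1.1, seed=seed)
        print(seed, top_items(data, 3))
```

Using a distinct seed per dataset mirrors the idea of generating several independent datasets for the same (N, M, Z) setting, while keeping each individual dataset reproducible.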
Real Datasets
If you use these datasets, please include the proper acknowledgements
and references, such as those appearing in our technical report.
- Zip file with dataset Kosarak.
- Zip file with dataset Retail.
- Zip file with dataset Q148.
- Zip file with dataset Nasa.