Elpis: Graph-Based Similarity Search for Scalable Data Science

Ilias Azizi, Karima Echihabi, Themis Palpanas

Abstract

The recent popularity of learned embeddings has fueled the growth of massive collections of high-dimensional (high-d) vectors that model complex data, including images, text, tables, graphs and data series in a variety of scientific and business domains. Finding similar vectors in these collections is at the core of many important and practical data science applications, such as information retrieval, clustering, recommendation, malware detection and data cleaning. The data series community has developed tree-based similarity search techniques that outperform state-of-the-art methods on large collections of both data series and generic high-d vectors in all scenarios except no-guarantees (ng) approximate search, where graph-based approaches designed by the high-d vector community achieve the best performance. However, building graph-based indexes is extremely expensive in both time and space. In this paper, we bring these two worlds together, study the corresponding solutions and their performance behavior, and propose Elpis, a new strong baseline that takes advantage of the best features of both to achieve superior performance for in-memory indexing and ng-approximate search. Elpis builds the index 3x-8x faster than competitors, using 40% less memory. It also achieves a high recall of 0.99, up to 2x faster than the state-of-the-art methods in most scenarios, and answers 1-NN queries up to one order of magnitude faster. We demonstrate the scalability of Elpis on large real datasets from different domains, including computer vision, deep-network embeddings, neuroscience, and seismology, and share key insights useful for further developments and data science applications, supported by a thorough experimental evaluation.
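To make the ng-approximate search setting concrete, the sketch below illustrates the general technique that graph-based methods rely on: a greedy best-first traversal of a proximity graph that returns an approximate nearest neighbor without any quality guarantees. This is a minimal illustration under assumed inputs (a precomputed adjacency list, a fixed entry point, and a beam_width parameter); it is not Elpis's actual algorithm or code.

import heapq
import numpy as np

def greedy_search(vectors, graph, query, entry_point, beam_width=16):
    # Approximate 1-NN search by best-first traversal of a proximity graph.
    # vectors:     (n, d) array of database vectors
    # graph:       adjacency list, graph[i] = ids of the out-neighbors of node i
    # entry_point: id of the node where the traversal starts
    # beam_width:  candidate beam size (larger -> higher recall, slower queries)
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))

    visited = {entry_point}
    # candidates: min-heap of (distance, id) still to expand.
    # best: max-heap (distances negated) holding the beam_width closest vectors seen so far.
    candidates = [(dist(entry_point), entry_point)]
    best = [(-candidates[0][0], entry_point)]

    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop when the closest unexpanded candidate is already farther
        # than the worst vector kept in the beam.
        if len(best) == beam_width and d > -best[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < beam_width or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > beam_width:
                    heapq.heappop(best)

    # Return the id of the closest vector found; no exactness guarantee.
    return min(best, key=lambda item: -item[0])[1]

In this family of methods, recall and query time are traded off through the beam width and the graph's out-degree; the graph-based indexes studied in the paper differ mainly in how the proximity graph is built, which is where their high construction cost in time and space comes from.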

Ilias Azizi, Karima Echihabi, Themis Palpanas. Elpis: Graph-Based Similarity Search for Scalable Data Science. Proceedings of the VLDB Endowment (PVLDB), 16(6), 2023.

Supplementary Material

We report here additional experimental results that complement the evaluation presented in our paper.

Source Code and Data

Here are the source code and data used in the experimental evaluation of our paper.