Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets

Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, Leonardo Venuta

TL;DR

Problem: evaluating the scalability of approximate sparse retrieval algorithms for learned sparse embeddings on web-scale collections (MsMarco v2, 138M passages). Approach: compare Seismic, with and without a $\kappa$-NN graph, against graph-based methods (PyAnn, GrassRMA, kANNolo Hnsw) on Splade embeddings, under memory budgets of up to $2\times$ the dataset size, and measure latency, accuracy, index size, and indexing time. Contributions: empirical scaling laws showing that Seismic achieves substantially lower latency and faster indexing than sparse Hnsw, with the $\kappa$-NN graph improving recall in the high-accuracy regime at the cost of extra memory and construction time. Significance: offers practical guidance for deploying approximate sparse retrieval at massive scale and points to future work on out-of-core and resource-limited deployments.

Abstract

Learned sparse text embeddings have gained popularity due to their effectiveness in top-$k$ retrieval and inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their use in real-world retrieval systems. That changed with the recent development of approximate algorithms that leverage the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, in much of the existing literature, evaluation has been limited to datasets with only a few million documents, such as MsMarco. It remains unclear how these systems behave on much larger datasets and what challenges arise at larger scales. To bridge that gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently proposed Seismic and graph-based solutions adapted from dense retrieval. We extensively evaluate Splade embeddings of 138M passages from MsMarco v2 and report indexing time and other efficiency and effectiveness metrics.

Paper Structure (5 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Comparison of Hnsw and Seismic (with and without $\kappa$-NN graph) by accuracy at $k=10$ as a function of query latency. We allow hyperparameters that result in an index whose size is at most $1.5\times$ (left) or $2\times$ (right) the size of the dataset.
  • Figure 2: Scaling laws of Seismic and sparse Hnsw (as provided by the kANNolo library). For each accuracy cutoff, we measure the ratio between the latency of a method on MsMarco v2 and on MsMarco.
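
The ratio described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes each method has been measured as a list of (accuracy, latency) pairs per dataset, and that the latency at a cutoff is taken from the fastest configuration reaching that accuracy; the paper may select configurations differently. The function names and all numbers are hypothetical and are not results from the paper.

```python
from typing import Dict, List, Optional, Tuple

# One (accuracy, latency) pair per tested configuration of a method on a dataset.
Curve = List[Tuple[float, float]]


def latency_at_cutoff(curve: Curve, cutoff: float) -> Optional[float]:
    """Lowest latency among configurations whose accuracy reaches the cutoff.

    Assumption: "latency at a cutoff" means the fastest qualifying configuration.
    Returns None when no configuration reaches the cutoff.
    """
    feasible = [latency for accuracy, latency in curve if accuracy >= cutoff]
    return min(feasible) if feasible else None


def latency_ratios(
    small: Curve, large: Curve, cutoffs: List[float]
) -> Dict[float, Optional[float]]:
    """Ratio of latency on the large dataset to latency on the small one, per cutoff."""
    ratios: Dict[float, Optional[float]] = {}
    for cutoff in cutoffs:
        on_small = latency_at_cutoff(small, cutoff)
        on_large = latency_at_cutoff(large, cutoff)
        if on_small is None or on_large is None:
            ratios[cutoff] = None  # cutoff not reached on at least one dataset
        else:
            ratios[cutoff] = on_large / on_small
    return ratios


# Made-up measurements (accuracy, latency in microseconds), for illustration only.
msmarco_curve = [(0.90, 100.0), (0.95, 200.0)]
msmarco_v2_curve = [(0.90, 400.0), (0.95, 1000.0)]

ratios = latency_ratios(msmarco_curve, msmarco_v2_curve, [0.90, 0.95, 0.99])
```

A ratio close to the growth in collection size would suggest roughly linear scaling of latency, while a smaller ratio would suggest sublinear scaling; a cutoff that one dataset never reaches is reported as `None` rather than extrapolated.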