Dynamic Superblock Pruning for Fast Learned Sparse Retrieval

Parker Carlson, Wentai Xie, Shanxiu He, Tao Yang

TL;DR

The paper addresses the challenge of fast sparse retrieval with learned representations on CPU by introducing Dynamic Superblock Pruning (SP), which adds a superblock-level pruning layer to the traditional block-based traversal. SP precomputes per-block maxima $W_{B,t}$, per-superblock maxima $W_{X,t}$, and per-superblock averages $\bar{W}_{X,t}$ offline. Online, it uses the bounds $SBMax(X)=\sum_{t\in Q} q_t W_{X,t}$ and $\overline{SBMax}(X)=\sum_{t\in Q} q_t \bar{W}_{X,t}$ to prune superblocks with $SBMax(X)\le\theta/\mu$ and $\overline{SBMax}(X)\le\theta/\eta$, followed by block-level pruning via $BoundSum(B)\le\theta/\eta$. During inference, unpruned blocks are scored in descending order of $BoundSum$, and a cache-optimized superblock-at-a-time traversal (SaaT) yields major speedups at the cost of additional storage for the superblock statistics. Empirical results on MS MARCO with SPLADE and E-SPLADE show SP substantially faster than state-of-the-art baselines (BMP, ASC, Seismic) under high-relevance budgets, with notable gains from the SaaT traversal and robust performance across parameter settings. The work highlights SP as a viable path to fast, high-relevance sparse retrieval on commodity CPUs and outlines directions for integrating SP with index compression and complementary pruning techniques.
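
The two-level pruning described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_index`, `search`, `dot`), the dictionary-based index layout, and the choice to average over child-block maxima are assumptions made for illustration. The sketch also assumes non-negative term weights, that $\theta$ is the current $k$-th best score, that $BoundSum(B)=\sum_{t\in Q} q_t W_{B,t}$, and that $0 < \mu \le \eta \le 1$. It omits the cache-oriented layout and any index compression.

```python
import heapq
from collections import defaultdict


def dot(query, weights):
    """Sum of q_t * w_t over the query terms (weights assumed non-negative)."""
    return sum(q * weights.get(t, 0.0) for t, q in query.items())


def build_index(docs, b, c):
    """Offline step: group docs into blocks of b and blocks into superblocks of c."""
    blocks = []
    for start in range(0, len(docs), b):
        ids = list(range(start, min(start + b, len(docs))))
        w_max = defaultdict(float)  # W_{B,t}: max weight of term t in the block
        for d in ids:
            for t, w in docs[d].items():
                w_max[t] = max(w_max[t], w)
        blocks.append({"docs": ids, "max": dict(w_max)})

    superblocks = []
    for start in range(0, len(blocks), c):
        children = blocks[start:start + c]
        sb_max = defaultdict(float)  # W_{X,t}
        sb_sum = defaultdict(float)
        for blk in children:
            for t, w in blk["max"].items():
                sb_max[t] = max(sb_max[t], w)
                sb_sum[t] += w
        # Assumption: the average is taken over the child-block maxima.
        sb_avg = {t: s / len(children) for t, s in sb_sum.items()}
        superblocks.append({"blocks": children, "max": dict(sb_max), "avg": sb_avg})
    return superblocks


def search(superblocks, docs, query, k, mu=1.0, eta=1.0):
    """Online step: return the top-k (score, doc_id) pairs, best first."""
    heap = []  # min-heap holding the current top-k candidates

    def theta():
        # Until k documents have been scored, nothing may be pruned.
        return heap[0][0] if len(heap) == k else float("-inf")

    # Visit superblocks one at a time, most promising SBMax first.
    ranked = sorted(superblocks, key=lambda x: dot(query, x["max"]), reverse=True)
    for sb in ranked:
        sb_max = dot(query, sb["max"])
        sb_avg = dot(query, sb["avg"])
        if sb_max <= theta() / mu and sb_avg <= theta() / eta:
            continue  # superblock-level pruning
        # Score surviving child blocks in descending order of BoundSum.
        bounds = sorted(
            ((dot(query, blk["max"]), i) for i, blk in enumerate(sb["blocks"])),
            reverse=True,
        )
        for bound, i in bounds:
            if bound <= theta() / eta:
                continue  # block-level pruning
            for d in sb["blocks"][i]["docs"]:
                score = dot(query, docs[d])
                if len(heap) < k:
                    heapq.heappush(heap, (score, d))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, d))
    return sorted(heap, reverse=True)
```

In this sketch, setting `mu = eta = 1.0` returns the same top-k scores as exhaustive scoring (up to ties at the threshold), because every pruned group has an upper bound no greater than the current threshold. Smaller values of `mu` and `eta` prune more aggressively and make the result approximate.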

Abstract

This paper proposes superblock pruning (SP) during top-k online document retrieval for learned sparse representations. SP structures the sparse index as a set of superblocks on a sequence of document blocks and conducts a superblock-level selection to decide if some superblocks can be pruned before visiting their child blocks. SP generalizes the previous flat block or cluster-based pruning, allowing the early detection of groups of documents that cannot or are less likely to appear in the final top-k list. SP can accelerate sparse retrieval in a rank-safe or approximate manner under a high-relevance competitiveness constraint. Our experiments show that the proposed scheme significantly outperforms state-of-the-art baselines on MS MARCO passages on a single-threaded CPU.

Paper Structure

This paper contains 5 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Superblock and block pruning during traversal
  • Figure 2: Control flow for maximum score computation
  • Figure 3: Top: Total latency of SP and BMP when varying $b$ under safe pruning. Cost breakdown in block and superblock filtering, and in document scoring of each un-pruned block