Beyond Quantile Methods: Improved Top-K Threshold Estimation for Traditional and Learned Sparse Indexes

Jinrui Gou, Yifan Liu, Minghao Shao, Torsten Suel

TL;DR

This work tackles top-$k$ threshold estimation: computing tight, safe lower-bound estimates of the $k$-th highest score in order to accelerate disjunctive top-$k$ queries. Building on quantile-based methods, it introduces enhancements such as multi-term subsets, duplicate-removal strategies, score combination, and selective lookups, with sampling and budgeting controls to balance accuracy, speed, and space. The methods are evaluated on traditional indexes and on learned sparse indexes (DocT5Query and DeepImpact), where they substantially improve the mean under-prediction fraction (MUF); lookups prove crucial for accuracy and performance gains, especially for longer queries and smaller $k$. The results demonstrate practical speedups for MaxScore and indicate that the approach is viable for real-world search systems with limited resources. The work also outlines future directions for space reduction and broader ranking-function integration.

Abstract

Top-k threshold estimation is the problem of estimating the score of the k-th highest ranking result of a search query. A good estimate can be used to speed up many common top-k query processing algorithms, and thus a number of researchers have recently studied the problem. Among the various approaches that have been proposed, quantile methods appear to give the best estimates overall at modest computational costs, followed by sampling-based methods in certain cases. In this paper, we make two main contributions. First, we study how to get even better estimates than the state of the art. Starting from quantile-based methods, we propose a series of enhancements that give improved estimates in terms of the commonly used mean under-prediction fraction (MUF). Second, we study the threshold estimation problem on recently proposed learned sparse index structures, showing that our methods also work well for these cases. Our best methods substantially narrow the gap between the state of the art and the ideal MUF of 1.0, at some additional cost in time and space.
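
To make the two central notions in the abstract concrete, the following is a minimal Python sketch of a basic single-term quantile estimate and of the MUF metric. It is an illustration under stated assumptions, not the authors' implementation. It assumes a disjunctive query whose document score is the sum of non-negative per-term scores, so the $k$-th highest score in any one term's posting list is a safe lower bound on the true top-$k$ threshold. The toy index, the function names, and the choice to skip queries with a zero threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy inverted index: term -> {doc_id: non-negative term score}.
Index = Dict[str, Dict[int, float]]


def true_threshold(index: Index, query: List[str], k: int) -> float:
    """Exact k-th highest score of a disjunctive query (sum of term scores)."""
    totals: Dict[int, float] = defaultdict(float)
    for term in query:
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    ranked = sorted(totals.values(), reverse=True)
    return ranked[k - 1] if len(ranked) >= k else 0.0


def quantile_estimate(index: Index, query: List[str], k: int) -> float:
    """Single-term quantile estimate: max over terms of the k-th highest term score.

    Safe (never above the true threshold) only if all term scores are non-negative.
    In practice these k-th scores would be precomputed, not sorted at query time.
    """
    best = 0.0
    for term in query:
        scores = sorted(index.get(term, {}).values(), reverse=True)
        if len(scores) >= k:
            best = max(best, scores[k - 1])
    return best


def mean_underprediction_fraction(index: Index, queries: List[List[str]], k: int) -> float:
    """Mean of estimate / true threshold over queries; 1.0 is ideal."""
    fractions = []
    for query in queries:
        threshold = true_threshold(index, query, k)
        if threshold > 0:  # assumption: skip queries with fewer than k results
            fractions.append(quantile_estimate(index, query, k) / threshold)
    return sum(fractions) / len(fractions) if fractions else 0.0


# Toy data for illustration only.
toy_index: Index = {
    "a": {1: 3.0, 2: 2.0, 3: 1.0},
    "b": {2: 2.5, 4: 0.5},
}
toy_muf = mean_underprediction_fraction(toy_index, [["a", "b"]], k=2)
```

On the toy data, the true top-2 threshold is 3.0 while the single-term estimate is 2.0, giving an under-prediction fraction of 2/3. The enhancements named in the abstract aim to push this ratio closer to 1.0.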

Paper Structure

This paper contains 9 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Comparison of methods for k = 10 and k = 1000, for ClueWeb09B.
  • Figure 2: Comparing singles, pairs, triples, and quadruples for ClueWeb09B.
  • Figure 3: Comparing singles, pairs, triples, and quadruples for MSMARCO.
  • Figure 4: MUF of different lookup ratios for k = 10 on ClueWeb09B.
  • Figure 5: MUF of different prefix configurations for k = 10, 100, 1000.
  • ...and 4 more figures