Vector search with small radiuses

Gergely Szilvasy, Pierre-Emmanuel Mazaré, Matthijs Douze

TL;DR

The paper reframes the evaluation of vector search from top-k recall to range search under a post-verification budget, and introduces the Range Search Metric (RSM) to quantify end-to-end usefulness without running a costly end-to-end evaluation. It models the probability f(r) that a result at distance r from the query is a positive match and estimates f by isotonic regression, which allows fast assessment of index and encoding choices in bulk settings. On RunOfTheMill, a dataset based on YFCC100M, it shows that near-optimal range-search performance can be reached by visiting only a limited set of clusters close to the query and using compact encodings, and that for range search the quality of the coarse quantization matters more than extremely accurate vector representations. These findings provide practical guidelines for designing scalable range-search pipelines with efficient pre-filtering and budget-aware verification in real-world image retrieval tasks.

Abstract

In recent years, the dominant accuracy metric for vector search has been the recall of a result list of fixed size (top-k retrieval), considering as ground truth the exact vector retrieval results. Although convenient to compute, this metric is only distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision needs to be taken depending on the vector retrieval results, for example, deciding whether a query image matches a database image or not. We solve this as a range search task, where all vectors within a certain radius from the query are returned. We show that the value of a range search result can be modeled rigorously based on the query-to-vector distance. This yields a metric for range search, RSM, that is both principled and easy to compute without running an end-to-end evaluation. We apply this metric to the case of image retrieval. We show that indexing methods that are adapted for top-k retrieval do not necessarily maximize the RSM. In particular, for inverted-file-based indexes, we show that visiting a limited set of clusters and encoding vectors compactly yields near-optimal results.
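
To make the contrast between the two search modes concrete, here is a minimal brute-force sketch in plain Python. It is not the paper's implementation: the function names, the toy 2-D vectors, and the radius value are all assumptions chosen for illustration. Range search returns every database vector whose squared distance to the query is within the radius, so the result list can be empty; top-k search always returns k results, however far away they are.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def range_search(query, database, radius_sq):
    """Return (index, squared distance) for all vectors within the radius."""
    hits = ((i, squared_l2(query, v)) for i, v in enumerate(database))
    return sorted((h for h in hits if h[1] <= radius_sq), key=lambda h: h[1])

def knn_search(query, database, k):
    """Return the k closest vectors as (index, squared distance)."""
    scored = ((squared_l2(query, v), i) for i, v in enumerate(database))
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]

# Hypothetical toy data: four database vectors and two queries.
database = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (3.0, 4.0)]
near_query = (0.0, 0.05)    # lies close to database vectors 0 and 1
far_query = (10.0, 10.0)    # has no close database vector

range_near = range_search(near_query, database, radius_sq=0.05)
range_far = range_search(far_query, database, radius_sq=0.05)
knn_far = knn_search(far_query, database, k=2)

print([i for i, _ in range_near])  # indices 0 and 1
print(range_far)                   # empty: nothing within the radius
print([i for i, _ in knn_far])     # two results, although both are far away
```

The query with no nearby vector gets an empty range-search result but still two top-k results, which is why a fixed-size result list is a poor fit when a hard match/no-match decision follows.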

Paper Structure (34 sections, 10 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Examples of image pairs from the dataset.
  • Figure 2: Distribution of the number of matching database vectors per query vector. The queries are sorted by decreasing number of results.
  • Figure 3: Number of positives found after the filtering stage (with the strict and relaxed settings) for two types of vector search: range search and k nearest neighbor (knn) search. For some points, we indicate the search radius ($r^2$, for range search) and the number of results per query ($k$, for knn search). The dotted lines are the estimated counts based on the RSM.
  • Figure 4: Distribution of distances between points for a uniform spherical distribution and a Gaussian distribution in dimensions $d=10$ and 100. The vector distributions are scaled so that the mode of each distance distribution is at 1.
  • Figure 5: Estimated positive probability for the RunOfTheMill dataset. The positive probabilities are obtained by isotonic regression.
  • ...and 2 more figures
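
The caption of Figure 5 mentions that the positive probabilities are obtained by isotonic regression. The sketch below shows one way such a fit could be done, using the pool-adjacent-violators algorithm in plain Python. Several things here are assumptions rather than details from the paper: that f(r) is constrained to be non-increasing in the distance, the toy (distance, label) pairs, and all function names. The last helper sums f(r) over the distances of returned results to estimate a count of positives, which is one plausible reading of the "estimated counts" in Figure 3; the paper's exact RSM definition is not reproduced in this summary.

```python
import bisect

def fit_nonincreasing(distances, labels):
    """Fit a non-increasing step function to 0/1 labels (pool adjacent violators).

    Returns a list of (upper_distance_bound, probability), sorted by distance.
    """
    blocks = []  # each block is [sum_of_labels, count, largest_distance]
    for d, y in sorted(zip(distances, labels)):
        blocks.append([float(y), 1, d])
        # Merge while the newest block has a higher mean than the one before it.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            s, n, last = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = last
    return [(last, s / n) for s, n, last in blocks]

def predict(model, r):
    """Evaluate the fitted step function at distance r.

    Distances beyond the last block are clamped to the last block's value.
    """
    bounds = [bound for bound, _ in model]
    i = min(bisect.bisect_left(bounds, r), len(model) - 1)
    return model[i][1]

def expected_positives(model, result_distances):
    """Sum of f(r) over returned results: an estimated number of positives."""
    return sum(predict(model, r) for r in result_distances)

# Hypothetical verification outcomes: 1 = verified match, 0 = not a match.
distances = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [1, 1, 0, 1, 0, 0]

model = fit_nonincreasing(distances, labels)
estimate = expected_positives(model, [0.15, 0.35, 0.55])
print(model)
print(estimate)
```

In this toy data the labels at distances 0.3 and 0.4 violate monotonicity (a non-match followed by a match), so they are pooled into a single block with probability 0.5. Once f is fitted, scoring a candidate index only requires the distances it returns, not a new verification run.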