A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search

Thomas Vecchiato, Claudio Lucchese, Franco Maria Nardini, Sebastian Bruch

TL;DR

This work reframes routing in clustering-based approximate nearest neighbor search as a ranking problem and shows that a lightweight linear routing function can be learned effectively via a cross-entropy surrogate aligned with Mean Reciprocal Rank (MRR). Using oracle routing labels derived from exact search, the method trains a routing function $f(q) = Wq$ to score partitions, improving top-$1$ accuracy for Maximum Inner Product Search (MIPS). The approach generalizes to top-$k$ routing and yields consistent gains across multiple datasets and clustering variants, with particularly large improvements when the probe budget is small. The findings suggest practical, production-friendly gains and motivate further integration of learning-to-rank techniques into both the routing and clustering stages of ANN systems.
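
One common way to write such a surrogate is the softmax cross-entropy below. The notation is chosen here for illustration and is not taken from the paper: it assumes $L$ partitions, rows $w_1, \dots, w_L$ of $W$, and an oracle label $y(q)$ naming the partition that contains the exact maximizer for query $q$.

$$\mathcal{L}(W) = -\,\mathbb{E}_{q}\left[\log \frac{\exp\left(\langle w_{y(q)}, q \rangle\right)}{\sum_{j=1}^{L} \exp\left(\langle w_j, q \rangle\right)}\right]$$

At query time, the $\ell$ partitions with the largest scores $\langle w_j, q \rangle$ are probed. Under this notation, a conventional centroid-based router is the special case in which each $w_j$ is the centroid of partition $j$.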

Abstract

A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search. Its objective is to return a set of $k$ data points that are closest to a query point, with its accuracy measured by the proportion of exact nearest neighbors captured in the returned set. One popular approach to this question is clustering: The indexing algorithm partitions data points into non-overlapping subsets and represents each partition by a point such as its centroid. The query processing algorithm first identifies the nearest clusters -- a process known as routing -- then performs a nearest neighbor search over those clusters only. In this work, we make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank. Interestingly, ground-truth is often freely available: Given a query distribution in a top-$k$ configuration, the ground-truth is the set of clusters that contain the exact top-$k$ vectors. We develop this insight and apply it to Maximum Inner Product Search (MIPS). As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS.
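
The pipeline described above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian data, the plain Lloyd's k-means, the full-batch gradient descent, and every function name (`kmeans`, `oracle_partitions`, `train_router`, `routing_accuracy`) are hypothetical choices made for the example. It covers only the top-1 case, where the label for a query is the partition holding its exact inner-product maximizer.

```python
import numpy as np

def kmeans(X, L, iters=10, seed=0):
    """Plain Lloyd's k-means; returns centroids and each point's partition id."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=L, replace=False)].copy()
    for _ in range(iters):
        dist2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        assign = dist2.argmin(axis=1)
        for j in range(L):
            members = X[assign == j]
            if len(members):  # leave an empty partition's centroid unchanged
                C[j] = members.mean(axis=0)
    return C, assign

def oracle_partitions(Q, X, assign):
    """Ground truth for top-1 MIPS: partition of each query's exact maximizer."""
    return assign[(Q @ X.T).argmax(axis=1)]

def cross_entropy(W, Q, y):
    """Mean softmax cross-entropy of partition scores Q @ W.T against labels y."""
    logits = Q @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

def train_router(Q, y, W0, lr=0.5, epochs=200):
    """Full-batch gradient descent on the cross-entropy, starting from W0."""
    W = W0.copy()
    onehot = np.eye(W.shape[0])[y]
    for _ in range(epochs):
        logits = Q @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * (p - onehot).T @ Q / len(Q)
    return W

def routing_accuracy(W, Q, y, ell):
    """Fraction of queries whose oracle partition is among the top-ell scored."""
    top = np.argsort(-(Q @ W.T), axis=1)[:, :ell]
    return (top == y[:, None]).any(axis=1).mean()

# Synthetic setup (assumption): points with varying norms, Gaussian queries.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16)) * rng.uniform(0.5, 2.0, size=(2000, 1))
Q_train = rng.normal(size=(1500, 16))
Q_test = rng.normal(size=(500, 16))

C, assign = kmeans(X, L=20)
y_train = oracle_partitions(Q_train, X, assign)
y_test = oracle_partitions(Q_test, X, assign)

# Initialise the learnt router at the centroids, i.e. at the baseline router.
W = train_router(Q_train, y_train, W0=C)

for ell in (1, 2, 5):
    base = routing_accuracy(C, Q_test, y_test, ell)
    learnt = routing_accuracy(W, Q_test, y_test, ell)
    print(f"ell={ell}: centroid routing {base:.3f}, learnt routing {learnt:.3f}")
```

Starting from the centroids means training can only lower the training cross-entropy relative to the baseline router; whether held-out routing accuracy improves on this toy data depends on the data and hyperparameters, so the printed numbers should be read as a sanity check rather than a reproduction of the paper's results.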

Paper Structure

This paper contains 8 sections, 3 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Top-$10$ accuracy as a function of $\ell$ (expressed as percent of total number of partitions, $L$), on the all-MiniLM-L6-v2 embeddings. In all figures, the dashed lines indicate the baseline and the solid lines show the performance of the learnt routing function.