Learning Cluster Representatives for Approximate Nearest Neighbor Search

Thomas Vecchiato

TL;DR

This work tackles scalable high-accuracy ANN search in dense vector spaces by focusing on clustering-based indexing. It introduces learning cluster representatives via a simple linear routing function learned through Learning-to-Rank, replacing standard centroids with learnt vectors to improve top-1 and top-k retrieval. Empirical results across multiple datasets, embedding models, and clustering variants demonstrate substantial accuracy gains, with the linear routing approach offering favorable efficiency-accuracy trade-offs and easy production integration. The study also investigates nonlinearity, concluding that a linear routing function provides the best balance, and suggests promising future extensions to other distances, larger k, and query-aware clustering. The findings highlight a productive bridge between Learning-to-Rank and ANN, with practical implications for fast, accurate vector search in real-world systems.

Abstract

Developing increasingly efficient and accurate algorithms for approximate nearest neighbor search is a paramount goal in modern information retrieval. A primary approach to this problem is clustering, which partitions the dataset into distinct groups, each characterized by a representative data point. With this method, retrieving the top-k data points for a query requires identifying the most relevant clusters based on their representatives (a routing step) and then conducting a nearest neighbor search within these clusters only, drastically reducing the search space. The objective of this thesis is not only to provide a comprehensive explanation of clustering-based approximate nearest neighbor search but also to introduce and examine every aspect of our novel state-of-the-art method, which originated from a natural observation: the routing function solves a ranking problem, making it amenable to learning-to-rank. Developing this intuition and applying it to maximum inner product search has led us to demonstrate that learning cluster representatives using a simple linear function significantly boosts the accuracy of clustering-based approximate nearest neighbor search.
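The two-stage procedure described in the abstract (route the query to a few clusters, then search only inside them) can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation: the function names `build_index` and `search`, the parameter `nprobe`, and the toy data are assumptions made for illustration, and the representatives here are plain cluster means, i.e. the standard baseline that the thesis proposes to replace with learnt vectors.

```python
from collections import defaultdict


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_index(points, assignments):
    """Group point ids by cluster id and compute one representative per cluster.

    Assumption: the representative is the cluster mean (the baseline);
    a learnt vector could be stored in `reps` instead.
    """
    clusters = defaultdict(list)
    for idx, cluster_id in enumerate(assignments):
        clusters[cluster_id].append(idx)
    reps = {}
    for cluster_id, ids in clusters.items():
        dim = len(points[ids[0]])
        reps[cluster_id] = [
            sum(points[i][d] for i in ids) / len(ids) for d in range(dim)
        ]
    return dict(clusters), reps


def search(query, points, clusters, reps, k=1, nprobe=1):
    """Return ids of the top-k points by inner product, probing nprobe clusters."""
    # Routing step: rank clusters by the score of their representative.
    ranked = sorted(reps, key=lambda c: dot(query, reps[c]), reverse=True)
    # Search step: score only the points stored in the selected clusters.
    candidates = [i for c in ranked[:nprobe] for i in clusters[c]]
    candidates.sort(key=lambda i: dot(query, points[i]), reverse=True)
    return candidates[:k]


# Toy data: four 2-d points already assigned to two clusters.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments = [0, 0, 1, 1]
clusters, reps = build_index(points, assignments)
top1 = search([1.0, 0.2], points, clusters, reps, k=1, nprobe=1)
```

Probing every cluster (`nprobe` equal to the number of clusters) reduces to exact brute-force search; smaller values of `nprobe` trade accuracy for speed, which is why the quality of the routing step matters.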

Paper Structure

This paper contains 36 sections, 17 equations, 10 figures, 3 tables, 3 algorithms.

Figures (10)

  • Figure 1: Top-$k$ retrieval problem based on the representation of documents and queries through vectors.
  • Figure 2: A visual representation of the vector search problem within the $\mathbb{R}^3$ space. Given a collection of documents, they are transformed into vector representations using an encoder. During online search, a query $q$ is vectorized using the same encoder, and a distance function is applied to retrieve the top-$k$ similar documents. In practical applications, not only are the number of documents extremely large, but the vectors themselves are often represented by hundreds or thousands of dimensions.
  • Figure 3: Distance metrics applied to two vectors, represented by the gray and orange dots, in a two-dimensional space $\mathbb{R}^2$. (a) Manhattan distance, L$1$ norm; (b) Euclidean distance, L$2$ norm; (c) Cosine distance; (d) Inner Product distance.
  • Figure 4: Clustering-Based Approximate Nearest Neighbor (ANN) Search. The left-hand side presents a visual representation of the space partitioned into clusters, while the right-hand side illustrates the corresponding index structure. Both figures depict the process of computing the similarity between a query point and the cluster representative points.
  • Figure 5: A ranking system is proposed that, given a set of documents and a query as input, outputs the documents ordered by their relevance to the query. This relevance is obtained through a ranking model, which assigns a score to each document-query pair, reflecting the document's pertinence to the query.
  • ...and 5 more figures

Theorems & Definitions (6)

  • Definition 2.1.1: Top-$k$ Retrieval
  • Definition 2.3.1: $k$-Nearest Neighbors search with L$1$ norm
  • Definition 2.3.2: $k$-Nearest Neighbors search with L$2$ norm
  • Definition 2.3.3: $k$-Maximum Cosine Similarity Search
  • Definition 2.3.4: $k$-Maximum Inner Product Search
  • Definition 3.1.1: KMeans problem
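
For orientation, the search problems named above are commonly written as follows. These are the standard textbook forms, not quoted from the thesis, and the notation is an assumption: $\mathcal{X} \subset \mathbb{R}^d$ is the collection, $q \in \mathbb{R}^d$ the query, and $\operatorname{arg\,min}^{(k)}$ / $\operatorname{arg\,max}^{(k)}$ return the $k$ points with the smallest / largest value.

$$
\begin{aligned}
k\text{-NN with L}1\text{ norm:}\quad & \operatorname*{arg\,min}^{(k)}_{u \in \mathcal{X}} \; \lVert q - u \rVert_1 \\
k\text{-NN with L}2\text{ norm:}\quad & \operatorname*{arg\,min}^{(k)}_{u \in \mathcal{X}} \; \lVert q - u \rVert_2 \\
k\text{-Maximum Cosine Similarity:}\quad & \operatorname*{arg\,max}^{(k)}_{u \in \mathcal{X}} \; \frac{\langle q, u \rangle}{\lVert q \rVert_2 \, \lVert u \rVert_2} \\
k\text{-Maximum Inner Product:}\quad & \operatorname*{arg\,max}^{(k)}_{u \in \mathcal{X}} \; \langle q, u \rangle
\end{aligned}
$$

The KMeans problem is usually stated as choosing a partition $C_1, \dots, C_K$ of $\mathcal{X}$ that minimizes the within-cluster squared distance to each cluster mean $\mu_j$ (here $K$, the number of clusters, is an assumed symbol, distinct from the retrieval depth $k$):

$$
\min_{C_1, \dots, C_K} \; \sum_{j=1}^{K} \sum_{u \in C_j} \lVert u - \mu_j \rVert_2^2,
\qquad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{u \in C_j} u .
$$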