Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion

Samuel McCauley

TL;DR

The paper tackles the space-efficiency bottleneck of high-dimensional approximate nearest neighbor (ANN) search by applying function inversion to locality-sensitive hashing (LSH). It first presents a black-box method that applies to any LSH family with collision parameters $p_1,p_2$, where $\rho = \log p_1/\log p_2$ and $T$ is the hash evaluation time. For a space-saving parameter $s<\rho$, the method reduces space to $\widetilde{O}(n^{1+\rho-s})$ at the cost of $\widetilde{O}(T n^{\rho+3s})$ query time, thus broadening the applicability of space-efficient ANN. Building on this, the paper combines function inversion with the near-linear-space ALRW data structure to obtain improved Euclidean ANN bounds, achieving a query-time exponent $\alpha(c)=\frac{2c^2-1}{c^4}\left(1-\frac{(c^2-1)^2}{4c^4+(c^2-1)^2}\right)$ that improves upon ALRW’s $\alpha_{ALRW}(c)=(2c^2-1)/c^4$ for many $c$, while maintaining near-linear space and modest preprocessing costs (e.g., $n^{1.011 + o(1)}$ preprocessing time for $c=1.5$). The improvement extends to Manhattan ANN via a standard embedding. Because ALRW was previously shown to be optimal among list-of-points data structures, the results also imply that list-of-points data structures are not optimal for Euclidean or Manhattan ANN. Overall, the paper delivers black-box space improvements and faster query times for high-dimensional ANN, with implications for broader use of implicit data structures and function-inversion techniques in similarity search.
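To make the two query-time exponents above concrete, the following minimal sketch evaluates both formulas exactly as they are written in this summary. The function names `alpha` and `alpha_alrw` are illustrative choices, not names from the paper, and the sample values of $c$ are arbitrary.

```python
def alpha_alrw(c: float) -> float:
    """Query-time exponent quoted above for ALRW: (2c^2 - 1) / c^4."""
    return (2 * c**2 - 1) / c**4


def alpha(c: float) -> float:
    """Query-time exponent quoted above for the function-inversion structure.

    Computed as alpha_alrw(c) times the correction factor
    1 - (c^2 - 1)^2 / (4c^4 + (c^2 - 1)^2).
    """
    shrink = (c**2 - 1) ** 2 / (4 * c**4 + (c**2 - 1) ** 2)
    return alpha_alrw(c) * (1 - shrink)


if __name__ == "__main__":
    # Arbitrary sample approximation ratios, chosen only for illustration.
    for c in (1.1, 1.5, 2.0, 3.0):
        print(f"c={c:.1f}  alpha={alpha(c):.3f}  alpha_ALRW={alpha_alrw(c):.3f}")
```

At $c = 1$ the correction factor equals 1, so the two expressions coincide there; for the sample values above the factor is below 1, which is where the improvement comes from.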

Abstract

Approximate nearest neighbor search (ANN) data structures have widespread applications in machine learning, computational biology, and text processing. The goal of ANN is to preprocess a set S so that, given a query q, we can find a point y whose distance from q approximates the smallest distance from q to any point in S. For most distance functions, the best-known ANN bounds for high-dimensional point sets are obtained using techniques based on locality-sensitive hashing (LSH). Unfortunately, space efficiency is a major challenge for LSH-based data structures. Classic LSH techniques require a very large amount of space, oftentimes polynomial in |S|. A long line of work has developed intricate techniques to reduce this space usage, but these techniques suffer from downsides: they must be hand-tailored to each specific LSH, are often complicated, and their space reduction comes at the cost of significantly increased query times. In this paper we explore a new way to improve the space efficiency of LSH using function inversion techniques, originally developed in (Fiat and Naor 2000). We begin by describing how function inversion can be used to improve LSH data structures. This gives a fairly simple, black-box method to reduce LSH space usage. Then, we give a data structure that leverages function inversion to improve the query time of the best-known near-linear-space data structure for approximate nearest neighbor search under Euclidean distance: the ALRW data structure of (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017). ALRW was previously shown to be optimal among "list-of-points" data structures for both Euclidean and Manhattan ANN; thus, in addition to giving improved bounds, our results imply that list-of-points data structures are not optimal for Euclidean or Manhattan ANN.

Paper Structure

This paper contains 33 sections, 15 theorems, 14 equations, 3 figures, and 1 table.

Key Result

Theorem 1

For any locality-sensitive hash family $\mathcal{L}$ (where $\mathcal{L}$ has $\rho = \log p_1/\log p_2$ and evaluation time $T$, and storing a given $\ell\in \mathcal{L}$ requires $O(n^{1-\rho})$ space), approximation ratio $c$, and space-saving parameter $s < \rho$, there exists an ANN data structure ...
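As a rough illustration of the parameters in Theorem 1, the sketch below computes $\rho$ from a pair of collision probabilities and then the space and query exponents quoted in the TL;DR, namely $\widetilde{O}(n^{1+\rho-s})$ space and $\widetilde{O}(T n^{\rho+3s})$ query time. The helper names, the sample probabilities, and the lower bound $s \ge 0$ are assumptions made for this example; the factor $T$ and polylogarithmic terms are ignored.

```python
import math


def lsh_rho(p1: float, p2: float) -> float:
    """rho = log p1 / log p2 for collision probabilities with 0 < p2 < p1 < 1."""
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("expected 0 < p2 < p1 < 1")
    return math.log(p1) / math.log(p2)


def tradeoff_exponents(rho: float, s: float) -> tuple[float, float]:
    """Return (space exponent, query exponent) of n for space-saving parameter s.

    Uses the bounds quoted in the TL;DR: space n^(1 + rho - s) and
    query time n^(rho + 3s), ignoring T and polylog factors.
    The check s >= 0 is an assumption of this sketch; the theorem states s < rho.
    """
    if not (0.0 <= s < rho):
        raise ValueError("expected 0 <= s < rho")
    return 1.0 + rho - s, rho + 3.0 * s


if __name__ == "__main__":
    rho = lsh_rho(0.5, 0.25)  # hypothetical collision probabilities
    for s in (0.0, 0.05, 0.1):
        space, query = tradeoff_exponents(rho, s)
        print(f"s={s:.2f}  space exponent={space:.2f}  query exponent={query:.2f}")
```

Setting $s = 0$ in these expressions gives exponents $1+\rho$ and $\rho$, and each unit of space exponent saved adds three units to the query exponent.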

Figures (3)

  • Figure 1: A comparison of $\alpha(c)$ from Theorem \ref{thm:running_time} to $ALRW(c)$. The $y$-axis represents the exponent of the query time: our results obtain a linear-space data structure with query time $n^{\alpha(c) + o(1)}$, compared to the state of the art in (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017) with query time $n^{ALRW(c) + o(1)}$.
  • Figure 2: A table comparing the exponent of the query time of our linear-space approach vs. that of (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017). All values are rounded to the third decimal place. In the final column, we give the exponent of the preprocessing time (e.g., $n^{1.011 + o(1)}$ time for $c=1.5$).
  • Figure 3: Creating the truncated tree in two steps.

Theorems & Definitions (25)

  • Theorem 1
  • Theorem 2
  • Definition 1
  • Lemma 3: Chernoff 1952; Mitzenmacher and Upfal 2017
  • Corollary 4
  • Lemma 5: Fiat and Naor 2000
  • Theorem 6
  • Lemma 7
  • Proof
  • Proof of Theorem \ref{thm:fiatnaor_all}
  • ...and 15 more