Fast Exact Retrieval for Nearest-neighbor Lookup (FERN)

Richard Zhu

TL;DR

FERN introduces a kd-tree-inspired binary-tree data structure for exact nearest-neighbor retrieval in high-dimensional spaces, using hyperplanes defined by child vectors to guide insertion and lookup. It achieves $O(d\log N)$ lookups with 100% recall for vectors of dimensionality $d\approx 128$ to $784$ and database sizes up to $N\approx 10^7$, demonstrated empirically on random high-dimensional vectors. Unlike many bucketing or approximate methods, FERN targets exact retrieval while maintaining fast insertions and logarithmic-depth trees, with potential improvements via Red-Black balancing. The work highlights how strongly pruning efficiency depends on hyperplane boundaries and their sharpness, and points to future directions, including graph-based enhancements, to push toward sub-linear exact retrieval in practice.
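To make the idea of "hyperplanes defined by child vectors" concrete, here is a minimal sketch of one way such a tree could route vectors. It is not the author's implementation: the names `Node`, `insert`, `lookup`, and `closer_child` are hypothetical, and the routing rule (descend toward whichever child vector is closer, i.e. the side of the perpendicular bisector between the two children) is an assumption based only on the summary above. The sketch covers retrieval of a stored vector, not nearest-neighbor search with pruning.

```python
class Node:
    """A tree node holding one stored vector (hypothetical layout)."""
    def __init__(self, vec):
        self.vec = vec
        self.left = None
        self.right = None


def sq_dist(a, b):
    """Squared Euclidean distance, O(d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closer_child(node, vec):
    # Assumption: the splitting hyperplane is the perpendicular bisector
    # of the two child vectors, so "which side" == "which child is closer".
    if sq_dist(vec, node.left.vec) <= sq_dist(vec, node.right.vec):
        return node.left
    return node.right


def insert(root, vec):
    """Insert vec and return the (possibly new) root."""
    if root is None:
        return Node(vec)
    node = root
    while True:
        if node.left is None:
            node.left = Node(vec)
            return root
        if node.right is None:
            node.right = Node(vec)
            return root
        node = closer_child(node, vec)


def lookup(root, vec):
    """Return the node storing exactly vec, or None if it is absent."""
    node = root
    while node is not None:
        if node.vec == vec:
            return node
        if node.left is None:
            return None
        # With a single child, that child is the only place left to look.
        node = node.left if node.right is None else closer_child(node, vec)
    return None
```

Because a node's children never change once set, a stored vector is found by repeating the same comparisons that placed it, at a cost of two $O(d)$ distance computations per level. The number of levels is logarithmic in $N$ only if the tree stays balanced, which this sketch does not enforce.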

Abstract

Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling, vector retrieval, can be computationally complex. This is exacerbated when retrieving vectors which have high dimension $d$ relative to the number of vectors, $N$, in the database. Exact nearest neighbor retrieval has been generally acknowledged to be an $O(Nd)$ problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimension (e.g. $d=2$ or $d=3$ cases), kd-trees provide an $O(d\log N)$ algorithm for retrieval. Unfortunately, in practice the algorithm deteriorates rapidly to an $O(dN)$ solution at high dimensions (e.g. $d=128$). We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by kd-trees. The algorithm achieves $O(d\log N)$ look-up with 100% recall on 10 million $d=128$ uniformly randomly generated vectors. Code available at https://github.com/RichardZhu123/ferns
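The contrast the abstract draws between brute force and the $d=1$ binary-search intuition can be illustrated with a short sketch. The function names `brute_force_nn` and `nn_1d` are hypothetical and not taken from the paper's code. Both functions assume a non-empty database, and `nn_1d` assumes its input is already sorted.

```python
import bisect


def brute_force_nn(db, q):
    """Exact nearest neighbor by linear scan: N distance computations
    of O(d) each, so O(Nd) overall."""
    return min(db, key=lambda v: sum((a - b) ** 2 for a, b in zip(v, q)))


def nn_1d(sorted_vals, q):
    """Exact nearest neighbor for d=1 over a sorted list in O(log N).

    Binary search finds the insertion point; the nearest value must be
    one of the (at most two) neighbors of that point.
    """
    i = bisect.bisect_left(sorted_vals, q)
    candidates = sorted_vals[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda v: abs(v - q))
```

In one dimension, sorting gives a total order that lets each comparison discard half of the database. The difficulty described in the abstract is that no such total order exists for $d>1$, which is why kd-trees, and FERN after them, rely on hyperplane splits instead.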

Paper Structure (12 sections, 1 equation, 2 figures, 1 table, 2 algorithms)


Figures (2)

  • Figure 1: FERN look-up time for vectors with $d=128$, averaged over 1000 vectors randomly sampled from the database.
  • Figure 2: FERN look-up time on the train portion (60k-100k vectors) of popular Euclidean-distance-based vector retrieval benchmarks, averaged over 1000 vectors randomly sampled from the database. We evaluate 4 decades on each dataset, which is why the SIFT-128-Euclidean evaluation starts with vector databases of size $10^3$ rather than $600$.