Learning-Based Hashing for ANN Search: Foundations and Early Advances

Sean Moran

TL;DR

This survey traces the foundations of learning-based hashing for approximate nearest neighbour search, contrasting data-independent LSH with data-driven projection and quantisation strategies. It shows how multi-threshold quantisation, variance-balancing projections, and cross-modal extensions progressively improved retrieval effectiveness before the deep-learning era reshaped the field. Key contributions include structured analyses of PCAH, SH, ITQ, AGH, and cross-modal methods (CVH, CRH, CMSSH, PDH, IMH), along with a critical look at evaluation practices. The paper highlights enduring lessons on data-aware encoding, the trade-off between code length and recall, and the importance of reproducible benchmarks, while pointing toward future opportunities in online, multilingual, and end-to-end learning frameworks.

Abstract

Approximate Nearest Neighbour (ANN) search is a fundamental problem in information retrieval, underpinning large-scale applications in computer vision, natural language processing, and cross-modal search. Hashing-based methods provide an efficient solution by mapping high-dimensional data into compact binary codes that enable fast similarity computations in Hamming space. Over the past two decades, a substantial body of work has explored learning to hash, where projection and quantisation functions are optimised from data rather than chosen at random. This article offers a foundational survey of early learning-based hashing methods, with an emphasis on the core ideas that shaped the field. We review supervised, unsupervised, and semi-supervised approaches, highlighting how projection functions are designed to generate meaningful embeddings and how quantisation strategies convert these embeddings into binary codes. We also examine extensions to multi-bit and multi-threshold models, as well as early advances in cross-modal retrieval. Rather than providing an exhaustive account of the most recent methods, our goal is to introduce the conceptual foundations of learning-based hashing for ANN search. By situating these early models in their historical context, we aim to equip readers with a structured understanding of the principles, trade-offs, and open challenges that continue to inform current research in this area.

Paper Structure

This paper contains 67 sections, 91 equations, 33 figures, 2 tables, 5 algorithms.

Figures (33)

  • Figure 1: The number of images uploaded to popular social media websites (Facebook, Flickr) and mobile applications (Instagram, SnapChat, WhatsApp) has grown dramatically since 2005. Efficient algorithms for searching through such large image datasets are needed now more than ever. This chart has been copied directly from slide 62 of the talk "Internet Trends 2014 - Code Conference" given by the venture capitalist Mary Meeker of Kleiner Perkins Caufield Byers (KPCB): http://www.kpcb.com/blog/2014-internet-trends (URL accessed on 16/12/15).
  • Figure 2: Nearest neighbour search with hashcodes. Similarity-preserving binary codes generated by a hash function $\mathcal{H}$ can be used as the indices into the buckets of a hashtable for constant-time search. Only those images that are in the same bucket as the query need be compared, thereby reducing the size of the search space. The focus of this review is learning the hash function $\mathcal{H}$ to maximise the similarity of hashcodes for similar data-points. On the right-hand side we present examples of tasks for which nearest neighbour search has proved to be fundamental: from content-based information retrieval (IR) to near-duplicate detection and location recognition. The three images on the right have been taken from Imense Ltd (http://www.imense.com) and from Doersch12, Xu10, Grauman13.
  • Figure 3: The projection and quantisation operations. In Figure \ref{fig:ch1_pipeline_1} a 2D space is partitioned with two hyperplanes $\textbf{h}_{1}$ and $\textbf{h}_{2}$ with normal vectors $\textbf{w}_{1}, \textbf{w}_{2}$, creating four buckets. Data-points are shown as coloured shapes, with similar data-points having the same colour and shape. The hashcode for each data-point is found by taking the dot product of its feature representation with the normal vector ($\textbf{w}_{1}$, $\textbf{w}_{2}$) of each hyperplane. The resulting projected dimensions are binarised by thresholding at zero (Figure \ref{fig:ch1_pipeline_2}) with two thresholds $t_{1}$, $t_{2}$. Concatenating the resulting bits yields a 2-bit hashcode for each data-point (indicated by the unfilled squares). For example, the projection of data-point $a$ onto normal vector $\textbf{w}_{1}$ is greater than threshold $t_{1}$, and so a '1' is appended to its hashcode. Data-point $a$'s projection onto normal vector $\textbf{w}_{2}$ is also greater than $t_{2}$, and so a '1' is further appended to its hashcode. The hashcode for data-point $a$ is therefore '11', which is also the label for the top-right region of the feature space in Figure \ref{fig:ch1_pipeline_1}.
  • Figure 4: Overview of one possible categorisation of the field of hashing-based ANN search. The main categories are shown in grey, while specific models are listed in white alongside their corresponding section numbers.
  • Figure 5: The $(c,R)$-approximate NN problem: in many applications it is acceptable to retrieve a data point (circle) within distance $cR$ of the query point $\textbf{x}$, where $R$ is the distance to the exact NN.
  • ...and 28 more figures
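
The projection, quantisation, and bucket-lookup steps described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the survey's implementation: the hyperplane normals `W`, the zero thresholds `t`, the toy data-points in `X`, and the helper name `hash_codes` are all hypothetical values chosen so that the first point reproduces the '11' hashcode of data-point $a$.

```python
import numpy as np
from collections import defaultdict


def hash_codes(X, W, t):
    """Project each row of X onto the hyperplane normals (columns of W),
    then binarise each projected dimension against its threshold in t."""
    projections = X @ W                      # shape: (n_points, n_bits)
    return (projections > t).astype(np.uint8)


# Assumed toy setup: two axis-aligned hyperplane normals w1, w2 in 2D,
# both thresholds at zero (hypothetical values, for illustration only).
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])                   # columns are w1 and w2
t = np.zeros(2)

X = np.array([[ 2.0,  1.5],                  # point "a": both projections > 0
              [-1.0,  2.0],
              [ 1.8,  1.0],
              [-2.0, -0.5]])

codes = hash_codes(X, W, t)

# Index the database: the concatenated bits act as the hashtable bucket key.
buckets = defaultdict(list)
for index, bits in enumerate(codes.tolist()):
    buckets["".join(str(b) for b in bits)].append(index)

# Query time: only points sharing the query's bucket are candidate neighbours.
query = np.array([[1.0, 3.0]])
query_key = "".join(str(b) for b in hash_codes(query, W, t)[0].tolist())
candidates = buckets.get(query_key, [])

print(codes.tolist())
print(query_key, candidates)
```

In a learning-based method, `W` and `t` would be optimised from data rather than fixed by hand as they are here; the lookup itself is unchanged, which is why the choice of projection and quantisation functions determines how well similar points collide in the same bucket.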

Theorems & Definitions (3)

  • Definition 4.1: Randomised $c$-approximate $R$-near neighbour problem
  • Definition 4.2: Randomised $R$-near neighbour problem
  • Definition 4.3: Locality-sensitive hash function family