Learning-Based Hashing for ANN Search: Foundations and Early Advances
Sean Moran
TL;DR
This survey traces the foundations of learning-based hashing for approximate nearest neighbour (ANN) search, contrasting data-independent locality-sensitive hashing (LSH) with data-driven projection and quantisation strategies. It shows how multi-threshold quantisation, variance-balancing projections, and cross-modal extensions progressively improved retrieval effectiveness before the deep-learning era reshaped the field. Key contributions include structured analyses of PCAH, SH, ITQ, AGH, and cross-modal methods (CVH, CRH, CMSSH, PDH, IMH), along with a critical look at evaluation practices. The survey highlights enduring lessons on data-aware encoding, the trade-off between code length and recall, and the importance of reproducible benchmarks, and it points toward future opportunities in online, multilingual, and end-to-end learning frameworks.
Abstract
Approximate Nearest Neighbour (ANN) search is a fundamental problem in information retrieval, underpinning large-scale applications in computer vision, natural language processing, and cross-modal search. Hashing-based methods provide an efficient solution by mapping high-dimensional data into compact binary codes that enable fast similarity computations in Hamming space. Over the past two decades, a substantial body of work has explored learning to hash, where projection and quantisation functions are optimised from data rather than chosen at random. This article offers a foundational survey of early learning-based hashing methods, with an emphasis on the core ideas that shaped the field. We review supervised, unsupervised, and semi-supervised approaches, highlighting how projection functions are designed to generate meaningful embeddings and how quantisation strategies convert these embeddings into binary codes. We also examine extensions to multi-bit and multi-threshold models, as well as early advances in cross-modal retrieval. Rather than providing an exhaustive account of the most recent methods, our goal is to introduce the conceptual foundations of learning-based hashing for ANN search. By situating these early models in their historical context, we aim to equip readers with a structured understanding of the principles, trade-offs, and open challenges that continue to inform current research in this area.
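To make the projection-then-quantisation pipeline concrete, the following is a minimal sketch of the data-independent baseline: random hyperplane projections thresholded at zero, followed by ranking in Hamming space. The function names, the synthetic Gaussian data, and the choice of 32 bits are assumptions made for illustration and are not taken from any method reviewed here. A learning-based method would replace the random matrix `W` with projections fitted to the data (for example, principal directions in PCAH) and might replace the single zero threshold with learned thresholds.

```python
import numpy as np

def random_hyperplane_codes(X, n_bits, seed=0):
    """Project rows of X onto random hyperplanes and threshold at zero.

    Illustrative data-independent baseline; W is drawn at random, not learned.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))  # one hyperplane normal per bit
    return (X @ W > 0).astype(np.uint8)            # quantise each projection to one bit

def hamming_distances(query_code, database_codes):
    """Count differing bits between one query code and every database code."""
    return np.count_nonzero(database_codes != query_code, axis=1)

# Synthetic data (an assumption for this sketch): 1000 points in 64 dimensions.
rng = np.random.default_rng(42)
database = rng.standard_normal((1000, 64))
query = database[0] + 0.01 * rng.standard_normal(64)  # near-duplicate of point 0

# The same seed gives the same hyperplanes for database and query.
db_codes = random_hyperplane_codes(database, n_bits=32)
q_code = random_hyperplane_codes(query[None, :], n_bits=32)[0]

dists = hamming_distances(q_code, db_codes)
ranking = np.argsort(dists, kind="stable")  # candidate neighbours, closest first
```

Because the codes are short bit vectors, the distance computation touches far less data than a comparison in the original 64-dimensional space, which is the efficiency argument made above. In this sketch the codes are stored one bit per byte for readability; a practical system would pack them into machine words and use population-count instructions.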
