On the adversarial robustness of Locality-Sensitive Hashing in Hamming space
Michael Kapralov, Mikhail Makarov, Christian Sohler
TL;DR
The paper analyzes the susceptibility of LSH in Hamming space to adaptive queries and demonstrates that, when an isolated data point exists, an adversary can force false negatives much more efficiently than random sampling. It introduces a collision-aware adaptive walk that gradually eliminates hash collisions to yield a query $q$ within distance $r$ of the isolated point on which the data structure returns $\bot$, and provides both simple and fast variants with provable query-complexity bounds. The work contributes a rigorous adversarial framework, lemmas bounding collision behavior, and empirical evidence across synthetic and real datasets, highlighting practical implications for deploying LSH against adaptive adversaries. The findings motivate the use of robustness-enhanced LSH defenses, such as Las Vegas constructions and differential-privacy-based approaches, to preserve reliability in sensitive settings.
Abstract
Locality-sensitive hashing [Indyk, Motwani'98] is a classical data structure for approximate nearest neighbor search. After near-linear-time preprocessing of the input dataset, it finds an approximate nearest neighbor of any fixed query in time sublinear in the dataset size. The resulting data structure is randomized and succeeds with high probability for every fixed query. In many modern applications of nearest neighbor search, the queries are chosen adaptively. In this paper, we study the robustness of locality-sensitive hashing to adaptive queries in Hamming space. We present a simple adversary that, under mild assumptions on the initial point set, provably finds a query on which the approximate near neighbor search data structure fails. Crucially, our adaptive algorithm finds the hard query exponentially faster than random sampling.
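To make the notion of a failing query concrete, the following is a minimal sketch of bit-sampling LSH in Hamming space, assuming a toy setup that is not taken from the paper: the class name `BitSamplingLSH`, the helper `hamming`, and the parameters (`dim = 64`, `r = 8`, 4 tables, 6 sampled bits per table) are all illustrative choices. The failing query is built here by reading the sampled coordinates directly, which only shows that such queries exist. The adversary studied in the paper has no such access and must locate one through adaptive queries alone.

```python
import random

def hamming(a, b):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(x != y for x, y in zip(a, b))

class BitSamplingLSH:
    """Toy bit-sampling LSH: each table hashes a point to a few sampled bits."""

    def __init__(self, points, dim, num_tables, bits_per_table, seed=0):
        rng = random.Random(seed)
        # One list of sampled coordinates per hash table.
        self.coords = [
            [rng.randrange(dim) for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = []
        for cs in self.coords:
            buckets = {}
            for p in points:
                buckets.setdefault(tuple(p[i] for i in cs), []).append(p)
            self.tables.append(buckets)

    def query(self, q, max_dist):
        """Return a colliding point within max_dist of q, or None (i.e. ⊥)."""
        for cs, buckets in zip(self.coords, self.tables):
            for p in buckets.get(tuple(q[i] for i in cs), []):
                if hamming(p, q) <= max_dist:
                    return p
        return None

dim, r = 64, 8
z = (0,) * dim                      # a single isolated data point
lsh = BitSamplingLSH([z], dim, num_tables=4, bits_per_table=6)

# White-box construction (illustration only): flip one sampled coordinate
# per table, so q collides with z in no table while staying close to z.
q = list(z)
for cs in lsh.coords:
    q[cs[0]] = 1
q = tuple(q)

print(lsh.query(z, r) == z)         # the point itself is always found
print(hamming(q, z) <= r)           # q is within distance r of z
print(lsh.query(q, r))              # yet the structure returns None
```

With only 4 tables, flipping at most 4 bits is enough to break every collision, so `q` lies within distance `r` of `z` but is reported as having no near neighbor. Realistic parameter settings use many more tables, which is why finding such a query without seeing the hash functions is the interesting problem.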
