On the adversarial robustness of Locality-Sensitive Hashing in Hamming space

Michael Kapralov, Mikhail Makarov, Christian Sohler

TL;DR

The paper analyzes how susceptible LSH in Hamming space is to adaptive queries and shows that, when the dataset contains an isolated point, an adversary can force false negatives far more efficiently than random sampling can. It introduces a collision-aware adaptive walk that gradually eliminates hash collisions until it reaches a query $q$ at distance at most $r$ from the isolated point for which the data structure returns $\bot$, and it gives both a simple and a fast variant with provable query-complexity bounds. The work contributes a rigorous adversarial framework, lemmas bounding collision behavior, and empirical evidence on synthetic and real datasets, with practical implications for deploying LSH against adaptive adversaries. The findings motivate robustness-enhanced LSH defenses, such as Las Vegas constructions and differential-privacy-based approaches, to preserve reliability in sensitive settings.

Abstract

Locality-sensitive hashing [Indyk, Motwani '98] is a classical data structure for approximate nearest neighbor search. After preprocessing the input dataset in close to linear time, it finds an approximate nearest neighbor of any fixed query in time sublinear in the dataset size. The resulting data structure is randomized and succeeds with high probability for every fixed query. In many modern applications of nearest neighbor search, the queries are chosen adaptively. In this paper, we study the robustness of locality-sensitive hashing to adaptive queries in Hamming space. We present a simple adversary that, under mild assumptions on the initial point set, provably finds a query on which the approximate near neighbor search data structure fails. Crucially, our adaptive algorithm finds the hard query exponentially faster than random sampling.
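To make the setting concrete, here is a minimal sketch of bit-sampling LSH for binary vectors, the standard hash family for Hamming space, together with a check for a false negative (a query that has a point within distance $r$ but gets $\bot$, modeled here as `None`). The class name `BitSamplingLSH`, the helper `is_false_negative`, and the parameters `k`, `num_tables`, and `seed` are illustrative assumptions, not the paper's code or parameter choices.

```python
import random


def hamming(a, b):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingLSH:
    """Toy bit-sampling LSH over binary vectors (illustrative only)."""

    def __init__(self, points, dim, k, num_tables, seed=0):
        rng = random.Random(seed)
        # Each hash function projects a point onto k randomly chosen coordinates.
        self.projections = [
            tuple(rng.randrange(dim) for _ in range(k))
            for _ in range(num_tables)
        ]
        self.tables = []
        for coords in self.projections:
            table = {}
            for p in points:
                table.setdefault(self._key(p, coords), []).append(p)
            self.tables.append(table)

    @staticmethod
    def _key(point, coords):
        return tuple(point[i] for i in coords)

    def query(self, q, max_dist):
        """Return a colliding point within max_dist of q, or None (the bottom symbol)."""
        for coords, table in zip(self.projections, self.tables):
            for p in table.get(self._key(q, coords), []):
                if hamming(p, q) <= max_dist:
                    return p
        return None


def is_false_negative(lsh, q, points, r, c):
    """True if some point lies within distance r of q, yet the query returns None."""
    has_near_point = any(hamming(p, q) <= r for p in points)
    return has_near_point and lsh.query(q, c * r) is None
```

As a usage illustration, someone who can read `lsh.projections` can build a false negative by flipping one sampled coordinate per table, since $q$ then collides with $z$ in no table. This is a white-box shortcut used only to show what a false negative looks like; the adversary studied in the paper has only query access to the data structure.

```python
z = tuple([0] * 32)
lsh = BitSamplingLSH([z], dim=32, k=4, num_tables=3)
q = list(z)
for coords in lsh.projections:
    q[coords[0]] = 1  # break the collision with z in this table
q = tuple(q)
print(hamming(q, z), is_false_negative(lsh, q, [z], r=3, c=2))
```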

Paper Structure

This paper contains 20 sections, 16 theorems, 17 equations, 7 figures, 2 algorithms.

Key Result

Theorem 1.1

Given an 'isolated point' $z$ in a dataset $P$, one can find a point $q$ at a distance at most $r$ from $z$ such that querying LSH with $q$ returns no point. The number of queries to the LSH data structure needed to generate the point $q$ is bounded by $O(\log (cr) \cdot \log(1/\delta))$, where $\de

Figures (7)

  • Figure 1: Dependence of success probability on various parameters. All experiments are done on the Random dataset, with the third one also featuring the Zero dataset.
  • Figure 2: Dependence of the success probability on the value of $t$ and on the value of desired distance of false negative query from the origin.
  • Figure 3: Desired distance of the query not hashing with the origin point from the origin point $z$ for different values of $\lambda$. The experiments are conducted on the Zero dataset for the default parameter setting. The distance equal to $r$ is denoted by a dashed vertical line.
  • Figure 4: The number of queries as a function of the distance of $q$ from origin point $z$ on all tested datasets.
  • Figure 5: Mean number of queries necessary to find a false negative depending on parameter $\lambda$ required by our adaptive adversary and random sampling of query points.
  • ...and 2 more figures

Theorems & Definitions (31)

  • Theorem 1.1: Informal version of Theorem \ref{thm:fast}
  • Definition 3.1: [Indyk, Motwani '98]
  • Theorem 3.3: [Indyk, Motwani '98]
  • Definition 3.4: False negative
  • Lemma 4.0: Bounds on $k$
  • Lemma 4.0: Support lower bound
  • Lemma 4.0
  • Proof: Proof outline
  • Lemma 4.1
  • Proof
  • ...and 21 more