Data-Dependent LSH for the Earth Mover's Distance

Rajesh Jayaram; Erik Waingarten; Tian Zhang

Abstract

We give new data-dependent locality-sensitive hashing schemes (LSH) for the Earth Mover's Distance ($\mathsf{EMD}$), and as a result, improve the best approximation for nearest neighbor search under $\mathsf{EMD}$ by a quadratic factor. Here, the metric $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ consists of sets of $s$ vectors in $\mathbb{R}^d$, and for any two sets $x,y$ of $s$ vectors the distance $\mathsf{EMD}(x,y)$ is the minimum cost of a perfect matching between $x,y$, where the cost of matching two vectors is their $\ell_p$ distance. Previously, Andoni, Indyk, and Krauthgamer gave a (data-independent) locality-sensitive hashing scheme for $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ when $p \in [1,2]$ with approximation $O(\log^2 s)$. By being data-dependent, we improve the approximation to $\tilde{O}(\log s)$. Our main technical contribution is to show that for any distribution $\mu$ supported on the metric $\mathsf{EMD}_s(\mathbb{R}^d, \ell_p)$, there exists a data-dependent LSH for dense regions of $\mu$ which achieves approximation $\tilde{O}(\log s)$, and that the data-independent LSH actually achieves an $\tilde{O}(\log s)$-approximation outside of those dense regions. Finally, we show how to "glue" together these two hashing schemes without any additional loss in the approximation. Beyond nearest neighbor search, our data-dependent LSH also gives optimal (distributional) sketches for the Earth Mover's Distance. By known sketching lower bounds, this implies that our LSH is optimal (up to $\mathrm{poly}(\log \log s)$ factors) among those that collide close points with constant probability.
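
To make the definition of $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ concrete, the following is a minimal sketch that evaluates the distance directly from its definition, by trying every perfect matching between two sets of $s$ vectors. It is an illustration only and is not the paper's algorithm: the function names `lp_distance` and `emd` are hypothetical, and the brute-force search takes time proportional to $s!$, so it is usable only for very small $s$. For larger inputs one would typically use an assignment solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations


def lp_distance(u, v, p=1.0):
    """l_p distance between two vectors of equal dimension (hypothetical helper)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def emd(x, y, p=1.0):
    """Minimum cost of a perfect matching between x and y.

    x and y are sequences of s vectors each. Every permutation of the
    indices of y is one perfect matching; we return the cheapest one.
    Brute force, for illustration only.
    """
    if len(x) != len(y):
        raise ValueError("both sets must contain the same number of vectors")
    s = len(x)
    return min(
        sum(lp_distance(x[i], y[perm[i]], p) for i in range(s))
        for perm in permutations(range(s))
    )


# Two sets of s = 2 vectors in R^2.
x = [(0.0, 0.0), (3.0, 0.0)]
y = [(3.0, 1.0), (0.0, 1.0)]
print(emd(x, y, p=1.0))  # matching (0,0)-(0,1) and (3,0)-(3,1) costs 1 + 1
```

In this example the identity matching costs $4 + 4 = 8$ under $\ell_1$, while the swapped matching costs $1 + 1 = 2$, so the distance is $2$.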

Paper Structure (44 sections, 31 theorems, 137 equations, 5 figures)

Introduction
Overview of Contributions and Techniques
Data-Dependent Probabilistic Tree Embeddings [CJLW22].
Tree Construction and Proof of Theorem \ref{thm:warm-up}.
An Improved Data-Dependent LSH for $\mathsf{EMD}$.
Step 2(a): Extensions on Chamfer Neighborhoods.
Step 2(b): Proof of Equation (\ref{eq:better-chamfer}) (in Section \ref{sec:proofofexpand}).
Other Related Work
Preliminaries
Nearest Neighbors, Embeddings, and Data-Dependent Hashing
Approximate Nearest Neighbor via Data-Dependent Hashing
Dynamic and Data-Dependent Probabilistic Tree Embeddings
Embedding for Subsets of the Hamming Cube
Data Structure for Dynamic, Data-Dependent Probabilistic Trees.
Analysis.
...and 29 more sections

Key Result

Theorem 1

For any constant $\epsilon > 0$ and $p \in [1,2]$, there is a data structure for nearest neighbor search in $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$, with approximation $\tilde{O}(\log s)$, pre-processing time $n^{1+\epsilon} \cdot \mathrm{poly}(sd)$, and query time $n^\epsilon \cdot \mathrm{poly}(sd)$.

Figures (5)

Figure 1: The Data-Dependent $\textsc{QuadTree}$ Embedding.
Figure 2: Tree Embedding $\mathbf{T}$ Sampled from $\textsc{QuadTree}(\Omega)$. The root node is $v_0$ and the tree is generated by the maps $\boldsymbol{\phi}_0, \dots, \boldsymbol{\phi}_L$. Displayed are two vectors $x, y$, which map to leaves of the tree, together with the path between them, whose lowest common ancestor is $v_{\ell}(x) = v_{\ell}(y)$. The distance $d_{\mathbf{T}}(x,y)$ is given by the sum of weights along the path from $x$ to $v_{\ell}(x) = v_{\ell}(y)$, and then back to $y$.
Figure 3: The $\textsc{SampleTree}$ Algorithm.
Figure 4: The $\textsc{Core-Preprocess}$ Algorithm.
Figure 5: The $\textsc{Core-Query}$ Algorithm.
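
The tree distance described in the caption of Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's construction: each point is represented by its root-to-leaf list of node identifiers (standing in for the nodes $v_0, v_1(x), \dots, v_L(x)$), and the edge weights are supplied in a hypothetical dictionary `edge_weight` keyed by `(depth, node)`. How the maps $\boldsymbol{\phi}_0, \dots, \boldsymbol{\phi}_L$ and the weights are actually chosen is specified in the paper and is not reproduced here.

```python
def tree_distance(path_x, path_y, edge_weight):
    """Sum of edge weights from x up to the lowest common ancestor and down to y.

    path_x[l] is the node containing x at depth l, with path_x[0] the root.
    edge_weight[(l, node)] is the weight of the edge joining `node` at
    depth l to its parent at depth l - 1 (an assumed representation).
    """
    if len(path_x) != len(path_y):
        raise ValueError("both paths must have the same depth")
    if path_x[0] != path_y[0]:
        raise ValueError("both paths must start at the same root")

    # Walk down from the root while the two paths still share a node.
    lca_depth = 0
    while lca_depth + 1 < len(path_x) and path_x[lca_depth + 1] == path_y[lca_depth + 1]:
        lca_depth += 1

    # Only the edges strictly below the lowest common ancestor contribute.
    total = 0.0
    for depth in range(lca_depth + 1, len(path_x)):
        total += edge_weight[(depth, path_x[depth])]
        total += edge_weight[(depth, path_y[depth])]
    return total


# A depth-2 tree: x and y share the node "a" at depth 1 and split at depth 2.
weights = {(1, "a"): 4.0, (2, "a1"): 2.0, (2, "a2"): 3.0}
path_x = ["v0", "a", "a1"]
path_y = ["v0", "a", "a2"]
print(tree_distance(path_x, path_y, weights))  # edges a-a1 and a-a2: 2.0 + 3.0
```

The edge above the shared node `"a"` does not contribute, since the path from $x$ to $y$ turns around at the lowest common ancestor.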

Theorems & Definitions (55)

Theorem 1: Main Result---Informal version of Theorem \ref{thm:ann-main}
Theorem 2: Dynamic and Data-Dependent Probabilistic Tree Embedding
Remark 3: Using Classical Probabilistic Tree Embeddings
Theorem 4: Data-Dependent Hashing for $\mathsf{EMD}$ (Theorem \ref{thm:data-dep-hashing} + Lemma \ref{lem:reduction-to-hypercube})
Remark 5: ANN for EMD with small $s,d$
Remark 6: On Embedding $\ell_p$ into $\ell_1$
Definition 3.1: Approximate Near Neighbor
Definition 3.2: Data-Dependent Hashing
Definition 3.3: Data Structure for Data-Dependent Hashing
Theorem 7: Data-Dependent Hashing to Approximate Near Neighbors
...and 45 more
