Fast Similarity Sketching

Søren Dahlgaard, Mathias Bæk Tejs Langhede, Jakob Bæk Tejs Houen, Mikkel Thorup

TL;DR

This work addresses efficient, strongly concentrated similarity sketching for set similarity under the Jaccard measure $J(A,B)$. It introduces a novel sketch that blends sampling-with- and sampling-without-replacement, achieving unbiased estimates with Chernoff-type concentration and near-linear construction time $O(|A|+t\log t)$, while maintaining a useful alignment property. The authors integrate the sketch into an enhanced LSH framework, achieving near-optimal space and query-time trade-offs through tensoring and careful filtering, with a practical implementation based on mixed tabulation hashing. The result is a scalable approach for large-scale similarity search and learning tasks that require accurate, fast estimation of set similarity with strong probabilistic guarantees.

Abstract

We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) = |A\cap B|/|A\cup B|$ between sets $A$ and $B$ is preserved. More precisely, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in [t]} X_i$. We want $E[X_i]=J(A,B)$, and we want $X$ to be strongly concentrated around $E[X] = t \cdot J(A,B)$ (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called $\textit{sketches}$. Strong concentration is critical, for often we want to sketch many sets $B_1,\ldots,B_n$ so that we later, for a query set $A$, can find (one of) the most similar $B_i$. It is then critical that no $B_i$ looks much more similar to $A$ due to errors in the sketch. The seminal $t\times\textit{MinHash}$ algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left ( \min_{a\in A} h_1(a),\ldots, \min_{a\in A} h_t(a) \right )$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued...)
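
As a rough illustration of the $t\times$MinHash baseline and the estimator $X/t$ described in the abstract, the following Python sketch may help. It is not the paper's new algorithm. It assumes a small universe so that explicit random permutations can stand in for the hash functions $h_1,\ldots,h_t$, and the function names (`random_permutations`, `minhash_sketch`, `estimate_jaccard`, `jaccard`) are hypothetical.

```python
import random

def random_permutations(u, t, seed=0):
    """Draw t independent random permutations of [u] = {0, ..., u-1}.

    Assumption: explicit permutations stand in for the random hash
    functions h_1, ..., h_t; this is only practical for a small universe.
    """
    rng = random.Random(seed)
    perms = []
    for _ in range(t):
        p = list(range(u))
        rng.shuffle(p)
        perms.append(p)
    return perms

def minhash_sketch(A, perms):
    """t x MinHash: coordinate i is the minimum of h_i(a) over a in A.

    Every element is hashed by every function, hence the O(t * |A|) cost.
    """
    return [min(h[a] for a in A) for h in perms]

def estimate_jaccard(sketch_a, sketch_b):
    """Return X / t, where X counts coordinates on which the sketches agree."""
    x = sum(1 for sa, sb in zip(sketch_a, sketch_b) if sa == sb)
    return x / len(sketch_a)

def jaccard(A, B):
    """Exact Jaccard similarity |A n B| / |A u B|."""
    return len(A & B) / len(A | B)

if __name__ == "__main__":
    A = set(range(0, 60))
    B = set(range(30, 90))
    perms = random_permutations(u=100, t=256, seed=42)
    estimate = estimate_jaccard(minhash_sketch(A, perms), minhash_sketch(B, perms))
    print("exact:", jaccard(A, B), "estimate:", estimate)
```

Each coordinate agrees with probability $J(A,B)$, so the estimate tightens as $t$ grows, but the nested loop over all $t$ functions and all elements is exactly the $O(t\cdot |A|)$ cost that the abstract identifies as the drawback.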

Paper Structure

This paper contains 10 sections, 13 theorems, 72 equations, 2 figures, 1 algorithm.

Key Result

Theorem 1

This powerful theorem was not in our original conference version DahlgaardKT17. Let $[u] = \left \{ 0,1,2,\ldots,u-1 \right \}$ be a set of keys and let $t$ be a positive integer. There exists an algorithm that given a set $A \subseteq [u]$ in expected time $O\left(\left | A \right |+t \log t\right)$ (continued...)

Figures (2)

  • Figure 1: Experimental evaluation of similarity estimation of the sets $A = \{1,2\}$ and $B = \{2,3\}$ with different similarity sketches and $t=16$. Each experiment is repeated 2000 times and the $y$-axis reports the frequency of each estimate. The green line indicates the actual similarity. The two methods based on OPH perform poorly because each set has probability $1/t$ of producing a single-element sketch. Our new method outperforms $t\times$MinHash because it incorporates an element of sampling "without replacement".
  • Figure 2: The intermediate sketch $S(A)$ is first partitioned into $2M$ segments, which correspond to the $2M$ subexperiments. Each of these segments is then partitioned further into $K$ blocks of size $S$, which correspond to the $K$ entries in the $L$ sketches in each of the subexperiments.
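
The $1/t$ probability quoted in the Figure 1 caption can be checked with a small simulation. This is only a sketch under a simplifying assumption: OPH is modeled as hashing each element independently to a uniformly random bin among $t$, so a two-element set yields a single-element sketch exactly when both elements land in the same bin. The function name `same_bin_frequency` is hypothetical and does not come from the paper.

```python
import random

def same_bin_frequency(t, trials, seed=0):
    """Estimate how often the two elements of a 2-element set share a bin.

    Assumption: each element is assigned independently to one of t bins
    chosen uniformly at random, a simplified model of OPH bin assignment.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        bin_first = rng.randrange(t)
        bin_second = rng.randrange(t)
        if bin_first == bin_second:
            hits += 1  # the set collapses into a single non-empty bin
    return hits / trials

if __name__ == "__main__":
    t = 16
    print("simulated:", same_bin_frequency(t, trials=20000), "expected:", 1 / t)
```

Under this model the simulated frequency should sit close to $1/t$, which is $0.0625$ for $t=16$ as used in Figure 1.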

Theorems & Definitions (25)

  • Theorem 1
  • Corollary 1
  • Definition 1: Locality sensitive hashing IndykM98
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 15 more