Fast Similarity Sketching

Søren Dahlgaard; Mathias Bæk Tejs Langhede; Jakob Bæk Tejs Houen; Mikkel Thorup

TL;DR

This work addresses efficient, strongly concentrated sketching of set similarity under the Jaccard measure $J(A,B)$. It introduces a novel sketch that blends sampling with replacement and sampling without replacement, giving unbiased estimates with Chernoff-type concentration and near-linear construction time $O(|A|+t\log t)$, while maintaining a useful alignment property. The authors integrate the sketch into an enhanced LSH framework that reaches near-optimal space and query-time trade-offs through tensoring and careful filtering, with a practical implementation based on mixed tabulation hashing. The result is a scalable approach for large-scale similarity search and learning tasks that require accurate, fast estimation of set similarity with strong probabilistic guarantees.

Abstract

We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) = |A\cap B|/|A\cup B|$ between sets $A$ and $B$ is preserved. More precisely, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in [t]} X_i$. We want $E[X_i]=J(A,B)$, and we want $X$ to be strongly concentrated around $E[X] = t \cdot J(A,B)$ (i.e., Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called $\textit{sketches}$. Strong concentration is critical, for often we want to sketch many sets $B_1,\ldots,B_n$ so that we later, for a query set $A$, can find (one of) the most similar $B_i$. It is then critical that no $B_i$ looks much more similar to $A$ due to errors in the sketch. The seminal $t\times\textit{MinHash}$ algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left ( \min_{a\in A} h_1(a),\ldots, \min_{a\in A} h_t(a) \right )$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued...)
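To make the baseline concrete, the following is a minimal Python sketch of the classic $t\times\textit{MinHash}$ scheme and the estimator $X/t$ described above. It is not the paper's fast sketch. The function names (`h`, `minhash_sketch`, `estimate_jaccard`) are hypothetical, and salted BLAKE2b is an assumption standing in for the $t$ random hash functions $h_1,\ldots,h_t$; elements are assumed to be non-negative integers below $2^{64}$ and sets are assumed non-empty.

```python
import hashlib


def h(i, a):
    """Stand-in for the i-th random hash function h_i applied to element a.

    Assumption: salted BLAKE2b approximates a truly random hash function.
    """
    salt = i.to_bytes(8, "little")
    digest = hashlib.blake2b(
        a.to_bytes(8, "little"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(A, t):
    """Return (min_{a in A} h_1(a), ..., min_{a in A} h_t(a)).

    Every element is hashed t times, which is the O(t * |A|) cost
    that the abstract names as the main drawback of MinHash.
    A must be non-empty.
    """
    return [min(h(i, a) for a in A) for i in range(t)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate J(A, B) as X / t, where X counts matching coordinates."""
    t = len(sketch_a)
    X = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return X / t


if __name__ == "__main__":
    A = set(range(0, 300))
    B = set(range(100, 400))  # |A ∩ B| = 200, |A ∪ B| = 400, so J(A, B) = 0.5
    t = 256
    print(estimate_jaccard(minhash_sketch(A, t), minhash_sketch(B, t)))
```

With these sets the true similarity is $J(A,B) = 0.5$, and the printed estimate should land close to it, with the deviation shrinking as $t$ grows. The nested loop over $t$ hash functions and $|A|$ elements is exactly the cost that a faster sketch aims to avoid.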
