Sampling Methods for Inner Product Sketching

Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang

TL;DR

This work advances inner product sketching by introducing two coordinated sampling schemes, Threshold Sampling and Priority Sampling, which achieve strong error guarantees—matching the best known bounds for Weighted MinHash—while enabling linear-time sketch construction and fixed-size sketches. By coordinating samples via shared randomness and weighting by squared magnitude, these methods yield unbiased estimators with provable variance bounds and practical speedups over linear sketches like Johnson–Lindenstrauss and CountSketch. The authors demonstrate a black-box reduction from join-correlation estimation to inner product estimation and show substantial improvements across synthetic and real-world datasets, including World Bank Finances, 20 Newsgroups, and TPC-H/Twitter benchmarks. The results indicate that threshold- and priority-based sampling not only improve accuracy but also substantially reduce compute time, making them attractive for large-scale data discovery and correlation-estimation tasks.

Abstract

Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods run in linear time (to compute the sketch) and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for our methods.
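The reduction from join-correlation to inner products mentioned in the abstract can be illustrated with a small example. This is a minimal sketch under stated assumptions, not the paper's construction: it assumes each table has unique join keys, represents a column as a sparse `{key: value}` dict, and uses the standard identity that every sum in the Pearson correlation over the joined rows is an inner product of indicator, value, or squared-value vectors. The names `inner` and `join_correlation` are hypothetical. Exact inner products are used here; in a sketched setting each `inner` call would be replaced by an estimate from a pair of sketches.

```python
import math


def inner(u: dict, v: dict) -> float:
    """Inner product of two sparse vectors stored as {key: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller dict
    return sum(val * v[k] for k, val in u.items() if k in v)


def join_correlation(table_a: dict, table_b: dict) -> float:
    """Pearson correlation of the two columns over keys present in both tables.

    Assumption: keys are unique within each table. Only inner products
    between vectors derived from one table at a time are used.
    """
    # Three vectors per table: key indicator, values, squared values.
    ones_a = {k: 1.0 for k in table_a}
    ones_b = {k: 1.0 for k in table_b}
    sq_a = {k: x * x for k, x in table_a.items()}
    sq_b = {k: y * y for k, y in table_b.items()}

    n = inner(ones_a, ones_b)         # number of joined rows
    sum_x = inner(table_a, ones_b)    # sum of x over joined rows
    sum_y = inner(ones_a, table_b)    # sum of y over joined rows
    sum_xy = inner(table_a, table_b)  # sum of x*y over joined rows
    sum_xx = inner(sq_a, ones_b)      # sum of x^2 over joined rows
    sum_yy = inner(ones_a, sq_b)      # sum of y^2 over joined rows

    cov = n * sum_xy - sum_x * sum_y
    var_x = n * sum_xx - sum_x ** 2
    var_y = n * sum_yy - sum_y ** 2
    if var_x * var_y <= 0:
        return float("nan")  # correlation undefined for a constant column
    return cov / math.sqrt(var_x * var_y)
```

Because the function touches the two tables only through six inner products, any inner product sketch can be plugged in without changing the rest of the computation, which is the sense in which such a reduction is black-box.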

Paper Structure

This paper contains 24 sections, 4 theorems, 21 equations, 13 figures, 2 tables, 6 algorithms.

Key Result

Theorem 1

For vectors $\mathbf{a},\mathbf{b} \in \mathbb{R}^n$ and target sketch size $m$, let $\mathcal{S}(\mathbf{a})=\{K_{\mathbf{a}}, V_{\mathbf{a}}, \tau_{\mathbf{a}}\}$ and $\mathcal{S}(\mathbf{b})=\{K_{\mathbf{b}}, V_{\mathbf{b}}, \tau_{\mathbf{b}}\}$ be sketches returned by al[...] Moreover, let $|K_\mathbf{a}|$ and $|K_\mathbf{b}|$ be the number of index/value pairs stored in [...]
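To make the sketch triple (indices $K$, values $V$, and a threshold parameter $\tau$) concrete, here is a minimal Python illustration of threshold sampling with squared-magnitude weights and shared randomness. The theorem statement above is truncated, so the details are assumptions based on the standard form of the technique: $\tau = m / \|\mathbf{a}\|_2^2$, index $i$ is kept when a shared uniform hash $h(i) \le \tau a_i^2$, and the estimator reweights each shared index by its joint inclusion probability. The helper `shared_uniform` and the function names are hypothetical.

```python
import hashlib

import numpy as np


def shared_uniform(index: int, seed: int = 0) -> float:
    """Hypothetical helper: hash an index to a value in [0, 1].

    The value depends only on (seed, index), so every vector sketched
    with the same seed sees the same randomness (coordination).
    """
    digest = hashlib.blake2b(f"{seed}:{index}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def threshold_sketch(vec, m, seed=0):
    """Return (kept, tau): kept maps index -> value, tau = m / ||vec||^2.

    Assumed rule: keep index i when h(i) <= tau * vec[i]^2, so larger
    entries are more likely to be kept. One pass over the nonzeros.
    """
    vec = np.asarray(vec, dtype=float)
    sq_norm = float(vec @ vec)
    if sq_norm == 0.0:
        return {}, 0.0
    tau = m / sq_norm
    kept = {}
    for i in np.flatnonzero(vec):
        if shared_uniform(int(i), seed) <= tau * vec[i] ** 2:
            kept[int(i)] = float(vec[i])
    return kept, tau


def estimate_inner_product(sketch_a, sketch_b):
    """Estimate <a, b> from two sketches built with the same seed."""
    (kept_a, tau_a), (kept_b, tau_b) = sketch_a, sketch_b
    total = 0.0
    for i in kept_a.keys() & kept_b.keys():
        a_i, b_i = kept_a[i], kept_b[i]
        # Probability that index i landed in both sketches.
        p = min(1.0, tau_a * a_i**2, tau_b * b_i**2)
        total += a_i * b_i / p
    return total
```

Under this rule the number of stored pairs is random with expectation at most $m$, and dividing each term by its joint inclusion probability is what makes the estimate unbiased. When $m$ is large enough that every nonzero entry is kept with probability 1, the estimate equals the exact inner product.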

Figures (13)

  • Figure 1: Sketching with Threshold Sampling.
  • Figure 2: Join-Correlation via inner product sketching.
  • Figure 3: Inner product estimation for real-valued synthetic data. The lines for PS-uniform and TS-uniform overlap, as do the lines for our PS-weighted and TS-weighted methods. As predicted by our theoretical results, PS-weighted and TS-weighted consistently outperform all other baselines.
  • Figure 4: Inner product estimation for synthetic binary data. Weighted sampling methods are excluded since they are equivalent to their unweighted counterparts for binary vectors. Our PS-uniform and TS-uniform methods outperform both linear sketches and MH for computing inner products.
  • Figure 5: Comparison of End-Biased Sampling (TS-1norm) and its Priority Sampling counterpart (PS-1norm) against our TS-weighted and PS-weighted methods.
  • ...and 8 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Corollary 2
  • Theorem 3
  • Lemma 4