Sampling Methods for Inner Product Sketching
Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang
TL;DR
This work advances inner product sketching by introducing two coordinated sampling schemes, Threshold Sampling and Priority Sampling. Both achieve strong error guarantees that match the best known bounds for Weighted MinHash, while enabling linear-time sketch construction and fixed-size sketches. By coordinating samples via shared randomness and weighting by squared magnitude, these methods yield unbiased estimators with provable variance bounds and practical speedups over linear sketches like Johnson–Lindenstrauss and CountSketch. The authors demonstrate a black-box reduction from join-correlation estimation to inner product estimation and show substantial improvements across synthetic and real-world datasets, including World Bank Finances, 20 Newsgroups, and TPC-H/Twitter benchmarks. The results indicate that threshold- and priority-based sampling improve accuracy and also substantially reduce compute time, making them attractive for large-scale data discovery and correlation-estimation tasks.
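To make the idea of coordinated, magnitude-weighted sampling concrete, here is a minimal Python sketch of a threshold-style sampler and its estimator. It is an illustration under stated assumptions, not the authors' implementation: the function names (`shared_hash`, `threshold_sketch`, `estimate_inner_product`), the `seed` parameter, the dict-based sparse vector format, and the use of BLAKE2 as the shared hash are all hypothetical choices made for this example. Index i is kept when its shared hash value is at most m * a_i^2 / ||a||^2, and each sampled product is reweighted by the inverse of its joint inclusion probability.

```python
import hashlib


def shared_hash(index, seed=0):
    """Map an index to a pseudo-uniform value in [0, 1).

    Every vector uses the same function, which is what coordinates the samples.
    (Assumption: BLAKE2 stands in for a uniformly random hash.)
    """
    digest = hashlib.blake2b(f"{seed}:{index}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def threshold_sketch(vec, m, seed=0):
    """Sketch a sparse vector given as a dict {index: value}.

    Index i is kept when shared_hash(i) <= m * v_i^2 / ||v||^2, so the
    expected number of kept entries is at most m. Returns (samples, scale)
    with scale = m / ||v||^2, which the estimator needs later.
    """
    sq_norm = sum(v * v for v in vec.values())
    if sq_norm == 0:
        return {}, 0.0
    scale = m / sq_norm
    samples = {
        i: v
        for i, v in vec.items()
        if v != 0 and shared_hash(i, seed) <= scale * v * v
    }
    return samples, scale


def estimate_inner_product(sketch_a, sketch_b):
    """Estimate <a, b> from two sketches built with the same seed.

    Index i appears in both sketches with probability
    min(1, scale_a * a_i^2, scale_b * b_i^2) because the hash is shared;
    dividing by that probability makes each term unbiased.
    """
    samples_a, scale_a = sketch_a
    samples_b, scale_b = sketch_b
    total = 0.0
    for i, a_i in samples_a.items():
        if i in samples_b:
            b_i = samples_b[i]
            p = min(1.0, scale_a * a_i * a_i, scale_b * b_i * b_i)
            total += a_i * b_i / p
    return total
```

A usage example: with `a = {0: 1.0, 1: 2.0, 2: 3.0}` and `b = {1: 4.0, 2: -1.0, 3: 5.0}`, calling `estimate_inner_product(threshold_sketch(a, m), threshold_sketch(b, m))` returns an estimate of the true inner product 5. For small `m` the estimate varies with the seed; once `m` is large enough that every entry is kept, it is exact.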
Abstract
Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods run in linear time (to compute the sketch) and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for our methods.
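The reduction from join-correlation to inner products can be illustrated with a short Python sketch. This is one natural way to write such a reduction and is an assumption of this example, not necessarily the exact construction in the paper: each column is treated as a sparse vector indexed by join key (keys assumed unique within a column), and the Pearson correlation over the joined rows is expressed through six inner products involving the value vectors, their element-wise squares, and key-indicator vectors. The names `inner` and `join_correlation`, and the `ip` parameter, are hypothetical. Here `ip` defaults to an exact inner product; any inner product estimator with the same signature could be plugged in instead, which is what makes the reduction black-box.

```python
import math


def inner(u, v):
    """Exact inner product of two sparse vectors stored as dicts {key: value}."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[k] for k, val in u.items() if k in v)


def join_correlation(col_x, col_y, ip=inner):
    """Pearson correlation over keys present in both columns.

    Only inner products are used, so `ip` can be swapped for a sketch-based
    estimator. Assumes each key appears at most once per column.
    """
    ones_x = {k: 1.0 for k in col_x}           # indicator of keys in x
    ones_y = {k: 1.0 for k in col_y}           # indicator of keys in y
    sq_x = {k: v * v for k, v in col_x.items()}
    sq_y = {k: v * v for k, v in col_y.items()}

    n = ip(ones_x, ones_y)       # number of joined rows
    sum_x = ip(col_x, ones_y)    # sum of x over joined rows
    sum_y = ip(ones_x, col_y)    # sum of y over joined rows
    sum_xy = ip(col_x, col_y)    # sum of x * y over joined rows
    sum_xx = ip(sq_x, ones_y)    # sum of x^2 over joined rows
    sum_yy = ip(ones_x, sq_y)    # sum of y^2 over joined rows

    cov = n * sum_xy - sum_x * sum_y
    var_x = n * sum_xx - sum_x ** 2
    var_y = n * sum_yy - sum_y ** 2
    if var_x <= 0 or var_y <= 0:
        return float("nan")      # correlation undefined (constant or empty join)
    return cov / math.sqrt(var_x * var_y)
```

For instance, with `col_x = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 10.0}` and `col_y = {"a": 2.0, "b": 4.0, "c": 6.0, "e": 0.0}`, the join keeps keys a, b, and c, where y is exactly twice x, so the function returns a correlation of 1 (up to floating-point rounding) without ever materializing the joined table.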
