A smoothed-Bayesian approach to frequency recovery from sketched data

Mario Beraha, Stefano Favaro, Matteo Sesia

TL;DR

The paper tackles frequency recovery from sketched data with a smoothed-Bayesian framework that scales better than traditional Bayesian nonparametric (BNP) methods. By modeling the data distribution with normalized random measures (notably Dirichlet processes and normalized generalized Gamma processes), the authors derive tractable, unbiased estimators of the query frequency that are linear in the bucket counts and extend efficiently to multiple hash functions via multi-view aggregation. They justify the approach theoretically through minimax considerations and optimality within a class of linear estimators, report performance gains over count-min sketch (CMS) and BNP baselines on synthetic and real datasets, and use conformal inference to obtain calibrated uncertainty intervals. The approach balances flexible tail modeling with computational efficiency, enabling scalable frequency recovery, and potentially cardinality recovery, in large-scale sketching applications.

Abstract

We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives.
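
To make the setting concrete, the following is a minimal sketch of how a dataset can be compressed into bucket counts via random hashing, together with the classical count-min sketch (CMS) estimate that the paper uses as a baseline. It does not implement the smoothed-Bayesian estimator itself. The helper names (`bucket`, `build_sketch`, `cms_estimate`), the use of a salted BLAKE2b hash, and the toy data are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def bucket(symbol: str, seed: int, width: int) -> int:
    """Map a symbol to one of `width` buckets using a seeded hash (illustrative choice)."""
    digest = hashlib.blake2b(
        symbol.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") % width


def build_sketch(data, num_hashes: int, width: int):
    """Return a num_hashes x width table of bucket counts."""
    table = [[0] * width for _ in range(num_hashes)]
    for x in data:
        for m in range(num_hashes):
            table[m][bucket(x, m, width)] += 1
    return table


def cms_estimate(table, symbol: str) -> int:
    """Count-min estimate: the smallest bucket count across the hash functions."""
    width = len(table[0])
    return min(row[bucket(symbol, m, width)] for m, row in enumerate(table))


# Toy data (hypothetical): two frequent symbols and many rare ones.
data = ["a"] * 50 + ["b"] * 20 + [f"rare{i}" for i in range(200)]
sketch = build_sketch(data, num_hashes=3, width=32)
true_counts = Counter(data)

print("true:", true_counts["a"], "CMS estimate:", cms_estimate(sketch, "a"))
```

Because every symbol that collides with the query adds to the same bucket, the CMS estimate can only overestimate the true frequency. Model-based estimators such as those in the paper aim to correct for this by using assumptions about the data distribution.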

Paper Structure

This paper contains 69 sections, 11 theorems, 135 equations, 9 figures, 7 tables.

Key Result

Theorem 1

For $n\geq1$, suppose $(x_{1},\ldots,x_{n})$ is a random sample $\mathbf{X}_{n}$ from \eqref{eq:exchangeable_model_hash} with corresponding sketch $\mathbf{C}_{J}=\mathbf{c}$, obtained using a fixed hash function $h$. If $\mathbb{S}_{j}:=\{s\in\mathbb{S}\,:\,h(s)=j\}$ and $q_{j}:=\mathrm{Pr}[h(X_{i})=j\ldots$

Figures (9)

  • Figure 1: Visualization of the modelling flexibility of the NGGP. The probabilities $\pi_j(r)$ are plotted as a function of $r$ for $c_j = 250$, for different values of the NGGP smoothing parameters. Different panels focus on different values of $\theta$, while the curves drawn in different colors correspond to alternative values of $\alpha$. In all cases, $\tau = 1$.
  • Figure 2: MAEs for the frequency estimators with DP and NGG smoothing, in experiments on synthetic data from a Pitman-Yor process with parameters $\gamma$ (varied along the $x$-axis) and $\sigma=0.75$ (see Section \ref{sec:ex_singlehash}). Different plots correspond to different frequency bins.
  • Figure 3: MAEs for the frequency estimators in Section \ref{sec:ex_multihash}, stratified by true frequency bins. Top row: data generated from a Pitman-Yor process with parameters $(100, 0.25)$. Bottom row: data generated from a Pitman-Yor process with parameters $(100, 0.75)$. In the bottom row, the PoE-NGG and MIN-NGG lines overlap.
  • Figure 4: Calibration and average lengths of the intervals derived from the product-of-experts and min aggregation rules using NGGP smoothing, for two different data-generating processes, in experiments involving multiple independent hash functions. The results are shown as a function of the hash width, with the total memory budget held fixed.
  • Figure 5: MAEs for the frequency estimators on the Covid-DNA sequences (top row) and Gutenberg corpus bigrams (bottom row), stratified by true frequency bins. MAEs of the CMS estimator are not reported for the Covid-DNA sequences as they are much higher.
  • ...and 4 more figures
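
Several of the figures above report mean absolute errors (MAEs) stratified by true frequency bins. A minimal sketch of that kind of summary is given below. The function name `stratified_mae`, the bin edges, and the toy numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def stratified_mae(true_freqs, estimates, bin_edges):
    """Mean absolute error per true-frequency bin.

    `bin_edges` are assumed upper bounds in increasing order: [1, 4, 16]
    defines the bins (0, 1], (1, 4], (4, 16] and (16, inf), indexed 0..3.
    """
    errors = defaultdict(list)
    for f, est in zip(true_freqs, estimates):
        # Bin index = number of edges strictly below the true frequency.
        idx = sum(f > edge for edge in bin_edges)
        errors[idx].append(abs(est - f))
    return {idx: sum(v) / len(v) for idx, v in sorted(errors.items())}


# Toy example with made-up frequencies and estimates.
result = stratified_mae([1, 2, 20], [2, 2, 25], bin_edges=[1, 4, 16])
print(result)  # {0: 1.0, 1: 0.0, 3: 5.0}
```

Stratifying by true frequency keeps the many low-frequency symbols from dominating the error summary, which is why results for rare and common symbols are shown in separate panels.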

Theorems & Definitions (13)

  • Theorem 1
  • Theorem 2: Informal statement
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Lemma 1
  • proof
  • Proposition 6
  • Theorem 7
  • proof
  • ...and 3 more