A smoothed-Bayesian approach to frequency recovery from sketched data
Mario Beraha, Stefano Favaro, Matteo Sesia
TL;DR
The paper addresses frequency recovery from sketched data by developing a smoothed-Bayesian framework that remains scalable where traditional BNP methods do not. Modeling the data distribution with normalized random measures (notably Dirichlet processes and normalized generalized Gamma processes), the authors derive tractable, unbiased estimators of the query frequency that are linear in the bucket counts and extend efficiently to multiple hash functions via multi-view aggregation. They justify the approach theoretically through minimax considerations and optimality within a class of linear estimators, report performance gains over the count-min sketch (CMS) and BNP baselines on synthetic and real datasets, and use conformal inference to obtain calibrated uncertainty intervals. The approach balances flexible tail modeling with computational efficiency, enabling scalable frequency recovery, and potentially cardinality recovery, in large-scale sketching applications.
Abstract
We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives.
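To make the setting concrete, the following is a minimal Python sketch of the classical count-min sketch (CMS), the algorithmic baseline the abstract contrasts with, and not the smoothed-Bayesian estimator proposed in the paper. The class name, the affine hash family, the table sizes, and the toy heavy-tailed data are all assumptions chosen for illustration. The sketch stores only bucket counts; a query reads the bucket that the symbol falls into under each hash function and returns the minimum, which can never fall below the true frequency.

```python
import random
from collections import Counter


class CountMinSketch:
    """Illustrative count-min sketch: `depth` hash functions, `width` buckets each."""

    _PRIME = (1 << 61) - 1  # large prime used by the assumed affine hash family

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One (a, b) pair per hash function: h(x) = ((a * x + b) mod p) mod width
        self.params = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, symbol):
        a, b = self.params[row]
        return ((a * symbol + b) % self._PRIME) % self.width

    def add(self, symbol):
        # Each observation increments one bucket per hash function.
        for row in range(self.depth):
            self.table[row][self._bucket(row, symbol)] += 1

    def bucket_counts(self, symbol):
        # The only information about `symbol` that survives sketching.
        return [
            self.table[row][self._bucket(row, symbol)]
            for row in range(self.depth)
        ]

    def query(self, symbol):
        # CMS estimate: the smallest bucket count, an upper bound on the truth.
        return min(self.bucket_counts(symbol))


# Toy heavy-tailed data over integer symbols (an assumption for illustration).
rng = random.Random(1)
data = [int(rng.paretovariate(1.2)) for _ in range(10_000)]
true_counts = Counter(data)

cms = CountMinSketch(width=64, depth=3)
for x in data:
    cms.add(x)

for symbol in (1, 2, 50):
    print(symbol, true_counts[symbol], cms.query(symbol), cms.bucket_counts(symbol))
```

Because every colliding symbol adds to the same bucket, the CMS estimate is biased upward, most visibly for rare symbols. Estimators that exploit assumptions about the data distribution, such as those discussed in the abstract, take the same bucket counts as input and aim to correct for that collision mass.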
