Unmasking Vulnerabilities: Cardinality Sketches under Adaptive Inputs
Sara Ahmadian, Edith Cohen
TL;DR
This work studies cardinality sketches under adaptive inputs and reveals fundamental vulnerabilities: a simple single-batch attack can bias standard estimators with $r=O(k)$ queries, and any correct estimator on common sketch designs can be compromised with $r=\tilde{O}(k^2)$ adaptive queries. The authors demonstrate these results theoretically using rank-domain analyses and probabilistic bounds, and validate them empirically by attacking HyperLogLog++ with as few as $4k$ queries, achieving substantial misestimation. They further develop a general attack framework against strategic QR algorithms that requires multiple batches, showing the vulnerability extends beyond the standard estimators. The findings highlight that robustness guarantees for composable cardinality sketches under adaptive workloads are fundamentally limited, motivating defenses and further exploration across broader sketch families.
Abstract
Cardinality sketches are popular data structures that enhance the efficiency of working with large data sets. The sketches are randomized representations of sets that have only logarithmic size yet support set merges and approximate cardinality (i.e., distinct count) queries. When queries are not adaptive, that is, they do not depend on preceding query responses, the design provides strong guarantees of correctly answering a number of queries exponential in the sketch size $k$. In this work, we investigate the performance of cardinality sketches in adaptive settings and unveil inherent vulnerabilities. We design an attack against the ``standard'' estimators that constructs an adversarial input by post-processing responses to a set of simple non-adaptive queries of size linear in the sketch size $k$. Empirically, our attack used only $4k$ queries with the widely used HyperLogLog (HLL++)~\citep{hyperloglog:2007,hyperloglogpractice:EDBT2013} sketch. The simplicity of the attack technique suggests that it can be effective with post-processed natural workloads. Finally and importantly, we demonstrate that the vulnerability is inherent, as \emph{any} estimator applied to known sketch structures can be attacked using a number of queries that is quadratic in $k$, matching a generic upper bound.
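To make the notions of a composable sketch, a merge, and a "standard" estimator concrete, the following is a minimal Python sketch of a bottom-$k$ MinHash cardinality sketch. It is an illustration under stated assumptions, not the HLL++ implementation attacked in this work: the class name `BottomKSketch`, the helpers `hash01` and `sketch_of`, the SHA-256-based hash, and the seed string are all hypothetical choices made for this example. The sketch keeps the $k$ smallest hash values of the set, and the estimate $(k-1)/x_{(k)}$, where $x_{(k)}$ is the $k$-th smallest hash value, is a common textbook estimator for this design.

```python
import hashlib


def hash01(key, seed):
    """Map a key to a pseudo-random value in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class BottomKSketch:
    """Hypothetical bottom-k sketch: stores the k smallest hash values of a set."""

    def __init__(self, k, seed="demo-seed"):
        self.k = k
        self.seed = seed
        self.values = []  # sorted list of at most k distinct hash values

    def _absorb(self, hashes):
        # Keep only the k smallest distinct hash values seen so far.
        self.values = sorted(set(self.values) | set(hashes))[: self.k]

    def add(self, key):
        self._absorb([hash01(key, self.seed)])

    def merge(self, other):
        # Composability: the sketch of a union is computed from the two sketches.
        assert (self.k, self.seed) == (other.k, other.seed)
        merged = BottomKSketch(self.k, self.seed)
        merged._absorb(self.values + other.values)
        return merged

    def estimate(self):
        # Fewer than k values stored: the sketch holds the whole set, count exactly.
        if len(self.values) < self.k:
            return float(len(self.values))
        # Otherwise use the k-th smallest hash value as the statistic.
        return (self.k - 1) / self.values[-1]


def sketch_of(keys, k, seed="demo-seed"):
    """Build a sketch from an iterable of keys in one pass."""
    s = BottomKSketch(k, seed)
    s._absorb([hash01(key, seed) for key in keys])
    return s


if __name__ == "__main__":
    a = sketch_of(range(0, 6000), k=256)
    b = sketch_of(range(4000, 10000), k=256)
    print(a.merge(b).estimate())  # estimate of |A union B|, true value 10000
```

Because the estimate depends on the input only through which keys have small hash values, query responses leak information about the hash function. An adaptive party can use that information to choose later inputs, which is the setting studied in this work.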
