Unmasking Vulnerabilities: Cardinality Sketches under Adaptive Inputs

Sara Ahmadian; Edith Cohen

TL;DR

This work studies cardinality sketches under adaptive inputs and reveals fundamental vulnerabilities: a simple single-batch attack can bias standard estimators with $r=O(k)$ queries, and any correct estimator on common sketch designs can be compromised with $r=\tilde{O}(k^2)$ adaptive queries. The authors demonstrate these results theoretically using rank-domain analyses and probabilistic bounds, and validate them empirically by attacking HyperLogLog++ with as few as $4k$ queries, achieving substantial misestimation. They further develop a general attack framework against strategic QR algorithms that requires multiple batches, showing the vulnerability extends beyond the standard estimators. The findings highlight that robustness guarantees for composable cardinality sketches under adaptive workloads are fundamentally limited, motivating defenses and further exploration across broader sketch families.

Abstract

Cardinality sketches are popular data structures that enhance the efficiency of working with large data sets. The sketches are randomized representations of sets that are only of logarithmic size but can support set merges and approximate cardinality (i.e., distinct count) queries. When queries are not adaptive, that is, they do not depend on preceding query responses, the design provides strong guarantees of correctly answering a number of queries exponential in the sketch size $k$. In this work, we investigate the performance of cardinality sketches in adaptive settings and unveil inherent vulnerabilities. We design an attack against the "standard" estimators that constructs an adversarial input by post-processing responses to a set of simple non-adaptive queries of size linear in the sketch size $k$. Empirically, our attack used only $4k$ queries with the widely used HyperLogLog (HLL++) sketch (Flajolet et al., 2007; Heule et al., 2013). The simple attack technique suggests it can be effective with post-processed natural workloads. Finally and importantly, we demonstrate that the vulnerability is inherent, as any estimator applied to known sketch structures can be attacked using a number of queries that is quadratic in $k$, matching a generic upper bound.
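
To make the objects in the abstract concrete, here is a minimal sketch of a composable bottom-$k$ cardinality sketch, one of the MinHash designs listed in the paper outline. It is an illustration under assumptions, not the authors' code: the function names `h`, `sketch`, `merge`, and `estimate` are hypothetical, SHA-256 stands in for a random hash function, and the estimator $(k-1)/\tau_k$ (with $\tau_k$ the $k$-th smallest hash value) is the commonly used bottom-$k$ estimator rather than one quoted from this page.

```python
import hashlib

def h(key, seed=0):
    """Map a key to a pseudo-random value in (0, 1] (stand-in for a random hash)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64

def sketch(keys, k, seed=0):
    """Bottom-k sketch: the k smallest distinct hash values of the set, sorted."""
    return sorted({h(x, seed) for x in keys})[:k]

def merge(s1, s2, k):
    """Sketch of the union of two sets, computed from their sketches alone."""
    return sorted(set(s1) | set(s2))[:k]

def estimate(s, k):
    """Cardinality estimate: exact if fewer than k values, else (k - 1) / k-th smallest."""
    if len(s) < k:
        return float(len(s))
    return (k - 1) / s[-1]
```

Merging works because the $k$ smallest hash values of a union are always among the $k$ smallest values of each part, so `merge(sketch(A, k), sketch(B, k), k)` equals `sketch(A | B, k)`. The accuracy guarantee relies on the queried sets being chosen independently of the hash values, which is exactly the assumption that adaptive inputs break.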

Paper Structure

This paper contains 32 sections, 18 theorems, 63 equations, and 3 figures.

Introduction
Related Work
Preliminaries
Composable Cardinality Sketches
MinHash sketches
Domain sampling
Specifying keys for the sketch
Attack on the "standard" estimators
Analysis Highlights
Experimental Evaluation
Experiment setup
Efficacy with a varying number of queries
Efficacy of the attack with varying sketch sizes
Attack Setup Against Strategic Estimators
Attack Framework
...and 17 more sections

Key Result

Theorem 4.1

Consider Algorithm \ref{standardattack:algo} with $k$-mins or bottom-$k$ sketches and $T(S)$ being the inverse of the cardinality estimate as specified in Section \ref{sketches:sec}. For $\alpha > 0$, set the parameters $n=\Omega\left(\frac{1}{\alpha} k \log(kr)\right)$ and $r= O\left(\frac{k}{\alpha^2} \right)$. Then wit

Figures (3)

Figure 1: Attack on the HLL++ sketch and estimator, for varying number of queries. Cardinality estimates for the prefix of keys with lowest average score after $r=4^i$ queries.
Figure 2: Attack on the HLL++ sketch and estimator, for varying number of queries. Cardinality estimates for the prefix of keys with largest average score after $r=4^i$ queries.
Figure 3: Attack on HLL++ for varying sketch sizes while utilizing queries of size 4 times the sketch size.
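
The single-batch attack that these figures evaluate can be illustrated with a toy simulation. The sketch below is a simplification under stated assumptions, not the paper's algorithm or experiment: it targets a bottom-$k$ sketch with the estimator $(k-1)/\tau_k$ instead of HLL++, includes each key in a query with probability $1/2$ (an assumed rate), scores each key by the average of $T(S) = 1/\text{estimate}$ over the queries that contain it, and takes the prefix of keys with the lowest average score as the adversarial input. The name `single_batch_attack` and all parameter values are hypothetical, and the query count is chosen generously so that the toy run is stable.

```python
import random

def estimate(hashes, k):
    """Bottom-k estimate (k - 1) / (k-th smallest hash); exact count below k keys."""
    if len(hashes) < k:
        return float(len(hashes))
    return (k - 1) / sorted(hashes)[k - 1]

def single_batch_attack(n=400, k=16, r=3000, m=40, seed=1):
    rng = random.Random(seed)
    # Fixed sketch randomness: one hash value per key, hidden from the attacker.
    h = [rng.random() for _ in range(n)]
    total = [0.0] * n  # sum of T(S) over the queries that contain the key
    count = [0] * n    # number of queries that contain the key
    for _ in range(r):
        # Non-adaptive query: each key joins S independently (assumed rate 1/2).
        S = [x for x in range(n) if rng.random() < 0.5]
        # The attacker observes only the estimate, never the hash values.
        T = 1.0 / estimate([h[x] for x in S], k)
        for x in S:
            total[x] += T
            count[x] += 1
    # Post-processing: rank keys by average score, lowest first.
    avg = [total[x] / count[x] if count[x] else float("inf") for x in range(n)]
    ranked = sorted(range(n), key=lambda x: avg[x])
    adversarial = ranked[:m]
    # Return the true size of the adversarial set and the sketch's estimate for it.
    return m, estimate([h[x] for x in adversarial], k)
```

Keys with small hash values lower the $k$-th smallest hash of any query that contains them, so they accumulate low average scores. A prefix of low-score keys is therefore rich in keys that enter the sketch, and its estimate overshoots its true size $m$; taking the keys with the largest average score instead biases the estimate in the opposite direction.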

Theorems & Definitions (49)

Definition 3.1: Sufficient Statistics
Definition 3.2: bias of the sketch
Theorem 4.1: Utility of Algorithm \ref{standardattack:algo}
Remark 6.2
Definition 6.3: Correct Map
Remark 6.4: Many correct maps
Lemma 6.5: Multiple batches are necessary
proof
Theorem 7.1: Utility of Algorithm \ref{onebatchgen:algo} with symmetric maps
Definition 7.2: symmetric map
...and 39 more
