Unmasking Vulnerabilities: Cardinality Sketches under Adaptive Inputs

Sara Ahmadian, Edith Cohen

TL;DR

This work studies cardinality sketches under adaptive inputs and reveals fundamental vulnerabilities: a simple single-batch attack can bias the standard estimators with $r=O(k)$ queries, and any correct estimator on common sketch designs can be compromised with $r=\tilde{O}(k^2)$ adaptive queries. The authors establish these results theoretically using rank-domain analyses and probabilistic bounds, and validate them empirically by attacking HyperLogLog++ with as few as $4k$ queries, achieving substantial misestimation. They further develop a general attack framework against strategic query-response (QR) algorithms, which requires multiple batches, showing that the vulnerability extends beyond the standard estimators. The findings indicate that robustness guarantees for composable cardinality sketches under adaptive workloads are fundamentally limited, motivating defenses and further exploration across broader sketch families.

Abstract

Cardinality sketches are popular data structures that enhance the efficiency of working with large data sets. The sketches are randomized representations of sets that are only of logarithmic size but can support set merges and approximate cardinality (i.e., distinct count) queries. When queries are not adaptive, that is, they do not depend on preceding query responses, the design provides strong guarantees of correctly answering a number of queries exponential in the sketch size $k$. In this work, we investigate the performance of cardinality sketches in adaptive settings and unveil inherent vulnerabilities. We design an attack against the "standard" estimators that constructs an adversarial input by post-processing responses to a set of simple non-adaptive queries of size linear in the sketch size $k$. Empirically, our attack used only $4k$ queries with the widely used HyperLogLog (HLL++) sketch (Flajolet et al., 2007; Heule et al., 2013). The simple attack technique suggests it can be effective with post-processed natural workloads. Finally and importantly, we demonstrate that the vulnerability is inherent, as any estimator applied to known sketch structures can be attacked using a number of queries that is quadratic in $k$, matching a generic upper bound.
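
To make the objects in the abstract concrete, here is a minimal Python sketch of a bottom-$k$ cardinality sketch with merge and estimate operations. It is an illustration under assumptions, not the paper's code: the function names (`key_hash`, `sketch`, `merge`, `estimate`), the SHA-256 based hash, and the $(k-1)/\tau$ estimator (with $\tau$ the $k$-th smallest hash value) are choices made for this example.

```python
import hashlib
import heapq


def key_hash(key, seed=0):
    """Map a key to a pseudo-random value in [0, 1).

    The SHA-256 construction is an assumption of this example.
    """
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch(keys, k, seed=0):
    """Bottom-k sketch: the k smallest distinct hash values, in ascending order."""
    return heapq.nsmallest(k, {key_hash(x, seed) for x in keys})


def merge(a, b, k):
    """Sketch of the union of two sets, computed from their sketches alone."""
    return heapq.nsmallest(k, set(a) | set(b))


def estimate(sk, k):
    """Exact count below k keys, otherwise (k - 1) / (k-th smallest hash)."""
    if len(sk) < k:
        return float(len(sk))
    return (k - 1) / sk[-1]


if __name__ == "__main__":
    k = 256
    a, b = range(0, 6000), range(4000, 10000)
    merged = merge(sketch(a, k), sketch(b, k), k)
    # The union has 10000 distinct keys; the estimate should land nearby.
    print(round(estimate(merged, k)))
```

The merged sketch is identical to the sketch of the union because the $k$ smallest hashes of a union are always among the $k$ smallest hashes of each part; this is the composability property the abstract refers to.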

Paper Structure

This paper contains 32 sections, 18 theorems, 63 equations, and 3 figures.

Key Result

Theorem 4.1

Consider Algorithm \ref{standardattack:algo} with $k$-mins or bottom-$k$ sketches and $T(S)$ being the inverse of the cardinality estimate as specified in Section \ref{sketches:sec}. For $\alpha > 0$, set the parameters $n=\Omega\left(\frac{1}{\alpha} k \log(kr)\right)$ and $r= O\left(\frac{k}{\alpha^2} \right)$. Then wit

Figures (3)

  • Figure 1: Attack on the HLL++ sketch and estimator, for varying number of queries. Cardinality estimates for the prefix of keys with lowest average score after $r=4^i$ queries.
  • Figure 2: Attack on the HLL++ sketch and estimator, for varying number of queries. Cardinality estimates for the prefix of keys with largest average score after $r=4^i$ queries.
  • Figure 3: Attack on HLL++ for varying sketch sizes while utilizing queries of size 4 times the sketch size.
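
The captions above refer to ranking keys by average score and then querying a prefix of that ranking. The following Python sketch illustrates the idea against a toy bottom-$k$ estimator rather than HLL++. Several details are assumptions of this example and are not taken from the paper's algorithm listing: queries are random subsets that include each key with probability 1/2, a key's score is the average of $T(S) = 1/\text{estimate}$ over the queries containing it, and the names `rank_keys_by_score` and `estimate` as well as the parameter values are hypothetical. The number of queries is chosen generously so that the demonstration is stable.

```python
import hashlib
import heapq
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def key_hash(key, seed=0):
    """Pseudo-random value in [0, 1) for a key (hash choice is an assumption)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate(keys, k, seed=0):
    """Toy bottom-k estimator: exact below k keys, else (k - 1) / k-th smallest hash."""
    hashes = heapq.nsmallest(k, (key_hash(x, seed) for x in keys))
    return float(len(hashes)) if len(hashes) < k else (k - 1) / hashes[-1]


def rank_keys_by_score(universe, k, r, rng, seed=0):
    """Issue r non-adaptive random queries and rank keys by average T(S).

    T(S) = 1 / estimate(S). Keys that tend to lower T(S) when present
    (those with small hash values) end up at the front of the ranking.
    """
    total = dict.fromkeys(universe, 0.0)
    count = dict.fromkeys(universe, 0)
    for _ in range(r):
        query = [x for x in universe if rng.random() < 0.5]
        if not query:
            continue  # skip the (extremely unlikely) empty query
        t = 1.0 / estimate(query, k, seed)
        for x in query:
            total[x] += t
            count[x] += 1

    def avg(x):
        return total[x] / count[x] if count[x] else float("inf")

    return sorted(universe, key=avg)


if __name__ == "__main__":
    k, m = 16, 250
    universe = list(range(1000))
    ranked = rank_keys_by_score(universe, k, r=2000, rng=random.Random(1))
    # Both sets have exactly m distinct keys, yet the estimates differ widely.
    print("lowest-score prefix :", round(estimate(ranked[:m], k)))
    print("highest-score prefix:", round(estimate(ranked[-m:], k)))
```

In this toy setup the lowest-score prefix concentrates the keys with the smallest hash values, so its estimate tends to be well above the true cardinality `m`, while the highest-score prefix tends to be underestimated. The only feedback used is the returned estimates; the attacker never sees the hash function.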

Theorems & Definitions (49)

  • Definition 3.1: Sufficient Statistics
  • Definition 3.2: bias of the sketch
  • Theorem 4.1: Utility of Algorithm \ref{standardattack:algo}
  • Remark 6.2
  • Definition 6.3: Correct Map
  • Remark 6.4: Many correct maps
  • Lemma 6.5: Multiple batches are necessary
  • proof
  • Theorem 7.1: Utility of Algorithm \ref{onebatchgen:algo} with symmetric maps
  • Definition 7.2: symmetric map
  • ...and 39 more