Bayesian Advantage of Re-Identification Attack in the Shuffle Model
Pengcheng Su, Haibo Cheng, Ping Wang
TL;DR
The paper analyzes re-identification risk in the shuffle model. It formalizes the Bayes-optimal success probability $\beta_n(P,Q)$ of identifying the one sample drawn from $P$ among $n-1$ samples drawn from $Q$ after shuffling, and defines the additive and multiplicative Bayesian advantages relative to the random-guessing baseline $1/n$. It gives an exact expression for $\beta_n(P,Q)$ in terms of likelihood-ratio distributions, characterizes its asymptotics, and relates the additive advantage to the total variation distance $\Delta(P,Q)$ through (nearly) tight mutual bounds. The authors extend the analysis to shuffle differential privacy, showing that for an $\varepsilon$-DP local randomizer $\mathcal{R}$ the re-identification probability satisfies $\beta_n(\mathcal{R}) \le \frac{e^{\varepsilon}}{n}$, and they develop a decomposition-based framework (clone and blanket) to obtain tight bounds, with the blanket approach shown to be optimal among decompositions. The results give a principled account of anonymity leakage in shuffle-based systems and quantitative guidance for honeyword-style defenses and privacy amplification in shuffle DP, linking information-theoretic attack analysis to practical privacy guarantees.
Abstract
The shuffle model, which anonymizes data by randomly permuting user messages, has been widely adopted in both cryptography and differential privacy. In this work, we present the first systematic study of the Bayesian advantage in re-identifying a user's message under the shuffle model. We begin with a basic setting: one sample is drawn from a distribution $P$, and $n - 1$ samples are drawn from a distribution $Q$, after which all $n$ samples are randomly shuffled. We define $\beta_n(P, Q)$ as the success probability of a Bayes-optimal adversary in identifying the sample from $P$, and define the additive and multiplicative Bayesian advantages as $\mathsf{Adv}_n^{+}(P, Q) = \beta_n(P,Q) - \frac{1}{n}$ and $\mathsf{Adv}_n^{\times}(P, Q) = n \cdot \beta_n(P,Q)$, respectively. We derive exact analytical expressions and asymptotic characterizations of $\beta_n(P, Q)$, along with evaluations in several representative scenarios. Furthermore, we establish (nearly) tight mutual bounds between the additive Bayesian advantage and the total variation distance. Finally, we extend our analysis beyond the basic setting and present, for the first time, an upper bound on the success probability of Bayesian attacks in shuffle differential privacy. Specifically, when the outputs of $n$ users -- each processed through an $\varepsilon$-differentially private local randomizer -- are shuffled, the probability that an attacker successfully re-identifies any target user's message is at most $e^{\varepsilon}/n$.
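
As a rough illustration of the basic setting (a brute-force sketch, not the paper's analytical expression), the success probability can be computed exactly for small finite alphabets. The joint probability of observing the shuffled output $y$ with the $P$-sample at position $i$ is $\frac{1}{n} P(y_i) \prod_{j \ne i} Q(y_j)$, and a Bayes-optimal adversary picks the maximizing $i$, so summing that maximum over all $y$ gives the success probability. The function names `beta_n` and `advantages`, and the binary randomized-response example used to probe the $e^{\varepsilon}/n$ bound, are assumptions made for this sketch.

```python
from itertools import product
from math import exp, prod


def beta_n(P, Q, n):
    """Exact Bayes-optimal re-identification probability by enumeration.

    P and Q are lists of probabilities over the same alphabet {0, ..., k-1}.
    The cost grows as O(n^2 * k**n), so this is only meant for tiny cases.
    """
    k = len(P)
    total = 0.0
    for y in product(range(k), repeat=n):
        # Joint probability that the output is y and the P-sample sits at
        # position i is (1/n) * P(y_i) * prod_{j != i} Q(y_j).
        # The adversary guesses the position that maximizes it.
        best = max(
            P[y[i]] * prod(Q[y[j]] for j in range(n) if j != i)
            for i in range(n)
        )
        total += best / n
    return total


def advantages(P, Q, n):
    """Return (additive, multiplicative) advantages as defined in the abstract."""
    b = beta_n(P, Q, n)
    return b - 1.0 / n, n * b


if __name__ == "__main__":
    # Hypothetical example: binary randomized response with parameter eps,
    # where P and Q are the output distributions for the two possible inputs.
    eps = 1.0
    p = exp(eps) / (1.0 + exp(eps))
    P, Q = [p, 1.0 - p], [1.0 - p, p]
    for n in (2, 3, 5, 8):
        b = beta_n(P, Q, n)
        print(f"n={n}  beta_n={b:.4f}  1/n={1.0 / n:.4f}  e^eps/n={exp(eps) / n:.4f}")
```

In this example $P(y) \le e^{\varepsilon} Q(y)$ for every output $y$, so each term in the sum is at most $\frac{e^{\varepsilon}}{n} \prod_j Q(y_j)$ and the computed value stays below $e^{\varepsilon}/n$, in line with the stated bound. When $P = Q$ the function returns $1/n$, the random-guessing baseline.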
