Auditing Privacy Mechanisms via Label Inference Attacks

Róbert István Busa-Fekete; Travis Dick; Claudio Gentile; Andrés Muñoz Medina; Adam Smith; Marika Swanberg

TL;DR

This paper introduces reconstruction advantage measures to audit privacy mechanisms for private labels, defining additive and multiplicative leakage against distributional adversaries. It analyzes two families of privacy-enhancing technologies (PETs), randomized response (RR) and learning from label proportions (LLP), in both their DP and non-DP variants, and provides theoretical bounds that show how leakage scales with bag size and label correlations. Empirical results on synthetic data and real tasks (Higgs, KDD12) show that differentially private schemes commonly achieve equal or better privacy-utility tradeoffs than aggregation-based methods, while aggregation alone offers limited benefit. Overall, the framework enables fair, distribution-aware comparisons across privatization strategies and supports adopting DP-based label privatization in practice.

Abstract

We propose reconstruction advantage measures to audit label privatization mechanisms. A reconstruction advantage measure quantifies the increase in an attacker's ability to infer the true label of an unlabeled example when provided with a private version of the labels in a dataset (e.g., aggregate of labels from different users or noisy labels output by randomized response), compared to an attacker that only observes the feature vectors, but may have prior knowledge of the correlation between features and labels. We consider two such auditing measures: one additive, and one multiplicative. These incorporate previous approaches taken in the literature on empirical auditing and differential privacy. The measures allow us to place a variety of proposed privatization schemes (some differentially private, some not) on the same footing. We analyze these measures theoretically under a distributional model which encapsulates reasonable adversarial settings. We also quantify their behavior empirically on real and simulated prediction tasks. Across a range of experimental settings, we find that differentially private schemes dominate or match the privacy-utility tradeoff of more heuristic approaches.
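As a rough illustration of the prior-versus-posterior comparison described in the abstract, the sketch below applies binary randomized response to a label and computes the Bayes posterior of an attacker whose prior is $P(y = 1 \mid x)$. The function names and the `expected_additive_gain` quantity are assumptions made for illustration only; they are one plausible reading of an additive advantage, not the paper's formal definition.

```python
import math
import random


def rr_flip_prob(eps: float) -> float:
    """Probability that binary randomized response flips a label."""
    return 1.0 / (1.0 + math.exp(eps))


def randomized_response(y: int, eps: float, rng: random.Random) -> int:
    """Release y with probability e^eps / (1 + e^eps), otherwise release 1 - y."""
    return 1 - y if rng.random() < rr_flip_prob(eps) else y


def rr_posterior(prior: float, noisy: int, eps: float) -> float:
    """Bayes posterior P(y = 1 | noisy label) for an attacker with prior P(y = 1 | x)."""
    q = rr_flip_prob(eps)
    like1 = (1 - q) if noisy == 1 else q  # P(noisy | y = 1)
    like0 = q if noisy == 1 else (1 - q)  # P(noisy | y = 0)
    return prior * like1 / (prior * like1 + (1 - prior) * like0)


def guess_accuracy(p1: float) -> float:
    """Accuracy of the best single guess when P(y = 1) = p1."""
    return max(p1, 1 - p1)


def expected_additive_gain(prior: float, eps: float) -> float:
    """Illustrative (hypothetical) additive gain: expected accuracy of the
    informed attacker minus the accuracy of the prior-only attacker."""
    q = rr_flip_prob(eps)
    p_noisy1 = prior * (1 - q) + (1 - prior) * q  # P(noisy label = 1)
    informed = (
        p_noisy1 * guess_accuracy(rr_posterior(prior, 1, eps))
        + (1 - p_noisy1) * guess_accuracy(rr_posterior(prior, 0, eps))
    )
    return informed - guess_accuracy(prior)
```

Under this sketch, observing a noisy label of 1 multiplies the attacker's odds that $y = 1$ by exactly $e^{\epsilon}$, which is the sense in which randomized response caps a multiplicative change in belief, while the additive gain also depends on how informative the prior already was.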

Paper Structure (39 sections, 24 theorems, 135 equations, 5 figures)

Introduction
Our contribution.
Prior Work.
Preliminaries
Learning from privatized labels.
Auditing Large-Scale Label Inference
Bounding the Additive Attack Advantage
Multiplicative Attack Advantage
Connection to prior work on auditing and membership inference, and distributional DP
Experiments
Mechanisms.
Estimating class conditionals for advantage and PET Utility.
Results.
Utility vs. advantage tradeoff on benchmark datasets.
Discussion and Conclusions
...and 24 more sections

Key Result

Theorem 3.2

Fix a data distribution $\mathcal{D}$, let $p = \mathbb{P}_{(x,y) \sim \mathcal{D}}(y = 1)$, and fix an arbitrary threshold $\beta \in [0,1/2]$. If labels are independent of features (i.e., $\mathcal{D}$ is a product of distributions over $\mathcal{X}$ a […] where $\Omega(\cdot)$ hides constants independent of $\beta$ and $k$.

Figures (5)

Figure 1: Prior-posterior scatter plots for LLP, RR, and LLP+Geom from two synthetic datasets (where the prior $\eta(x)$ is drawn) and the two real-world datasets (where $\eta(x)$ is approximated). The colors of the dots correspond to different parameter values for the PETs. For each bag size $k$ and distribution, we performed 1000 independent runs. The further a point is from the dotted $y=x$ line, the more is revealed about its label as a result of the PET.
Figure 2: Top: Prior-posterior scatter plots for RR (grey), LLP (blue), and LLP+Geom (orange) with $\epsilon = 1$ and $k=8$ on the same datasets as in Figure 1. With this choice of parameters, the three mechanisms achieve roughly the same AUC on Higgs. Middle: Empirical CDFs of (the absolute value of) the multiplicative advantage for the three PETs on the four datasets. Bottom: CDFs of the additive advantage.
Figure 3: Privacy vs utility tradeoff curves for the various PETs on the Higgs and KDD12 datasets. Utility is measured by AUC on the test set, while privacy is either the additive measure (bottom row) or the 98th percentile of the multiplicative measure (top row; the percentile is used to rule out the infinite multiplicative advantage cases that can occur for LLP). Each point corresponds to a setting of the privacy parameter for the PET ($\epsilon$ for RR, $k$ for LLP, and both for LLP+Geom). The $x$-coordinate is the advantage (either additive or multiplicative) value for that PET, while the $y$-coordinate is the test AUC of a model trained from the output of that PET. The AUC of the model trained without a PET roughly corresponds to the top value achieved by these curves.
Figure 4: Prior-posterior scatter plots for LLP+Geom and LLP+Lap on two synthetic datasets and the two real-world datasets. The two synthetic datasets were generated by drawing $\eta(x)$ from a Beta(2,30) distribution and a uniform distribution on $[0,1]$. The colors of the dots correspond to different parameter values for the PETs. For each bag size $k$ and distribution, we performed 1000 independent runs. The further a point is from the dotted $y=x$ line, the more is revealed about its label as a result of the PET.
Figure 5: Privacy vs utility tradeoff curves for the various PETs on the Higgs (left) and KDD12 (right) datasets. Utility is measured by AUC on the test set, while privacy is either the additive measure (bottom row) or the 98th percentile of the multiplicative measure (top row; the percentile is used to rule out the infinite multiplicative advantage cases that can occur for LLP). Each point corresponds to a setting of the privacy parameter for the PET ($\epsilon$ for RR, $k$ for LLP, and both for LLP+Geom). The $x$-coordinate is the advantage (either additive or multiplicative) value for that PET, while the $y$-coordinate is the test AUC of a model trained from the output of that PET. The AUC of the model trained without a PET roughly corresponds to the top value achieved by these curves. This plot is similar to Figure 3, except that the curves for the LLP+Geom PET correspond to a fixed value of $\epsilon$ rather than a fixed value of $k$.
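
The prior-posterior comparison plotted for LLP in the figures above can be sketched as follows. This is a minimal illustration, assuming the labels in a bag are independent Bernoulli draws with known priors $\eta(x_i)$ and that the attacker observes the exact label sum of the bag (plain LLP, without added noise). The function names are hypothetical, and the independence assumption is made here for simplicity rather than taken from the paper.

```python
from typing import List


def sum_pmf(priors: List[float]) -> List[float]:
    """PMF of a sum of independent Bernoulli labels (Poisson binomial),
    computed by dynamic programming over the examples."""
    pmf = [1.0]
    for p in priors:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1 - p)   # this label is 0
            nxt[s + 1] += mass * p     # this label is 1
        pmf = nxt
    return pmf


def llp_posterior(priors: List[float], i: int, bag_sum: int) -> float:
    """P(y_i = 1 | sum of the bag's labels = bag_sum) by Bayes' rule,
    assuming labels are independent given the features."""
    full = sum_pmf(priors)
    if full[bag_sum] == 0.0:
        raise ValueError("observed sum has probability zero under the priors")
    others = sum_pmf(priors[:i] + priors[i + 1:])
    # y_i = 1 requires the remaining labels to sum to bag_sum - 1.
    rest = others[bag_sum - 1] if bag_sum >= 1 else 0.0
    return priors[i] * rest / full[bag_sum]


# A bag of k = 4 examples, each with prior 0.1, whose labels all turn out to be 1.
posterior = llp_posterior([0.1, 0.1, 0.1, 0.1], 0, 4)
```

In this example the released sum equals the bag size, so the posterior jumps from 0.1 to 1 and every label in the bag is revealed. This is the kind of event behind the infinite multiplicative advantage cases mentioned in the captions for Figures 3 and 5.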

Theorems & Definitions (47)

Definition 2.1
Definition 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Definition 3.6
Theorem 3.7
Lemma A.1
proof
...and 37 more
