Scalable DP-SGD: Shuffling vs. Poisson Subsampling

Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang

TL;DR

Using new lower bounds on the privacy guarantee of ABLQ with shuffling, we compare the utility of models trained with Poisson-subsampling-based DP-SGD against optimistic estimates of the utility achievable with shuffling.

Abstract

We provide new lower bounds on the privacy guarantee of the multi-epoch Adaptive Batch Linear Queries (ABLQ) mechanism with shuffled batch sampling, demonstrating substantial gaps when compared to Poisson subsampling; prior analysis was limited to a single epoch. Since the privacy analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) is obtained by analyzing the ABLQ mechanism, this brings into serious question the common practice of implementing shuffling-based DP-SGD but reporting privacy parameters as if Poisson subsampling were used. To understand the impact of this gap on the utility of trained machine learning models, we introduce a practical approach to implement Poisson subsampling at scale using massively parallel computation, and use it to train models efficiently. We compare the utility of models trained with Poisson-subsampling-based DP-SGD against optimistic estimates of the utility when using shuffling, obtained via our new lower bounds on the privacy guarantee of ABLQ with shuffling.

Paper Structure

This paper contains 20 sections, 7 theorems, 14 equations, 7 figures, 4 algorithms.

Key Result

Theorem 3.1

For all $\sigma > 0$, $\varepsilon \ge 0$, and valid $n$, $b$, $T$, it holds that $\ldots$, where $\Phi(\cdot)$ is the cumulative distribution function (CDF) of the standard normal random variable $\mathcal{N}(0, 1)$.
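
For orientation, expressions of this kind, built from $\Phi(\cdot)$, $\sigma$ and $\varepsilon$, can be evaluated numerically with the standard library alone. The sketch below evaluates the well-known closed form for the Gaussian mechanism with sensitivity $1$ and noise scale $\sigma$, namely $\Phi(-\varepsilon\sigma + \tfrac{1}{2\sigma}) - e^{\varepsilon}\,\Phi(-\varepsilon\sigma - \tfrac{1}{2\sigma})$. It is an assumption that this form is relevant here; it is not presented as the exact statement of Theorem 3.1, and the function names `std_normal_cdf` and `gaussian_delta` are hypothetical.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Phi(x): CDF of the standard normal N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_delta(eps: float, sigma: float) -> float:
    """Hockey-stick value delta(eps) for a sensitivity-1 Gaussian mechanism.

    Illustrative assumption: sensitivity is fixed to 1 and a single release
    is considered; this is not the paper's multi-epoch ABLQ analysis.
    """
    if sigma <= 0 or eps < 0:
        raise ValueError("requires sigma > 0 and eps >= 0")
    half_inv = 1.0 / (2.0 * sigma)
    return (std_normal_cdf(half_inv - eps * sigma)
            - math.exp(eps) * std_normal_cdf(-half_inv - eps * sigma))


if __name__ == "__main__":
    # Larger noise (sigma) or a larger eps both shrink delta.
    for sigma in (0.5, 1.0, 2.0):
        print(sigma, gaussian_delta(1.0, sigma))
```

Using `math.erfc` rather than `1 + math.erf(...)` keeps the lower tail of $\Phi$ accurate, which matters because the $\delta$ values of interest are typically very small.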

Figures (7)

  • Figure 1: Various natural instantiations of the permutation batch sampler.
  • Figure 2: Visualization of the massively parallel computation approach for Poisson subsampling at scale. Consider $6$ records $x_1, \ldots, x_6$ sub-sampled into $4$ batches with a maximum batch size of $B=2$. The Map operation adds a "weight" parameter of $1$ to all examples, and samples indices of batches to which each example will belong. The Reduce operation groups by the batch indices. The final Map operation truncates batches with more than $B$ examples (e.g., batches $1$ and $3$ above), and pads dummy examples with weight $0$ in batches with fewer than $B$ examples (e.g., batch $4$ above).
  • Figure 3: AUC (left) and bounds on $\sigma_{\mathcal{B}}$ values (middle) for $\varepsilon = 5, \delta=2.7 \cdot 10^{-8}$ and using $1$ epoch (top) and $5$ epochs (bottom) of training on a linear-log scale; AUC (right) is with non-private training.
  • Figure 4: AUC (left) and $\sigma$ values (right) with varying $\varepsilon$, fixing $\delta = 2.7 \cdot 10^{-8}$ and using (top) $1$ epoch and (bottom) $5$ epochs of training. $\sigma$ is in log scale to highlight the differences at high $\varepsilon$.
  • Figure 5: Comparison of an optimistic estimate of the upper bound on $\sigma_{\mathcal{S}}$ from feldman21hiding against the lower bound on $\sigma_{\mathcal{S}}$ in \ref{thm:SP-hockey-lower} and $\sigma_{\mathcal{D}}$.
  • ...and 2 more figures
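
The Map, Reduce, Map pipeline described in the Figure 2 caption can be sketched in a single process as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `poisson_batches`, its parameters, the per-batch inclusion probability `sample_prob`, the use of `None` as the dummy example, and the choice to keep the first $B$ examples when truncating are all hypothetical. A real deployment would run each stage on a massively parallel framework rather than in Python loops.

```python
import random
from collections import defaultdict


def poisson_batches(records, num_batches, sample_prob, max_batch_size,
                    seed=0, dummy=None):
    """Poisson-subsample `records` into fixed-size batches of (example, weight) pairs."""
    rng = random.Random(seed)

    # Map: give every example weight 1 and independently sample the
    # indices of the batches it will belong to.
    keyed = []
    for rec in records:
        for t in range(num_batches):
            if rng.random() < sample_prob:
                keyed.append((t, (rec, 1.0)))

    # Reduce: group the weighted examples by batch index.
    groups = defaultdict(list)
    for t, item in keyed:
        groups[t].append(item)

    # Map: truncate batches with more than B examples (assumption: keep
    # the first B) and pad smaller ones with weight-0 dummy examples.
    batches = []
    for t in range(num_batches):
        batch = groups[t][:max_batch_size]
        batch += [(dummy, 0.0)] * (max_batch_size - len(batch))
        batches.append(batch)
    return batches


if __name__ == "__main__":
    # Mirrors the caption: 6 records, 4 batches, maximum batch size B = 2.
    records = ["x1", "x2", "x3", "x4", "x5", "x6"]
    for i, batch in enumerate(poisson_batches(records, 4, 0.3, 2, seed=1), 1):
        print(i, batch)
```

Every returned batch has exactly `max_batch_size` entries, so downstream training code can use a fixed batch shape and multiply each per-example gradient by its weight so that dummy examples contribute nothing.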

Theorems & Definitions (10)

  • Definition 2.1: DP
  • Definition 2.2: Dominating Pair zhu22optimal
  • Theorem 3.1: balle18improving
  • Proposition 3.2
  • proof
  • Theorem 3.3
  • Proposition 3.4
  • Theorem 3.5
  • Proposition 3.6: DP Post-processing dwork14algorithmic
  • Theorem 3.7