The Cost of Shuffling in Private Gradient Based Optimization

Shuli Jiang, Pranay Sharma, Zhiwei Steven Wu, Gauri Joshi

TL;DR

The paper analyzes differentially private convex ERM solved via shuffled gradient methods and shows that DP-ShuffleG suffers worse empirical excess risk than DP-SGD due to reduced randomness under privacy. To mitigate this, it introduces a Generalized Shuffled Gradient Framework with surrogate objectives, adaptive noise, and a dissimilarity measure, enabling convergence analysis that accounts for surrogate-epoch differences. It then proposes Interleaved-ShuffleG, which interleaves private and public data within each epoch to leverage cheap public data while maintaining privacy, and provides a rigorous convergence/privacy treatment via privacy amplification by iteration (PABI) and Stein's lemma. Empirical results on diverse tasks demonstrate that Interleaved-ShuffleG consistently achieves lower empirical excess risk than DP-ShuffleG and public-data baselines, especially under strong privacy constraints, highlighting a practical route to improve private shuffled optimization.
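The difference in sampling randomness mentioned above can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy per-sample loss f_i(x) = 0.5 (x - a_i)^2, the function names, and the uncalibrated noise scale `sigma` are all assumptions made for illustration, and no privacy accounting is performed.

```python
import random

def shuffled_indices(n, epochs, rng):
    """Shuffled order: every epoch visits each of the n samples exactly once."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)  # fresh permutation per epoch
        order.extend(perm)
    return order

def with_replacement_indices(n, epochs, rng):
    """SGD-style order: every step draws a sample uniformly with replacement."""
    return [rng.randrange(n) for _ in range(n * epochs)]

def noisy_gradient_run(data, indices, lr, sigma, rng, x0=0.0):
    """Noisy gradient steps on the toy loss f_i(x) = 0.5 * (x - a_i)^2.

    sigma is a hypothetical noise scale; it is NOT calibrated to any
    (epsilon, delta) guarantee here.
    """
    x = x0
    for i in indices:
        grad = x - data[i]
        x -= lr * (grad + rng.gauss(0.0, sigma))
    return x
```

Passing the output of `shuffled_indices` or `with_replacement_indices` to `noisy_gradient_run` gives the two variants; only the index sequence differs, which is the source of randomness the analysis compares.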

Abstract

We consider the problem of differentially private (DP) convex empirical risk minimization (ERM). While the standard DP-SGD algorithm is theoretically well-established, practical implementations often rely on shuffled gradient methods that traverse the training data sequentially rather than sampling with replacement in each iteration. Despite their widespread use, the theoretical privacy-accuracy trade-offs of private shuffled gradient methods (DP-ShuffleG) remain poorly understood, leading to a gap between theory and practice. In this work, we leverage privacy amplification by iteration (PABI) and a novel application of Stein's lemma to provide the first empirical excess risk bound of DP-ShuffleG. Our result shows that data shuffling results in worse empirical excess risk for DP-ShuffleG compared to DP-SGD. To address this limitation, we propose Interleaved-ShuffleG, a hybrid approach that integrates public data samples in private optimization. By alternating optimization steps that use private and public samples, Interleaved-ShuffleG effectively reduces empirical excess risk. Our analysis introduces a new optimization framework with surrogate objectives, adaptive noise injection, and a dissimilarity metric, which can be of independent interest. Our experiments on diverse datasets and tasks demonstrate the superiority of Interleaved-ShuffleG over several baselines.
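One way to picture alternating private and public steps within an epoch is the sketch below. The schedule is an assumption, since the abstract does not specify it: here each epoch first takes noisy steps over the shuffled private samples and then noise-free steps over the shuffled public samples. The toy loss f_i(x) = 0.5 (x - a_i)^2, the name `interleaved_epoch`, and the noise scale `sigma` are hypothetical.

```python
import random

def interleaved_epoch(x, private, public, lr, sigma, rng):
    """One epoch of an assumed interleaving schedule on a toy 1-D loss.

    Private steps add Gaussian noise with (uncalibrated) scale sigma;
    public steps add none, on the assumption that public samples need
    no privacy protection.
    """
    priv_order = list(range(len(private)))
    rng.shuffle(priv_order)
    for i in priv_order:
        grad = x - private[i]
        x -= lr * (grad + rng.gauss(0.0, sigma))

    pub_order = list(range(len(public)))
    rng.shuffle(pub_order)
    for j in pub_order:
        x -= lr * (x - public[j])
    return x

# Usage: run a few epochs from x = 0 with a fixed seed.
rng = random.Random(0)
x = 0.0
for _ in range(5):
    x = interleaved_epoch(x, [1.0, 2.0, 3.0], [2.0, 2.0], 0.1, 0.5, rng)
```

With an empty `public` list the function reduces to a plain noisy shuffled epoch, which makes it easy to compare the two settings under the same seed.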

Paper Structure

This paper contains 39 sections, 29 theorems, 167 equations, 6 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Under Assumptions ass:convexity, ass:smoothness, ass:reg, ass:H_smoothness, ass:dissim_partial_lipschitzness, for $\beta > 0$, if $\mu_{\psi} \geq L_H^{(s)} + \beta$, $\forall s\in [K]$, and $\eta \lesssim \frac{1}{n L^* \sqrt{1+\log K}}$, Algorithm alg:generalized_shuffled_gradient_fm guarantees ... and the expectation is taken w.r.t. the injected noise $\{\rho_i^{(s)}\}$ and the order of samples ...
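To get a feel for the step-size condition $\eta \lesssim \frac{1}{n L^* \sqrt{1+\log K}}$, the following sketch evaluates its right-hand side. The constant hidden by $\lesssim$ is not given in this extract, so `c` is a hypothetical placeholder, and the natural logarithm is assumed.

```python
import math

def step_size_scale(n, l_star, num_epochs, c=1.0):
    """Evaluate c / (n * L^* * sqrt(1 + log K)).

    c stands in for the unspecified constant hidden by the
    "lesssim" notation; the natural log is an assumption.
    """
    return c / (n * l_star * math.sqrt(1.0 + math.log(num_epochs)))

# With n = 100 samples and L^* = 1, a single epoch gives 0.01;
# the scale shrinks only slowly (via sqrt(log K)) as K grows.
eta_one_epoch = step_size_scale(100, 1.0, 1)
eta_fifty_epochs = step_size_scale(100, 1.0, 50)
```

The dependence on $n$ is linear while the dependence on $K$ is only through $\sqrt{1+\log K}$, so the number of samples dominates the admissible step size.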

Figures (6)

  • Figure 1: Illustration of algorithms that use public data.
  • Figure 2: Results on each dataset across different tasks. Each algorithm runs for $K=50$ epochs, with privacy loss ${\epsilon} \in \{1, 5, 10\}$ and $\delta=10^{-6}$. The solid lines represent the mean performance, while the shaded regions denote one std. across 10 random runs.
  • Figure 3: Results of comparing IG-based algorithms on two datasets.
  • Figure 4: Results of comparing SO-based algorithms on two datasets.
  • Figure 5: Results of using different fractions of private samples for $p\in \{0.25, 0.75\}$ on dataset CreditCard.
  • ...and 1 more figure

Theorems & Definitions (50)

  • Definition 1: Differential Privacy (DP) dwork2014algorithmic
  • Theorem 1: Convergence of Generalized Shuffled Gradient Framework
  • Corollary 2: Convergence of DP-ShuffleG
  • Lemma 1: Privacy of DP-ShuffleG
  • Definition 2: Differential Privacy (DP) dwork2014algorithmic
  • Definition 3: Rényi Divergence
  • Definition 4: $(\alpha, {\epsilon})$-Rényi Differential Privacy (RDP) Mironov2017rdp
  • Proposition 3: From RDP to DP (Proposition 3 of Mironov2017rdp)
  • Proposition 4: RDP Composition (Proposition 1 of Mironov2017rdp)
  • Definition 5: Contraction (Definition 16 of Feldman2018privacy_amp_iter)
  • ...and 40 more
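
The two RDP propositions listed above can be sketched in a few lines. This uses the standard statements from Mironov2017rdp: RDP guarantees at a fixed order $\alpha$ add up under composition, and $(\alpha, \epsilon)$-RDP implies $(\epsilon + \log(1/\delta)/(\alpha-1), \delta)$-DP. The Gaussian-mechanism RDP bound $\alpha \Delta^2 / (2\sigma^2)$ is also standard. The function names are hypothetical, and this plain composition is not the paper's PABI-based accounting.

```python
import math

def gaussian_rdp(alpha, sensitivity, sigma):
    """RDP epsilon of the Gaussian mechanism at order alpha."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

def rdp_compose(eps_list):
    """RDP composition at a fixed order alpha: epsilons add."""
    return sum(eps_list)

def rdp_to_dp(alpha, eps_rdp, delta):
    """Convert (alpha, eps_rdp)-RDP to (eps, delta)-DP; returns eps."""
    if alpha <= 1 or not 0 < delta < 1:
        raise ValueError("need alpha > 1 and 0 < delta < 1")
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)

# Usage: compose 50 Gaussian releases (illustrative sigma = 20,
# sensitivity 1) at order alpha = 10, then convert with delta = 1e-6.
per_step = gaussian_rdp(10, 1.0, 20.0)
eps_dp = rdp_to_dp(10, rdp_compose([per_step] * 50), 1e-6)
```

In practice one would minimize the resulting epsilon over a grid of orders $\alpha$; a single fixed order is used here to keep the sketch short.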