Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach

Xinwei Zhang, Zhiqi Bu, Zhiwei Steven Wu, Mingyi Hong

TL;DR

DiceSGD tackles the clipping bias problem in differentially private SGD by incorporating a clipped error-feedback mechanism that debiases the gradient updates while preserving $(\epsilon,\delta)$-DP via Gaussian noise. The method provides convergence guarantees for non-convex, Lipschitz-smooth objectives using a tailored Rényi-DP analysis and yields a utility bound of $\mathcal{O}(1/\sqrt{T})$, with DP noise that is slightly larger due to the non-privatized error state. The paper proves that clipping thresholds can be chosen independently of problem constants, circumventing the traditional clipping-tuning issue, and demonstrates superior empirical performance over DPSGD-GC on CIFAR-10/100 and E2E GPT-2 tasks. Overall, DiceSGD offers a practical, principled approach to private training that preserves performance while maintaining strong privacy guarantees.

Abstract

Differentially Private Stochastic Gradient Descent with Gradient Clipping (DPSGD-GC) is a powerful tool for training deep learning models using sensitive data, providing both a solid theoretical privacy guarantee and high efficiency. However, using DPSGD-GC to ensure Differential Privacy (DP) comes at the cost of model performance degradation due to DP noise injection and gradient clipping. Existing research has extensively analyzed the theoretical convergence of DPSGD-GC and has shown that it only converges when using large clipping thresholds that depend on problem-specific parameters. Unfortunately, these parameters are often unknown in practice, making it hard to choose the optimal clipping threshold. Therefore, in practice, DPSGD-GC suffers from degraded performance due to the constant bias introduced by the clipping. In our work, we propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC, which not only offers a diminishing utility bound without inducing a constant clipping bias but, more importantly, allows for an arbitrary choice of clipping threshold that is independent of the problem. We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on Rényi DP. Additionally, we demonstrate that under mild conditions, our algorithm can achieve nearly the same utility bound as DPSGD without gradient clipping. Our empirical results on the CIFAR-10/100 and E2E datasets show that the proposed algorithm achieves higher accuracies than DPSGD while maintaining the same level of DP guarantee.
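To make the idea of a clipped error-feedback update concrete, here is a minimal pure-Python sketch of one step. This page does not spell out the update rule, so the form below is an assumption: per-sample gradients are clipped at $C_1$, the accumulated clipping error is clipped at $C_2$ and added back, Gaussian noise is applied to the resulting update, and the error state is kept internally without noise. The function names `clip` and `dice_sgd_step` and all parameter names are hypothetical and are not taken from the paper.

```python
import math
import random


def clip(vec, threshold):
    """Rescale vec so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [scale * x for x in vec]


def dice_sgd_step(params, per_sample_grads, error, lr, c1, c2, sigma, rng):
    """One clipped error-feedback step (illustrative sketch, not the authors' code)."""
    batch = len(per_sample_grads)
    dim = len(params)
    clipped_mean = [0.0] * dim
    raw_mean = [0.0] * dim
    for grad in per_sample_grads:
        clipped = clip(grad, c1)  # per-sample clipping at C1, as in DPSGD-GC
        for j in range(dim):
            clipped_mean[j] += clipped[j] / batch
            raw_mean[j] += grad[j] / batch
    # Feed back the accumulated clipping error, itself clipped at C2.
    fed_back = clip(error, c2)
    update = [clipped_mean[j] + fed_back[j] for j in range(dim)]
    # Gaussian noise is added to the update that moves the model.
    noisy = [update[j] + rng.gauss(0.0, sigma) for j in range(dim)]
    new_params = [params[j] - lr * noisy[j] for j in range(dim)]
    # The error state (assumed non-private) accumulates what clipping removed.
    new_error = [error[j] + raw_mean[j] - update[j] for j in range(dim)]
    return new_params, new_error


# Usage: a single gradient of norm 5 clipped at C1 = 1, with noise switched off.
rng = random.Random(0)
params, error = dice_sgd_step(
    [0.0, 0.0], [[3.0, 4.0]], [0.0, 0.0],
    lr=0.1, c1=1.0, c2=1.0, sigma=0.0, rng=rng,
)
```

With noise disabled, the part of the gradient that clipping removed ends up in `error` and is fed back in later steps, which is how error feedback can counteract a constant clipping bias in this reading.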

Paper Structure

This paper contains 30 sections, 13 theorems, 73 equations, 5 figures, 4 tables, 4 algorithms.

Key Result

Theorem 2.3

Given $N$, $B$, $T$, and $C$, there exist positive constants $u,v$ such that for any $\epsilon<\frac{uB^2T}{N^2}$ and $\delta>0$, choosing $\sigma_1^2\geq v\frac{C^2T\ln(\frac{1}{\delta})}{N^2\epsilon^2}$ guarantees that the DPSGD-GC algorithm is $(\epsilon,\delta)$-DP.
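As a rough illustration of how the bound in Theorem 2.3 scales, the sketch below evaluates the noise-variance lower bound and the admissible range of $\epsilon$. The theorem only asserts that the constants $u$ and $v$ exist, so the default values `u=1.0` and `v=1.0` are placeholders rather than values from the paper, and both function names are hypothetical.

```python
import math


def min_noise_variance(clip_c, steps, n, epsilon, delta, v=1.0):
    """Lower bound on sigma_1^2: v * C^2 * T * ln(1/delta) / (N^2 * epsilon^2).

    v is an unspecified positive constant in the theorem; 1.0 is a placeholder.
    """
    return v * clip_c ** 2 * steps * math.log(1.0 / delta) / (n ** 2 * epsilon ** 2)


def epsilon_in_range(epsilon, batch, steps, n, u=1.0):
    """Check the condition epsilon < u * B^2 * T / N^2 (u is a placeholder constant)."""
    return epsilon < u * batch ** 2 * steps / n ** 2


# Usage: doubling the clipping threshold C quadruples the required variance.
base = min_noise_variance(clip_c=1.0, steps=1000, n=50000, epsilon=1.0, delta=1e-5)
doubled = min_noise_variance(clip_c=2.0, steps=1000, n=50000, epsilon=1.0, delta=1e-5)
```

The bound grows linearly in $T$ and quadratically in $C$, and shrinks quadratically in both $N$ and $\epsilon$, so a smaller clipping threshold or a larger dataset lowers the noise required for the same guarantee.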

Figures (5)

  • Figure 1: The flow diagram of DiceSGD. The clipped EF components are highlighted in red, and DP components are marked in yellow. $z^{-1}$ denotes the unit delay.
  • Figure 2: Testing accuracy on CIFAR-10 and CIFAR-100 for models trained with DiceSGD under different $C_1, C_2$ with a fixed effective stepsize.
  • Figure 3: Testing accuracy on CIFAR-10 and CIFAR-100 for models trained with DiceSGD under different $C_1, \eta$ with fixed $C_2 = C_1$.
  • Figure 4: Testing loss of DPSGD and DiceSGD when fine-tuning GPT-2 on the E2E dataset, with clipping thresholds $C = C_1 = C_2 = 1$ and an $(8,8\times 10^{-6})$-DP guarantee.
  • Figure 5: Testing loss (lower is better) of DiceSGD on the E2E dataset with different combinations of clipping thresholds and initial stepsizes.

Theorems & Definitions (21)

  • Definition 2.1: $(\epsilon,\delta)$-DP [dwork2006differential]
  • Definition 2.2: Gaussian Mechanism [dwork2006differential]
  • Theorem 2.3: Theorem 1 of [abadi2016deep]
  • Theorem 3.6
  • Theorem 3.7
  • Corollary 3.8
  • Lemma A.1
  • Theorem A.2
  • Theorem A.3
  • Definition A.4: Rényi-DP [mironov2017renyi]
  • ...and 11 more