DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning

Jay Chooi, Kevin Cong, Russell Li, Lillian Sun

TL;DR

The paper investigates differentially private optimization by introducing DP-AdamW and DP-AdamW-BC, variants of AdamW that decouple weight decay and apply DP noise to gradients. It proves that both retain DP guarantees comparable to DP-SGD and provides convergence analyses under standard assumptions. Empirically, DP-AdamW yields consistent utility gains over DP-SGD, DP-Adam, and DP-AdamBC across image, text, and graph tasks, especially at tighter privacy (small $ε$), while DP-AdamW-BC often degrades performance in practice. Overall, the work demonstrates that decoupled weight decay substantially enhances private learning, with bias correction in DP-AdamW-BC offering limited or negative benefits in the tested settings, highlighting practical implications for DP optimizer design.

Abstract

As deep learning methods increasingly utilize sensitive data on a widespread scale, differential privacy (DP) offers formal guarantees to protect against information leakage during model training. A significant challenge remains in implementing DP optimizers that retain strong performance while preserving privacy. Recent advances introduced ever more efficient optimizers, with AdamW being a popular choice for training deep learning models because of its strong empirical performance. We study DP-AdamW and introduce DP-AdamW-BC, a differentially private variant of the AdamW optimizer with DP bias correction for the second moment estimator. We start by showing theoretical results for privacy and convergence guarantees of DP-AdamW and DP-AdamW-BC. Then, we empirically analyze the behavior of both optimizers across multiple privacy budgets ($ε = 1, 3, 7$). We find that DP-AdamW outperforms existing state-of-the-art differentially private optimizers like DP-SGD, DP-Adam, and DP-AdamBC, scoring over 15% higher on text classification, up to 5% higher on image classification, and consistently 1% higher on graph node classification. Moreover, we empirically show that incorporating bias correction in DP-AdamW (DP-AdamW-BC) consistently decreases accuracy, in contrast to the improvement of DP-AdamBC over DP-Adam.
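The abstract describes DP-AdamW only in words, so the following is a minimal NumPy sketch of what one update step could look like, not the authors' implementation. It assumes the usual DP-SGD privatization (per-sample clipping to norm `clip_norm`, Gaussian noise with standard deviation `noise_multiplier * clip_norm`), standard Adam moment estimates, and AdamW-style decoupled weight decay. The optional `bias_correct_noise` branch is an assumption modeled on the DP-AdamBC idea of subtracting the DP noise variance from the second moment estimate, with a hypothetical floor `gamma`. All function names, parameter names, and default values are illustrative.

```python
import numpy as np


def init_state(theta):
    """Create zeroed Adam moment buffers and a step counter (hypothetical helper)."""
    return {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}


def dp_adamw_step(theta, per_sample_grads, state, *, lr=1e-3, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=1e-2, clip_norm=1.0,
                  noise_multiplier=1.0, bias_correct_noise=False,
                  gamma=1e-8, rng=None):
    """One sketched DP-AdamW update; per_sample_grads has shape (B, *theta.shape)."""
    rng = np.random.default_rng() if rng is None else rng
    batch_size = per_sample_grads.shape[0]

    # 1. Clip each per-sample gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(per_sample_grads.reshape(batch_size, -1), axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    scale = scale.reshape((batch_size,) + (1,) * (per_sample_grads.ndim - 1))
    clipped = per_sample_grads * scale

    # 2. Sum, add Gaussian noise calibrated to the clipping norm, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=theta.shape)
    g = (clipped.sum(axis=0) + noise) / batch_size

    # 3. Standard Adam moment estimates computed on the privatized gradient.
    b1, b2 = betas
    state["t"] += 1
    t = state["t"]
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g ** 2
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)

    if bias_correct_noise:
        # Assumed DP-AdamBC-style correction: remove the per-coordinate noise
        # variance (sigma * C / B)^2 from the second moment, floored at gamma.
        noise_var = (noise_multiplier * clip_norm / batch_size) ** 2
        denom = np.sqrt(np.maximum(v_hat - noise_var, gamma))
    else:
        denom = np.sqrt(v_hat) + eps

    # 4. Decoupled weight decay: shrink the weights directly rather than
    #    adding weight_decay * theta to the gradient before the moments.
    theta = theta * (1 - lr * weight_decay)
    return theta - lr * m_hat / denom
```

Because the decay in step 4 acts on the weights directly, it never passes through the clipping, the noise, or the adaptive denominator, which is the distinction between AdamW and Adam with L2 regularization. A call such as `dp_adamw_step(theta, grads, init_state(theta), rng=np.random.default_rng(0))` performs one step; the privacy accounting over many steps is not part of this sketch.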

Paper Structure

This paper contains 26 sections, 3 theorems, 26 equations, 3 figures, 7 tables, 2 algorithms.

Key Result

Theorem 3.1

Suppose that the DP-SGD optimizer $\text{DP-SGD}(\theta, X, y, C, \sigma, B)$ satisfies $(\epsilon, \delta)$-DP with privacy analysis $\phi(T, \theta_i)$. Then both $\text{DP-AdamW}(\theta, X, y, C, \sigma, B)$ and $\text{DP-AdamW-BC}(\theta, X, y, C, \sigma, B)$ satisfy $(\epsilon, \delta)$-DP with the same privacy ana
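A result of this form typically follows from the post-processing property of differential privacy. The sketch below shows that standard argument under assumed notation: $\tilde{g}_t$, $\mathcal{B}_t$, $\mathrm{clip}_C$, and $F_t$ are introduced here for illustration and are not taken from the paper. The paper's own proof may differ in its details.

$$
\tilde{g}_t = \frac{1}{B}\left(\sum_{i \in \mathcal{B}_t} \mathrm{clip}_C(g_{t,i}) + \mathcal{N}(0, \sigma^2 C^2 I)\right),
\qquad
\mathrm{clip}_C(g) = g \cdot \min\left(1, \frac{C}{\lVert g \rVert_2}\right)
$$

$$
\theta_t = F_t\left(\theta_0, \tilde{g}_1, \ldots, \tilde{g}_t\right)
$$

Here $F_t$ collects the moment updates, the optional second-moment correction, and the decoupled weight decay. It touches the data only through the privatized gradients $\tilde{g}_1, \ldots, \tilde{g}_t$, which are the same quantities DP-SGD releases; the weight decay term depends on $\theta_{t-1}$, which is itself a function of earlier privatized gradients and data-independent hyperparameters. Since post-processing cannot weaken an $(\epsilon, \delta)$-DP guarantee, the iterates inherit the guarantee and accounting of DP-SGD run with the same $C$, $\sigma$, $B$, and number of steps.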

Figures (3)

  • Figure 1: Training CIFAR-10 for $\epsilon = 1$ across learning rates for DP-AdamW (left) and DP-AdamW-BC (right), with step on x-axis and training loss on y-axis
  • Figure 2: Evaluating on CIFAR-10 for $\epsilon = 1$ across learning rates for DP-AdamW (left) and DP-AdamW-BC (right), with step on x-axis and test accuracy (proportion) on y-axis
  • Figure 3: Losses when using DP-AdamW under $\epsilon=3$

Theorems & Definitions (12)

  • Theorem 3.1: cf. Proposition 1 of Tang et al. (2023), the DP-AdamBC paper
  • Remark
  • Theorem 3.5
  • Remark
  • Theorem 3.6
  • Remark
  • Proof: Theorem 3.1 (privacy guarantee)
  • Remark
  • Proof: Theorem `conv1` (convergence)
  • Remark
  • ...and 2 more