DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning
Jay Chooi, Kevin Cong, Russell Li, Lillian Sun
TL;DR
The paper investigates differentially private optimization through DP-AdamW and DP-AdamW-BC, variants of AdamW that decouple weight decay and apply DP noise to gradients. It proves that both retain DP guarantees comparable to DP-SGD and provides convergence analyses under standard assumptions. Empirically, DP-AdamW yields consistent utility gains over DP-SGD, DP-Adam, and DP-AdamBC across image, text, and graph tasks, especially under tighter privacy budgets (small $ε$), while DP-AdamW-BC often degrades performance in practice. Overall, decoupled weight decay substantially improves private learning, whereas the bias correction in DP-AdamW-BC offers limited or negative benefit in the tested settings, with practical implications for DP optimizer design.
Abstract
As deep learning methods increasingly utilize sensitive data on a widespread scale, differential privacy (DP) offers formal guarantees to protect against information leakage during model training. A significant challenge remains in implementing DP optimizers that retain strong performance while preserving privacy. Recent advances introduced ever more efficient optimizers, with AdamW being a popular choice for training deep learning models because of its strong empirical performance. We study \emph{DP-AdamW} and introduce \emph{DP-AdamW-BC}, a differentially private variant of the AdamW optimizer with DP bias correction for the second moment estimator. We start by showing theoretical results for privacy and convergence guarantees of DP-AdamW and DP-AdamW-BC. Then, we empirically analyze the behavior of both optimizers across multiple privacy budgets ($ε = 1, 3, 7$). We find that DP-AdamW outperforms existing state-of-the-art differentially private optimizers like DP-SGD, DP-Adam, and DP-AdamBC, scoring over 15\% higher on text classification, up to 5\% higher on image classification, and consistently 1\% higher on graph node classification. Moreover, we empirically show that incorporating bias correction in DP-AdamW (DP-AdamW-BC) consistently decreases accuracy, in contrast to the improvement that DP-AdamBC provides over DP-Adam.
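To make the idea concrete, the following is a minimal NumPy sketch of one DP-AdamW-style update, assuming the usual DP-SGD recipe (per-sample L2 clipping followed by Gaussian noise) combined with the standard AdamW update with decoupled weight decay. It is an illustration under those assumptions, not the authors' implementation: the function names `init_state` and `dp_adamw_step`, the parameter names, and all default values are hypothetical. The DP-AdamW-BC second-moment correction is left out because its exact form is not given in this section, and no privacy accounting is performed.

```python
import numpy as np


def init_state(params):
    """Create zeroed Adam moment estimates and a step counter (hypothetical helper)."""
    return {"t": 0, "m": np.zeros_like(params), "v": np.zeros_like(params)}


def dp_adamw_step(params, per_sample_grads, state, *, lr=1e-3, beta1=0.9,
                  beta2=0.999, eps=1e-8, weight_decay=1e-2, clip_norm=1.0,
                  noise_multiplier=1.0, rng=None):
    """One illustrative DP-AdamW-style step on a flat parameter vector.

    params:           array of shape (d,)
    per_sample_grads: array of shape (batch_size, d), one gradient per example
    """
    rng = np.random.default_rng() if rng is None else rng
    batch_size = per_sample_grads.shape[0]

    # 1. Clip each per-sample gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale

    # 2. Sum the clipped gradients, add Gaussian noise with standard deviation
    #    noise_multiplier * clip_norm, then average over the batch.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / batch_size

    # 3. Standard Adam moment estimates, computed on the privatized gradient.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * noisy_grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * noisy_grad ** 2
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])

    # 4. Decoupled weight decay: the decay term acts on the parameters directly
    #    and does not pass through the noisy gradient or the moment estimates.
    return params - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * params)
```

Because the decay term bypasses the moment estimates, it shrinks the parameters even when the privatized gradient is zero. Calling `dp_adamw_step(np.ones(3), np.zeros((4, 3)), init_state(np.ones(3)), lr=0.1, weight_decay=0.1, noise_multiplier=0.0)` moves each coordinate from 1.0 to 0.99.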
