Optimizer Dynamics at the Edge of Stability with Differential Privacy
Ayana Hussain, Ricky Fang
TL;DR
This work investigates how differential privacy (DP), implemented via per-example gradient clipping and Gaussian noise, reshapes the optimization dynamics of neural networks. By comparing full-batch GD and Adam with their DP variants on CIFAR-10, the authors track raw and preconditioned sharpness to assess whether the Edge of Stability (EoS), Edge of Stochastic Stability (EoSS), and Adaptive Edge of Stability (AEoS) regimes persist. DP generally reduces sharpness and can prevent training from crossing the classical stability thresholds, yet EoS-like patterns persist under certain conditions, with large learning rates and large privacy budgets approaching these thresholds. DP-GD often stabilizes at finite sharpness levels that depend on ε, while DP-Adam tends to couple raw and preconditioned sharpness at lower values and fails to reach the non-private AEoS thresholds. Overall, DP induces distinct, privacy-dependent stability regimes and slower convergence, highlighting the need for DP-aware optimization strategies.
Abstract
Deep learning models can reveal sensitive information about individual training examples. Differential privacy (DP) provides guarantees that restrict such leakage, but it also alters optimization dynamics in poorly understood ways. We study the training dynamics of neural networks under DP by comparing Gradient Descent (GD) and Adam to their privacy-preserving variants. Prior work shows that these optimizers exhibit distinct stability dynamics: full-batch methods train at the Edge of Stability (EoS), while mini-batch and adaptive methods exhibit analogous edge-of-stability behavior. In these regimes, the training loss and the sharpness (the maximum eigenvalue of the training loss Hessian) exhibit characteristic behavior. In DP training, per-example gradient clipping and Gaussian noise modify the update rule, and it is unclear whether these stability patterns persist. We analyze how clipping and noise change the evolution of sharpness and loss. We show that DP generally reduces the sharpness and can prevent optimizers from fully reaching the classical stability thresholds. Even so, patterns from EoS and the analogous stability regimes of adaptive methods persist, with the largest learning rates and largest privacy budgets approaching, and sometimes exceeding, these thresholds. These findings highlight the unpredictability introduced by DP in neural network optimization.
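
To make the two central quantities concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the standard DP-GD formulation (clip each per-example gradient to an L2 norm bound, sum, add Gaussian noise with standard deviation `noise_multiplier * clip_norm`, then average) and estimates sharpness by power iteration on an explicitly given Hessian. All function and parameter names (`dp_gd_step`, `clip_norm`, `noise_multiplier`, `sharpness`) are hypothetical and chosen for illustration only.

```python
import math
import random


def clip(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_gd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One full-batch DP-GD update (assumed standard clip-then-noise form)."""
    n = len(per_example_grads)
    # Clip every per-example gradient, then sum coordinate-wise.
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Add independent Gaussian noise to each coordinate of the sum, then average.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / n for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


def sharpness(hessian, iters=200):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration.

    Assumption: the Hessian is small enough to store as a list of rows and is
    positive semi-definite, so the dominant eigenvalue is the maximum one.
    """
    v = [1.0] * len(hessian)
    lam = 0.0
    for _ in range(iters):
        w = [sum(h * x for h, x in zip(row, v)) for row in hessian]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return lam
```

For plain GD on a quadratic loss, the classical stability threshold on the sharpness is 2 / lr, so a call such as `sharpness(hessian) < 2.0 / lr` checks whether a given step size sits below that threshold. For a real network the Hessian would not be formed explicitly; Hessian-vector products would replace the matrix-vector product in the loop.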
