Optimizer Dynamics at the Edge of Stability with Differential Privacy

Ayana Hussain, Ricky Fang

TL;DR

This work investigates how differential privacy (DP), implemented via per-example gradient clipping and Gaussian noise, reshapes the optimization dynamics of neural networks. By comparing full-batch GD and Adam with their DP variants on CIFAR-10, the authors track raw and preconditioned sharpness to assess whether the Edge of Stability (EoS), Edge of Stochastic Stability (EoSS), and Adaptive Edge of Stability (AEoS) regimes persist. DP generally reduces sharpness and can prevent it from crossing the classical stability thresholds, yet EoS-like patterns persist under certain conditions, with the largest learning rates and privacy budgets approaching these thresholds. DP-GD often stabilizes at finite sharpness levels that depend on ε, while DP-Adam tends to couple raw and preconditioned sharpness at lower values and fails to reach the non-private AEoS threshold. Overall, DP induces distinct, privacy-dependent stability regimes and slower convergence, highlighting the need for DP-aware optimization strategies.

Abstract

Deep learning models can reveal sensitive information about individual training examples. Differential privacy (DP) provides guarantees that restrict such leakage, but it also alters optimization dynamics in poorly understood ways. We study the training dynamics of neural networks under DP by comparing Gradient Descent (GD) and Adam to their privacy-preserving variants. Prior work shows that these optimizers exhibit distinct stability dynamics: full-batch methods train at the Edge of Stability (EoS), while mini-batch and adaptive methods exhibit analogous edge-of-stability behavior. In these regimes, the training loss and the sharpness (the maximum eigenvalue of the training loss Hessian) exhibit characteristic behavior. In DP training, per-example gradient clipping and Gaussian noise modify the update rule, and it is unclear whether these stability patterns persist. We analyze how clipping and noise change the evolution of sharpness and loss. We show that DP generally reduces the sharpness and can prevent optimizers from fully reaching the classical stability thresholds, yet patterns from EoS and the analogous stability regimes of adaptive methods persist, with the largest learning rates and largest privacy budgets approaching, and sometimes exceeding, these thresholds. These findings highlight the unpredictability introduced by DP in neural network optimization.
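
The way clipping and noise modify the update rule can be illustrated with a minimal NumPy sketch of a single full-batch DP-GD step. This is not the authors' implementation: the function name `dp_gd_step` and the parameters `clip_norm` and `noise_multiplier` are hypothetical names introduced here, the noise is assumed to have standard deviation `noise_multiplier * clip_norm` and to be added to the summed clipped gradients, and calibrating the noise to a given ε is left out.

```python
import numpy as np

def dp_gd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-GD step: clip per example, sum, add noise, average."""
    n = per_example_grads.shape[0]
    # L2 norm of each example's gradient (one row per example).
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Rows with norm above clip_norm are rescaled to norm clip_norm; others are unchanged.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Assumed noise model: Gaussian with std noise_multiplier * clip_norm on the sum.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / n
    return params - lr * noisy_mean

# Toy usage on least squares, where the per-example gradient of
# 0.5 * (x @ w - y)**2 with respect to w is (x @ w - y) * x.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)
grads = (X @ w - y)[:, None] * X
w_new = dp_gd_step(w, grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Setting `noise_multiplier=0.0` isolates the effect of clipping alone, which is one way to separate the two modifications when studying their influence on sharpness.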

Paper Structure

This paper contains 42 sections, 1 theorem, 9 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Let $K_1$ be an $\varepsilon_1$-differentially private mechanism and $K_2$ be an $\varepsilon_2$-differentially private mechanism. Then the sequential composition $(K_1, K_2)$ satisfies $(\varepsilon_1 + \varepsilon_2)$-differential privacy.
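
A short sketch of why the budgets add, under the simplifying assumptions that $K_1$ and $K_2$ use independent randomness and have discrete output spaces. The symbols $D$, $D'$ (neighboring datasets) and $o_1$, $o_2$ (outputs) are introduced here for illustration and do not appear in the statement above.

```latex
\begin{align*}
\Pr[(K_1(D), K_2(D)) = (o_1, o_2)]
  &= \Pr[K_1(D) = o_1]\,\Pr[K_2(D) = o_2] \\
  &\le e^{\varepsilon_1}\,\Pr[K_1(D') = o_1]\; e^{\varepsilon_2}\,\Pr[K_2(D') = o_2] \\
  &= e^{\varepsilon_1 + \varepsilon_2}\,\Pr[(K_1(D'), K_2(D')) = (o_1, o_2)].
\end{align*}
```

For example, composing a mechanism with $\varepsilon_1 = 0.5$ and one with $\varepsilon_2 = 1.5$ yields a guarantee of $\varepsilon = 2$.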

Figures (5)

  • Figure 1: Sharpness and preconditioned sharpness under full-batch (FB) and DP training. Top row: GD and DP-GD sharpness. For standard GD, sharpness across learning rates (LRs) replicates the Edge of Stability (EoS), with each LR's stability threshold shown as a solid line in the same color and a lightly colored line for the training loss. Larger LRs reach the threshold faster. Under DP-GD, the largest LRs generally stabilize, particularly for high epsilon, but sharpness often exceeds the normal threshold. Bottom row and top row right: Adam and DP-Adam sharpness and preconditioned sharpness. FB Adam exhibits most Adaptive Edge of Stability (AEoS) behavior; however, the larger LRs reach the adaptive stability threshold while the smallest one does not. For DP-Adam, sharpness and preconditioned sharpness appear to stabilize together; most LRs and epsilons do not reach the AEoS threshold, although the largest LR and epsilon approach it. These results illustrate how DP modifies sharpness dynamics and underscore the unpredictability of DP training.
  • Figure 2: DP-GD sharpness for the largest learning rate. Sharpness trajectories for DP-GD ($\eta=0.1$) across privacy budgets $\varepsilon$. Solid, dark curves show sharpness, lighter curves show training loss, and black denotes the non-DP baseline. The dotted line marks the stability threshold $2/\eta$. Higher $\varepsilon$ yields larger stabilized sharpness, indicating an $\varepsilon$-dependent edge-of-stability regime. If sharpness flattening is interpreted as edge-of-stability (EoS) behavior, a breakeven point naturally emerges where sharpness stops increasing and begins to stabilize (possibly creating a new DP-induced EoS).
  • Figure 3: DP-GD sharpness across smaller learning rates. Sharpness dynamics for DP-GD under the same experimental setup as the previous figure, shown for three learning rates: $\eta = 0.04$ (left), $\eta = 0.0182$ (middle), and $\eta = 0.025$ (right). Across all learning rate and privacy budget $(\eta,\varepsilon)$ pairs, we observe progressive sharpening, with larger $\varepsilon$ consistently yielding higher sharpness. Flattening behavior is partially observed only for $\eta = 2/50 = 0.04$ at $\varepsilon = 16$ (but could be due to randomness); for other configurations, convergence is slower and no definitive edge-of-stability regime can be identified. Training losses for DP runs remain largely monotonic, further indicating slow convergence in these settings.
  • Figure 4: DP-Adam sharpness for the largest learning rate. Top: Preconditioned sharpness shows progressive sharpening across $\varepsilon \in \{16, 32, 64\}$, with lower $\varepsilon$ stabilizing at lower values. The black dotted line marks the Adam stability threshold $38/\eta$; no $\varepsilon$ reaches this threshold. The thick black "Non-DP" curve (no noise) shows baseline behavior. All loss curves (right $y$-axis) are light solid lines matching their sharpness colors. Bottom: Raw sharpness exhibits similar $\varepsilon$-dependent flattening. Progressive sharpening persists for preconditioned sharpness but plateaus after breakeven (flattening onset), with reduced oscillatory instability compared to non-private Adam. This suggests a new potential AEoS under DP where breakeven marks stability.
  • Figure 5: DP-Adam sharpness for smaller learning rates. Sharpness dynamics for DP-Adam under the same experimental setup as previous figures, shown for learning rates $\eta \in \{3\times10^{-5}, 10^{-4}, 3\times10^{-4}\}$. Panels (a–c) show preconditioned sharpness with the corresponding theoretical stability threshold, while panels (d–f) show raw sharpness. All plots include the non-DP baseline and DP runs for $\varepsilon \in \{16, 32, 64\}$, with training losses overlaid on a secondary axis. Progressive sharpening is observed across all $(\eta, \varepsilon)$ pairs, with higher $\varepsilon$ consistently producing higher sharpness. No configuration reaches the non-DP stability threshold, and clear flattening is only observed at larger learning rates, while smaller $\eta$ exhibit slow convergence with largely monotonic losses.
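
The thresholds quoted in the captions ($2/\eta$ for GD, $38/\eta$ for Adam) can be made concrete with a small Python sketch. It rests on two assumptions not stated in the captions: that the Adam threshold follows the form $(2 + 2\beta_1)/((1 - \beta_1)\eta)$ from the adaptive edge-of-stability literature, which reduces to $38/\eta$ for the commonly used $\beta_1 = 0.9$, and that a one-dimensional quadratic is an adequate toy model of why $2/\eta$ marks the stability boundary for GD. All function names are hypothetical.

```python
def gd_threshold(lr):
    """Classical GD stability threshold on sharpness: 2 / lr."""
    return 2.0 / lr

def adam_threshold(lr, beta1=0.9):
    """Assumed Adam threshold on preconditioned sharpness.

    With beta1 = 0.9 this evaluates to 38 / lr.
    """
    return (2.0 + 2.0 * beta1) / ((1.0 - beta1) * lr)

def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return the final |x|.

    Each step multiplies x by (1 - lr * sharpness), so the iterates shrink
    when sharpness < 2 / lr and grow when sharpness > 2 / lr.
    """
    x = x0
    for _ in range(steps):
        x = x - lr * sharpness * x
    return abs(x)

# Thresholds for the learning rates mentioned in the figure captions.
for lr in (0.1, 0.04, 0.025):
    print(f"GD   lr={lr}: threshold {gd_threshold(lr):.1f}")
for lr in (3e-5, 1e-4, 3e-4):
    print(f"Adam lr={lr}: threshold {adam_threshold(lr):.0f}")

# Just below and just above the GD threshold for lr = 0.1.
below = gd_on_quadratic(sharpness=19.0, lr=0.1)
above = gd_on_quadratic(sharpness=21.0, lr=0.1)
```

On the toy quadratic, `below` decays toward zero while `above` grows, mirroring the role the dotted threshold lines play in the sharpness plots.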

Theorems & Definitions (1)

  • Theorem 1: Sequential composition; dwork2010differential