Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective

Enea Monzio Compagnoni; Alessandro Stanghellini; Rustem Islamov; Aurelien Lucchi; Anastasiia Koloskova

TL;DR

This work revisits how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers and showing a sharp contrast between DP-SGD and DP-SignSGD under fixed hyperparameters.

Abstract

Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges at a Privacy-Utility Trade-Off of $\mathcal{O}(1/\varepsilon^2)$ with speed independent of $\varepsilon$, while DP-SignSGD converges at a speed linear in $\varepsilon$ with an $\mathcal{O}(1/\varepsilon)$ trade-off, dominating in high-privacy or large batch noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of DP-SignSGD is essentially $\varepsilon$-independent. This makes adaptive methods far more practical, as their hyperparameters transfer across privacy levels with little or no re-tuning. Empirical results confirm our theory across training and test metrics, and empirically extend from DP-SignSGD to DP-Adam.
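To make the two update rules concrete, here is a minimal NumPy sketch of one DP-SGD step and one DP-SignSGD step under per-example clipping. It is an illustration under stated assumptions, not the authors' implementation: the function names, the convention of adding Gaussian noise with standard deviation `C * sigma_dp` to the summed clipped gradients, the choice to take the sign of the privatized gradient, and the toy quadratic with its constants are all assumptions made here.

```python
import numpy as np

def clip_per_example(grads, C):
    """Rescale each row (one per-example gradient) to L2 norm at most C."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    return grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))

def private_gradient(grads, C, sigma_dp, rng):
    """Sum clipped per-example gradients, add Gaussian noise, divide by B.

    Assumed convention: noise std is C * sigma_dp on the sum.
    """
    B, d = grads.shape
    noise = rng.normal(0.0, C * sigma_dp, size=d)
    return (clip_per_example(grads, C).sum(axis=0) + noise) / B

def dp_sgd_step(x, grads, eta, C, sigma_dp, rng):
    """Gradient step along the privatized gradient."""
    return x - eta * private_gradient(grads, C, sigma_dp, rng)

def dp_signsgd_step(x, grads, eta, C, sigma_dp, rng):
    """Step of fixed size eta per coordinate, using only the sign
    of the privatized gradient (assumed form of DP-SignSGD)."""
    return x - eta * np.sign(private_gradient(grads, C, sigma_dp, rng))

def run(step, steps=500, eta=0.01, C=1.0, sigma_dp=2.0, B=32, seed=0):
    """Run `step` on a toy quadratic f(x) = 0.5 * x^T H x (hypothetical H)."""
    rng = np.random.default_rng(seed)
    H = np.diag([1.0, 0.5, 0.1])
    x = np.ones(3)
    for _ in range(steps):
        # Per-example gradients: true gradient H x plus batch noise.
        grads = H @ x + rng.normal(scale=0.5, size=(B, 3))
        x = step(x, grads, eta, C, sigma_dp, rng)
    return 0.5 * x @ H @ x

if __name__ == "__main__":
    for sigma_dp in (1.0, 4.0, 16.0):
        print(sigma_dp,
              run(dp_sgd_step, sigma_dp=sigma_dp),
              run(dp_signsgd_step, sigma_dp=sigma_dp))
```

Since a smaller $\varepsilon$ calls for a larger noise multiplier, sweeping `sigma_dp` at a fixed `eta` is one rough way to probe the fixed-hyperparameter regime described above. The numbers from this toy problem are not meant to reproduce the paper's figures.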

Paper Structure (42 sections, 24 theorems, 164 equations, 13 figures, 1 table)

Introduction
Contributions.
Related Work
SDE approximations.
Differential privacy in optimization.
Adaptive DP optimizers.
Preliminaries
General Setup and Noise Assumptions.
SDE approximation.
Differential Privacy.
Theoretical Results
Notation.
Protocol A: Fixed Hyperparameters
DP-SGD: The privacy-utility trade-off
DP-SignSGD: The privacy-utility trade-off
...and 27 more sections

Key Result

Theorem 3.4

For $q=\frac{B}{n}$ where $B$ is the batch size, $n$ is the number of training points, and number of iterations $T$, $\exists c_1, c_2$ s.t. $\forall \varepsilon < c_1 q^2 T$, if the noise multiplier $\sigma_{\text{DP}}$ satisfies $\sigma_{\text{DP}} \geq c_2 \frac{q \sqrt{T \log (1/\delta)}}{ \vare

Figures (13)

Figure 1: Empirical validation of the privacy-utility trade-off predicted by Thm. \ref{thm:lossbound_sgd} and Thm. \ref{thm:lossbound_sign}, comparing DP-SGD, DP-SignSGD, and DP-Adam. Our focus is on verifying the functional dependence of the asymptotic loss levels on $\varepsilon$. Left: On a quadratic convex function $f(x)=\tfrac{1}{2}x^\top Hx$, the observed empirical loss values perfectly match the theoretical predictions (Eq. \ref{eq:ph2_dpsgd}, Eq. \ref{eq:ph2_dpsignsgd}). Center and Right: Logistic regressions on the IMDB dataset (center) and the StackOverflow dataset (right) confirm the same pattern: the utility of DP-SGD scales as $\tfrac{1}{\varepsilon^2}$, while the utility of DP-SignSGD scales linearly in $\tfrac{1}{\varepsilon}$. Across all settings, we observe that the insights obtained for DP-SignSGD extend to DP-Adam as well as to the test loss (see Figure \ref{fig:nm_scaling_test}). For experimental details see Appendix \ref{sec:fig1}.
Figure 2: Empirical validation of the convergence speeds predicted by Thm. \ref{thm:lossbound_sgd} and Thm. \ref{thm:lossbound_sign}. We compare DP-SGD, DP-SignSGD, and DP-Adam as we train a logistic regression on the IMDB dataset (Top Row) and on the StackOverflow dataset (Bottom Row). In both tasks, we verify that when DP-SGD converges, its speed is unaffected by $\varepsilon$. As expected, it diverges when $\varepsilon$ is too small. DP-SignSGD and DP-Adam are faster when $\varepsilon$ is large and never diverge, even when $\varepsilon$ is small. Crucially, Figure \ref{fig:speed_test_loss} shows that these insights are also verified on the test loss. For experimental details see Appendix \ref{sec:fig2}.
Figure 3: Logistic regression on the IMDB dataset. From left to right, we decrease the batch noise, i.e., increase the batch size, taking values $B \in \{48,56,64,72,80\}$. As per Theorem \ref{thm:eps_star}, the privacy threshold $\varepsilon^\star$ that determines when DP-SignSGD is more advantageous than DP-SGD shifts to the left. This confirms that if there is more noise due to the batch size, less privacy noise is needed for DP-SignSGD to be preferable over DP-SGD. For experimental details see Appendix \ref{sec:fig3}.
Figure 4: Empirical verification of Thm. \ref{thm:sgd_opt} and Thm. \ref{thm:signsgd_opt} under Protocol B on the IMDB dataset (Top Row) and on the StackOverflow dataset (Bottom Row). We tune $(\eta, C)$ of each optimizer for each $\varepsilon$ and confirm that: $i)$ all methods achieve comparable performance across privacy budgets; $ii)$ the optimal $\eta$ of DP-SGD scales linearly with $\varepsilon$, while that of adaptive methods is essentially $\varepsilon$-independent; $iii)$ failing to sweep over the "best" range of learning rates causes DP-SGD to severely underperform, whereas adaptive methods are resilient. On the left, DP-SGD degrades sharply for small $\varepsilon$. Indeed, the right panels show that the selected optimal $\eta$ flattens out, whereas the theoretical one would have kept decaying linearly: the "best" $\eta$ was simply missing from the grid. A posteriori, re-running the sweep with a larger grid (DP-SGD Tuned) recovers the scaling law and matches the performance of adaptive methods. For experimental details see Appendix \ref{sec:fig4}.
Figure A.1: Numerical validation of the approximation used in Equation \ref{eq:hypgeomapprox}. For several values of $d$, we plot the confluent hypergeometric function as a function of the signal-to-noise ratio $z$. In the realistic range observed in Malladi2022AdamSDE, approximating this function by $1$ is extremely accurate.
...and 8 more figures

Theorems & Definitions (50)

Definition 3.1
Definition 3.2
Definition 3.3
Theorem 3.4
Theorem 4.1
Theorem 4.2
Theorem 4.3
Theorem 4.4
Theorem 4.5
Theorem 4.6: DP-SGD
...and 40 more
