Almost Sure Convergence Analysis of Differentially Private Stochastic Gradient Methods
Amartya Mukherjee, Jun Liu
TL;DR
This work asks whether DP-SGD and its momentum variants converge along individual trajectories, not just in expectation. Using a supermartingale framework and careful bounds on the bias and noise introduced by gradient clipping and Gaussian perturbation, the authors establish almost sure convergence for DP-SGD, DP-SHB, and DP-NAG under standard $L$-smoothness and, where applicable, strong convexity assumptions, with decaying step sizes $\alpha_t=\Theta(1/t^{1-\theta})$, $\theta\in(0,1/2)$. They show that a weighted gradient-descent quantity $\Phi_t$ (and its strongly convex counterpart $\Phi_t^\mu$) satisfies $\min_{1\le i\le t} \Phi_i(\mathbf{x}_i) = o((\sum_{i=1}^{t-1} \alpha_i)^{-1})$ a.s., and, via Orabona's lemma, that the last iterates converge: $\nabla f(\mathbf{x}_t) \to 0$ a.s. for DP-SHB and DP-NAG (and under analogous conditions for DP-SGD). These results give pathwise guarantees for privacy-preserving stochastic optimization and strengthen the theoretical foundations for deploying DP-SGD in practice. The authors also outline future work on explicit rates that depend on the clipping parameter and the privacy noise.
Abstract
Differentially private stochastic gradient descent (DP-SGD) has become the standard algorithm for training machine learning models with rigorous privacy guarantees. Despite its widespread use, the theoretical understanding of its long-run behavior remains limited: existing analyses typically establish convergence in expectation or with high probability, but do not address the almost sure convergence of single trajectories. In this work, we prove that DP-SGD converges almost surely under standard smoothness assumptions, in both nonconvex and strongly convex settings, provided the step sizes satisfy standard decay conditions. Our analysis extends to momentum variants such as the stochastic heavy ball (DP-SHB) and Nesterov's accelerated gradient (DP-NAG), where we show that careful energy constructions yield similar guarantees. These results provide stronger theoretical foundations for differentially private optimization and suggest that, despite privacy-induced distortions, the algorithm remains pathwise stable in both convex and nonconvex regimes.
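To make the setting concrete, the following is a minimal sketch of a single DP-SGD trajectory with per-sample gradient clipping, Gaussian noise, and a step size of the form $\alpha_t = \alpha_0 / t^{1-\theta}$ with $\theta\in(0,1/2)$. It is an illustration under assumptions, not the authors' implementation: the quadratic objective, the function names (`clip`, `step_size`, `dp_sgd`), and all parameter values (`alpha0`, `theta`, `c`, `sigma`, `batch`) are hypothetical choices made for this example.

```python
import math
import random


def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [v * scale for v in g]


def step_size(t, alpha0=0.5, theta=0.25):
    """Decaying step size alpha_t = alpha0 / t^(1 - theta), theta in (0, 1/2)."""
    return alpha0 / t ** (1.0 - theta)


def dp_sgd(data, x0, steps=2000, batch=8, c=1.0, sigma=1.0, seed=0):
    """Run one DP-SGD trajectory on f(x) = mean_i 0.5 * ||x - a_i||^2.

    The objective is an assumed toy problem; c is the clipping threshold and
    sigma the noise multiplier (both illustrative values).
    """
    rng = random.Random(seed)
    x = list(x0)
    d = len(x)
    for t in range(1, steps + 1):
        # Sample a mini-batch with replacement.
        idx = [rng.randrange(len(data)) for _ in range(batch)]
        total = [0.0] * d
        for i in idx:
            # Per-sample gradient of 0.5 * ||x - a_i||^2, then clip it.
            g = clip([x[j] - data[i][j] for j in range(d)], c)
            total = [s + v for s, v in zip(total, g)]
        # Add Gaussian noise with std sigma * c to the clipped sum, then average.
        noisy = [(s + rng.gauss(0.0, sigma * c)) / batch for s in total]
        a = step_size(t)
        x = [xi - a * gi for xi, gi in zip(x, noisy)]
    return x


data = [[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]]
x_final = dp_sgd(data, [10.0, -10.0])
```

Fixing `seed` and varying it across calls gives different individual trajectories, which is the per-run object that almost sure convergence statements describe. Clipping biases the averaged gradient, so the point a run settles near need not coincide with the minimizer of the unclipped objective.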
