Dual Averaging Converges for Nonconvex Smooth Stochastic Optimization
Tuo Liu, El Mehdi Saad, Wojciech Kotłowski, Francesco Orabona
TL;DR
This paper provides the first iterate-level convergence guarantees for stochastic Dual Averaging (SDA) on smooth, potentially non-convex objectives, showing rates matching SGD up to a $\log T$ factor: $O\big(1/T + \sigma\log T/\sqrt{T}\big)$ under a strong-growth condition with deterministic steps, and a high-probability bound under sub-Gaussian noise. A key technique is interpreting SDA as SGD on a time-varying sequence of regularized functions $f_t(x) = f(x) + \frac{\gamma_t}{2}\|x\|^2$, enabling a generalized descent framework. The authors also introduce ADA-DA, an AdaGrad-style adaptive variant that achieves the same rates without knowing the noise variance, provided iterates stay bounded. Together, these results close a long-standing open problem by delivering a complete convergence theory for dual averaging in non-convex stochastic settings, while identifying practical limitations and directions for fully adaptive, iterate-independent guarantees. The work advances the understanding of first-order methods in non-convex stochastic optimization and broadens the theoretical foundation for dual averaging in modern machine learning tasks.
Abstract
Dual averaging and gradient descent, together with their stochastic variants, stand as the two canonical recipe books for first-order optimization: every modern variant can be viewed as a descendant of one or the other. In the convex regime, these algorithms have been deeply studied, and we know that they are essentially equivalent in terms of theoretical guarantees. On the other hand, in the non-convex setting, the situation is drastically different: while we know that SGD can minimize the gradient of non-convex smooth functions, no finite-time complexity guarantee for Stochastic Dual Averaging (SDA) was known in the same setting. In this paper, we close this gap by a reduction that views SDA as SGD applied to a sequence of implicitly regularized objectives. We show that a tuned SDA exhibits a rate of convergence $\mathcal{O}(1 / T + \sigma\log T/ \sqrt{T})$, similar to that of SGD under the same assumptions. To the best of our knowledge, this is the first complete convergence theory for dual averaging on non-convex smooth stochastic problems without restrictive assumptions, closing a long-standing open problem in the field. Beyond the base algorithm, we also discuss ADA-DA, a variant that marries SDA with AdaGrad's auto-scaling, which achieves the same rate without requiring knowledge of the noise variance.
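The reduction mentioned above, viewing SDA as SGD on implicitly regularized objectives, can be checked numerically on a toy problem. The sketch below is not the paper's algorithm or tuning. It assumes the textbook dual-averaging update $x_{t+1} = -\frac{1}{\lambda_t}\sum_{s \le t} g_s$ started at $x_1 = 0$, an illustrative schedule $\lambda_t = \sqrt{t}$, and a hypothetical one-dimensional test function $f(x) = x^2/(1+x^2)$. Under these assumptions, rearranging $\lambda_t x_{t+1} = \lambda_{t-1} x_t - g_t$ gives an SGD step of size $1/\lambda_t$ on $f(x) + \frac{\lambda_t - \lambda_{t-1}}{2}\|x\|^2$. The symbol $\lambda_t$ is this sketch's own notation; the paper's $\gamma_t$ may be indexed or scaled differently.

```python
import math
import random


def grad_f(x):
    """Gradient of the smooth non-convex test function f(x) = x^2 / (1 + x^2).

    The test function is an assumption of this sketch, not taken from the paper.
    """
    return 2.0 * x / (1.0 + x * x) ** 2


def compare_sda_and_regularized_sgd(T=200, sigma=0.5, seed=0):
    """Run SDA and SGD-on-regularized-objectives with shared gradient noise.

    Returns the final SDA iterate, the final SGD iterate, and the largest
    absolute gap observed between the two trajectories.
    """
    rng = random.Random(seed)
    x_da = 0.0       # dual-averaging iterate, started at x_1 = 0
    x_sgd = 0.0      # SGD iterate on the regularized objectives
    grad_sum = 0.0   # running sum of stochastic gradients used by SDA
    lam_prev = 0.0   # lambda_0 = 0 by convention in this sketch
    max_gap = 0.0

    for t in range(1, T + 1):
        lam = math.sqrt(t)              # assumed schedule lambda_t = sqrt(t)
        noise = rng.gauss(0.0, sigma)   # same noise sample for both methods

        # Dual averaging: x_{t+1} = -(1 / lambda_t) * sum_{s <= t} g_s
        grad_sum += grad_f(x_da) + noise
        x_da = -grad_sum / lam

        # SGD with step 1 / lambda_t on
        # f(x) + (lambda_t - lambda_{t-1}) / 2 * x^2
        g = grad_f(x_sgd) + noise
        x_sgd = x_sgd - (g + (lam - lam_prev) * x_sgd) / lam

        lam_prev = lam
        max_gap = max(max_gap, abs(x_da - x_sgd))

    return x_da, x_sgd, max_gap


if __name__ == "__main__":
    x_da, x_sgd, gap = compare_sda_and_regularized_sgd()
    print(f"SDA iterate: {x_da:.6f}, SGD iterate: {x_sgd:.6f}, max gap: {gap:.2e}")
```

The two trajectories should agree up to floating-point rounding, which is the algebraic identity behind the reduction. The sketch says nothing about convergence rates, which depend on the paper's specific step-size tuning and noise assumptions.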
