Weighted Averaged Stochastic Gradient Descent: Asymptotic Normality and Optimality
Ziyang Wei, Wanrong Zhu, Wei Biao Wu
TL;DR
This work develops a general theory for weighted averaged SGD. It proves asymptotic normality of the weighted average $\tilde{x}_n$ of the SGD iterates, $\sqrt{n}(\tilde{x}_n - x^*) \Rightarrow \mathcal{N}(0, wV)$, where $V = A^{-1}SA^{-1}$ is the sandwich covariance and $w$ is a prefactor determined by the weights. A functional CLT for partial sums enables online inference through covariance estimation and pivotal statistics. The paper analyzes two concrete averaging schemes (polynomial-decay and suffix averaging) and introduces an adaptive weighted averaging scheme that achieves the optimal finite-sample MSE in a linear model while preserving the ASGD-like asymptotic covariance. Empirical results validate the CLT under various losses and show that the adaptive method has better non-asymptotic performance. Overall, the work offers principled weight design for SGD, enabling online uncertainty quantification and improved finite-sample behavior in both smooth and certain non-smooth settings.
Abstract
Stochastic Gradient Descent (SGD) is one of the most popular algorithms in statistical and machine learning due to its computational and memory efficiency. Various averaging schemes have been proposed to accelerate the convergence of SGD in different settings. In this paper, we explore a general averaging scheme for SGD. Specifically, we establish the asymptotic normality of a broad range of weighted averaged SGD solutions and provide asymptotically valid online inference approaches. Furthermore, we propose an adaptive averaging scheme that attains both the optimal statistical rate and favorable non-asymptotic convergence, drawing insights from the optimal weight for the linear model in terms of non-asymptotic mean squared error (MSE).
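To make the idea of a weighted average of SGD iterates concrete, the following is a minimal sketch rather than the paper's implementation. It runs SGD on a one-dimensional mean-estimation problem and combines the iterates under three weightings: uniform (ASGD), suffix averaging, and polynomial-decay averaging. Several things here are assumptions chosen for illustration and are not taken from the paper: the toy objective, the step size $\eta_i = \eta_0 i^{-\alpha}$, the suffix fraction, the decay parameter `gamma`, and all function names. The polynomial-decay weights follow the recursion commonly used in the literature, $\bar{x}_t = (1 - c_t)\bar{x}_{t-1} + c_t x_t$ with $c_t = (\gamma + 1)/(t + \gamma)$, which may differ in detail from the paper's definition. The adaptive scheme is not sketched because its weights are not specified here.

```python
import random


def sgd_iterates(n, x_star=2.0, x0=0.0, eta0=0.5, alpha=0.75, seed=0):
    """Run SGD on f(x) = E[(x - Y)^2] / 2 with Y ~ N(x_star, 1).

    Toy problem assumed for illustration; returns the iterates x_1..x_n.
    """
    rng = random.Random(seed)
    x = x0
    iterates = []
    for i in range(1, n + 1):
        y = rng.gauss(x_star, 1.0)
        grad = x - y                      # stochastic gradient at x
        x -= eta0 * i ** (-alpha) * grad  # step size eta0 * i^(-alpha)
        iterates.append(x)
    return iterates


def weighted_average(iterates, weights):
    """Return sum_i w_i x_i / sum_i w_i."""
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, iterates)) / total


def uniform_weights(n):
    """Equal weights on all iterates (ordinary averaged SGD)."""
    return [1.0] * n


def suffix_weights(n, frac=0.5):
    """Equal weights on the last frac * n iterates, zero elsewhere."""
    start = n - max(1, int(frac * n))
    return [0.0 if i < start else 1.0 for i in range(n)]


def polynomial_decay_weights(n, gamma=3.0):
    """Weights implied by avg_t = (1 - c_t) avg_{t-1} + c_t x_t,
    with c_t = (gamma + 1) / (t + gamma).

    Unrolling the recursion gives w_t = c_t * prod_{s > t} (1 - c_s),
    computed here in one backward pass. The weights sum to 1 since c_1 = 1.
    """
    weights = [0.0] * n
    tail = 1.0  # running product of (1 - c_s) for s > t
    for t in range(n, 0, -1):
        c = (gamma + 1.0) / (t + gamma)
        weights[t - 1] = c * tail
        tail *= 1.0 - c
    return weights


if __name__ == "__main__":
    xs = sgd_iterates(20000)
    schemes = {
        "uniform": uniform_weights(len(xs)),
        "suffix": suffix_weights(len(xs)),
        "polynomial-decay": polynomial_decay_weights(len(xs)),
    }
    for name, w in schemes.items():
        print(name, weighted_average(xs, w))
```

All three schemes are instances of the same weighted average and differ only in how much mass they place on late iterates. Storing every iterate is done here for clarity only; in an online setting, uniform and polynomial-decay averages can be updated recursively in constant memory.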
