Weighted Averaged Stochastic Gradient Descent: Asymptotic Normality and Optimality

Ziyang Wei, Wanrong Zhu, Wei Biao Wu

TL;DR

This work develops a general theory for weighted averaged SGD, proving asymptotic normality with a sandwich covariance form $V = A^{-1}SA^{-1}$ and a weight-dependent prefactor $w$ in the limit $\sqrt{n}(\tilde{x}_n - x^*) \Rightarrow \mathcal{N}(0, wV)$. It provides online inference methods via covariance estimation and pivotal statistics, enabled by a functional CLT for partial sums. The paper analyzes concrete averaging schemes—polynomial-decay and suffix averaging—and introduces an adaptive weighted averaging scheme that achieves the optimal finite-sample MSE in a linear model while preserving the ASGD-like asymptotic covariance. Empirical results validate the CLT under various losses and demonstrate superior non-asymptotic performance of the adaptive method, highlighting a practical path to fast, statistically efficient SGD variants. Overall, the work offers principled weight design for SGD, enabling online uncertainty quantification and improved finite-sample behavior in both smooth and certain non-smooth settings.
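
To make the averaging schemes named above concrete, here is a minimal sketch of a weighted average $\tilde{x}_n = \sum_{i=1}^n w_{n,i} x_i$ of SGD iterates. The function names, the toy quadratic loss, and the constants are assumptions chosen for illustration, not taken from the paper. The polynomial-decay weights follow the commonly used recursion $\bar{x}_t = (1 - c_t)\bar{x}_{t-1} + c_t x_t$ with $c_t = (\gamma+1)/(t+\gamma)$, which may differ from the paper's exact parametrization.

```python
import numpy as np

def polynomial_decay_weights(n, gamma=0.0):
    """Weights w_{n,i} implied by xbar_t = (1 - c_t) * xbar_{t-1} + c_t * x_t
    with c_t = (gamma + 1) / (t + gamma); gamma = 0 gives the plain average."""
    t = np.arange(1, n + 1)
    c = (gamma + 1.0) / (t + gamma)
    # tail[i-1] = prod_{t=i+1}^{n} (1 - c_t); the empty product (i = n) is 1
    tail = np.ones(n)
    tail[:-1] = np.cumprod((1.0 - c)[::-1])[::-1][1:]
    return c * tail

def suffix_weights(n, frac=0.5):
    """Uniform weights on the last ceil(frac * n) iterates, zero elsewhere."""
    k = int(np.ceil(frac * n))
    w = np.zeros(n)
    w[n - k:] = 1.0 / k
    return w

def sgd_iterates(grad, x0, n, eta=1.0, alpha=0.8, seed=0):
    """Run SGD with step size eta * i**(-alpha) and return x_1, ..., x_n."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    out = np.empty((n, x.size))
    for i in range(1, n + 1):
        x = x - eta * i ** (-alpha) * grad(x, rng)
        out[i - 1] = x
    return out

# Toy problem (an assumption for illustration): f(x, xi) = 0.5 * ||xi - x||^2
# with xi ~ N(x_star, I), so the stochastic gradient is x - xi.
x_star = np.array([1.0, -2.0])

def grad(x, rng):
    xi = x_star + rng.standard_normal(x.size)
    return x - xi

iterates = sgd_iterates(grad, x0=np.zeros(2), n=5000)
schemes = {
    "uniform (gamma=0)": polynomial_decay_weights(5000, gamma=0.0),
    "polynomial-decay (gamma=3)": polynomial_decay_weights(5000, gamma=3.0),
    "suffix (last half)": suffix_weights(5000, frac=0.5),
}
for name, w in schemes.items():
    x_tilde = w @ iterates  # weighted average of the iterates
    print(name, np.linalg.norm(x_tilde - x_star))
```

Each scheme is just a different nonnegative weight vector that sums to one, so swapping schemes changes only `w`. Storing all iterates keeps the sketch short; an online implementation would update the average recursively instead.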

Abstract

Stochastic Gradient Descent (SGD) is one of the most popular algorithms in statistical and machine learning due to its computational and memory efficiency. Various averaging schemes have been proposed to accelerate the convergence of SGD in different settings. In this paper, we explore a general averaging scheme for SGD. Specifically, we establish the asymptotic normality of a broad range of weighted averaged SGD solutions and provide asymptotically valid online inference approaches. Furthermore, we propose an adaptive averaging scheme that exhibits both optimal statistical rate and favorable non-asymptotic convergence, drawing insights from the optimal weight for the linear model in terms of non-asymptotic mean squared error (MSE).

Paper Structure (35 sections, 17 theorems, 211 equations, 4 figures, 4 tables, 1 algorithm)

Key Result

Theorem 2.3

Given SGD (eq:1) with step size $\eta_i=\eta i^{-\alpha}$ for some $\eta>0$ and $1/2<\alpha<1$, consider the general averaging scheme (wasgd) with $w_{n,i} \ge 0$.

(i) Under Assumptions as1 and as2, the following condition is sufficient for the quenched central limit theorem: for any starting point $x_0$, ...

(ii) Consider the special case where $f(x, \xi) = \frac{1}{2} | \xi - A^{\frac{1}{2}} x |^{\cdots}$ ...

Figures (4)

  • Figure 1: Realizations of online suffix averaging. Here $a_{m}, m\ge 0$, is the index of the starting point of the $m$-th block.
  • Figure 2: Density plot for the standardized error with and without prefactor $w$. Here the number of iterations is $n=100000$, and all the measurements are averaged over 450 independent runs. The red line denotes a standard normal distribution.
  • Figure 3: Left: Log-log plots for MSE. Right: the curves stand for the ratio of MSE between different averaging schemes and adaptive weighted averaging at each step. The baseline (black line) is for the adaptive weighted averaging. Here the step size $\eta_i=i^{-0.8}$, and all the measurements are averaged over 400 independent runs.
  • Figure 4: Comparison of different weight schemes under expectile regression model with $\rho=0.8$ and the step size $\eta_i = i^{-0.505}$. The oracle weights are numerically computed via Monte-Carlo simulation with 50000 repetitions.
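
A standardized-error experiment of the kind described in the Figure 2 caption can be sketched as follows. This is a simplified stand-in rather than the paper's setup. It assumes a one-dimensional loss $f(x,\xi) = \frac{1}{2}(\xi - x)^2$ with Gaussian $\xi$, plain uniform averaging, and hypothetical values for `n`, `runs`, and the step size. For this toy model $A = 1$ and $S = \sigma^2$, so the classical averaged-SGD limit has variance $V = \sigma^2$, and the standardized error should look roughly standard normal for large $n$.

```python
import numpy as np

def standardized_asgd_errors(n=2000, runs=500, eta=1.0, alpha=0.8,
                             sigma=1.0, seed=0):
    """Return sqrt(n) * (xbar_n - x_star) / sigma for `runs` independent
    averaged-SGD paths on f(x, xi) = 0.5 * (xi - x)^2, xi ~ N(x_star, sigma^2)."""
    rng = np.random.default_rng(seed)
    x_star = 0.0
    x = np.zeros(runs)      # every run starts at x_0 = 0
    xbar = np.zeros(runs)   # running uniform average of the iterates
    for i in range(1, n + 1):
        xi = x_star + sigma * rng.standard_normal(runs)
        x = x - eta * i ** (-alpha) * (x - xi)  # SGD step; gradient is x - xi
        xbar += (x - xbar) / i                  # online update of the mean
    return np.sqrt(n) * (xbar - x_star) / sigma

z = standardized_asgd_errors()
print("sample mean:", z.mean())
print("sample variance:", z.var(ddof=1))
```

The runs are vectorized so that one loop over iterations advances all paths at once. At moderate `n` the sample variance can deviate noticeably from 1, since the normal limit is only asymptotic.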

Theorems & Definitions (36)

  • Theorem 2.3
  • Remark 2.4
  • Corollary 2.5
  • Theorem 2.6
  • Remark 2.7
  • Remark 2.8
  • Theorem 2.9
  • Corollary 3.1
  • Corollary 3.2
  • Remark 3.3: Online algorithm for suffix averaging
  • ...and 26 more