Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression

Xuheng Li, Quanquan Gu

TL;DR

The paper analyzes SGD with exponential moving average (EMA) in high-dimensional linear regression to explain EMA's empirical success. It derives an instance-dependent excess risk bound that decomposes into an effective bias term and an effective variance term, with the bias decaying exponentially in every eigen-subspace of the data covariance. The results show that EMA reduces variance relative to SGD without averaging, and that the bound is dimension-free and depends on the spectrum through effective dimensions; a lower bound demonstrates tightness. The analysis extends to mini-batch SGD and to a broad class of averaging schemes, and experiments validate the theoretical findings. Together, these results give a principled, spectrum-aware account of EMA and its relation to other averaging schemes in stochastic optimization.

Abstract

Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
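As a minimal sketch of the setting described above, one pass of online SGD on the squared loss with an EMA of the iterates might look like the following. The abstract does not spell out the recursion, so the update `w_ema = alpha * w_ema + (1 - alpha) * w`, the zero initialization, the excess-risk convention with a factor of 0.5, the synthetic spectrum, and all function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np


def sgd_with_ema(X, y, lr, alpha):
    """Run one pass of online SGD on the squared loss and track an EMA of the iterates.

    Assumed convention (not taken from the paper):
    w_ema <- alpha * w_ema + (1 - alpha) * w.
    """
    n, d = X.shape
    w = np.zeros(d)       # SGD iterate, initialized at zero
    w_ema = np.zeros(d)   # EMA of the iterates, also initialized at zero
    for x_t, y_t in zip(X, y):
        grad = (x_t @ w - y_t) * x_t   # gradient of 0.5 * (x_t @ w - y_t) ** 2
        w = w - lr * grad              # plain SGD step on one fresh sample
        w_ema = alpha * w_ema + (1.0 - alpha) * w
    return w, w_ema


def excess_risk(w, w_star, H):
    """Excess risk 0.5 * (w - w*)^T H (w - w*) for data covariance H (assumed convention)."""
    diff = w - w_star
    return 0.5 * float(diff @ H @ diff)


# Synthetic example: Gaussian features with a decaying spectrum (illustrative values).
rng = np.random.default_rng(0)
d, n = 20, 2000
eigvals = 1.0 / np.arange(1, d + 1) ** 2
H = np.diag(eigvals)
w_star = np.ones(d)
X = rng.normal(size=(n, d)) * np.sqrt(eigvals)   # rows have covariance H
y = X @ w_star + 0.1 * rng.normal(size=n)        # noisy linear labels

w_last, w_ema = sgd_with_ema(X, y, lr=0.1, alpha=0.99)
risk_last = excess_risk(w_last, w_star, H)   # risk of the last SGD iterate
risk_ema = excess_risk(w_ema, w_star, H)     # risk of the EMA iterate
```

Comparing `risk_last` and `risk_ema` over several seeds and values of `alpha` is one way to observe the bias and variance behaviour the abstract describes; setting `alpha=0.0` recovers the last iterate exactly under this convention.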

Paper Structure

This paper contains 48 sections, 18 theorems, 111 equations, 2 figures, 1 table.

Key Result

Theorem 4.1

Suppose that Assumptions assumption:second, assumption:fourth, and assumption:noise hold, and that the hyperparameters satisfy [...]. Then the excess risk satisfies [...], where the effective bias satisfies [...] and the effective variance satisfies [...], with the eigenvalue cutoffs defined as [...].

Figures (2)

  • Figure 1: Comparison of SGD with different averaging schemes. The bias error of SGD with EMA is more stable than that of SGD without averaging, and decays faster than that of iterate averaging and tail averaging when $N$ is large. The variance error of SGD with EMA remains relatively small, and is comparable to that of SGD with iterate averaging or tail averaging.
  • Figure 2: Comparison of SGD with EMA for different values of $\alpha$. With a smaller $\alpha$, the bias error of SGD with EMA decays faster at the beginning of training, but the advantage is less significant when $N$ is large. The variance error of SGD with EMA decreases as $\alpha$ increases.
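The averaging schemes compared in the figures can all be read as weighted sums of the SGD iterates $w_1, \dots, w_N$, which gives some intuition for how they differ. The sketch below writes out those weights under standard textbook definitions; the function names, the tail start index `s`, and the zero-initialized EMA convention are assumptions made for illustration, not the paper's notation.

```python
import numpy as np


def iterate_average_weights(n):
    """Uniform weight 1/n on every iterate w_1, ..., w_n."""
    return np.full(n, 1.0 / n)


def tail_average_weights(n, s):
    """Uniform weight on the last n - s iterates, zero weight on the first s."""
    weights = np.zeros(n)
    weights[s:] = 1.0 / (n - s)
    return weights


def ema_weights(n, alpha):
    """Weight (1 - alpha) * alpha**(n - t) on iterate w_t, for t = 1, ..., n.

    This matches the recursion w_ema <- alpha * w_ema + (1 - alpha) * w_t
    started from w_ema = 0 (an assumed convention), so the weights sum to
    1 - alpha**n rather than exactly 1.
    """
    exponents = np.arange(n - 1, -1, -1)   # n - t for t = 1, ..., n
    return (1.0 - alpha) * alpha ** exponents


def weighted_average(iterates, weights):
    """Combine iterates (shape (n, d)) using the given weights (shape (n,))."""
    return weights @ iterates
```

Under these definitions, EMA concentrates its weight on recent iterates with a geometric decay, whereas iterate averaging gives the earliest iterates the same weight as the latest ones. A larger `alpha` spreads the EMA weights over more iterates.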

Theorems & Definitions (18)

  • Theorem 4.1: Upper bound
  • Theorem 4.2: Lower bound
  • Proposition 4.3
  • Theorem 6.1
  • Corollary 6.2
  • Lemma B.1
  • Lemma B.2
  • Lemma B.3
  • Lemma B.4
  • Lemma B.5
  • ...and 8 more