Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression
Xuheng Li, Quanquan Gu
TL;DR
The paper analyzes SGD with exponential moving average (EMA) in high-dimensional linear regression to explain EMA's empirical success. It derives an instance-dependent excess risk bound that decomposes into a bias term and a variance term: the bias decays exponentially in every eigen-subspace of the data covariance, and the variance is smaller than that of SGD without averaging. The bound is dimension-free and depends on the covariance spectrum through effective dimensions; a lower bound shows that it is tight. The analysis extends to mini-batch SGD and to a broad class of averaging schemes, and experiments support the theoretical findings. This work provides a principled, spectrum-aware understanding of EMA and its relation to other averaging schemes in stochastic optimization.
Abstract
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
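To make the setting concrete, the following is a minimal sketch of online SGD with EMA on a synthetic linear regression problem. It is an illustration under stated assumptions, not the authors' experimental setup: the diagonal covariance with spectrum 1/i, the Gaussian data and noise, the step size `lr`, the averaging weight `alpha`, the EMA parametrization (weight `alpha` on the previous average, `1 - alpha` on the new iterate), and the helper name `run_sgd_ema` are all hypothetical choices made for this example.

```python
import numpy as np

def run_sgd_ema(seed, d=10, n_steps=4000, lr=0.05, alpha=0.99, noise_std=1.0):
    """One pass of online SGD on linear regression, tracking an EMA of the iterates.

    Returns the excess risk of the last SGD iterate and of the EMA iterate.
    All default values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    eigvals = 1.0 / np.arange(1, d + 1)  # assumed spectrum of a diagonal covariance H
    w_star = rng.standard_normal(d)      # ground-truth weights
    w = np.zeros(d)                      # SGD iterate, started at zero
    w_ema = np.zeros(d)                  # EMA of the iterates, started at the same point

    for _ in range(n_steps):
        # Fresh sample at every step (online, one-pass setting).
        x = np.sqrt(eigvals) * rng.standard_normal(d)
        y = x @ w_star + noise_std * rng.standard_normal()
        # SGD step on the squared loss 0.5 * (x @ w - y) ** 2.
        w = w - lr * (x @ w - y) * x
        # EMA update: weight alpha on the old average, 1 - alpha on the new iterate.
        w_ema = alpha * w_ema + (1.0 - alpha) * w

    def excess_risk(v):
        # 0.5 * (v - w_star)^T H (v - w_star) for diagonal H.
        return 0.5 * np.sum(eigvals * (v - w_star) ** 2)

    return excess_risk(w), excess_risk(w_ema)

# Average over independent runs, since a single run is noisy.
risks = np.array([run_sgd_ema(seed) for seed in range(20)])
mean_last, mean_ema = risks.mean(axis=0)
print(f"mean excess risk, last iterate: {mean_last:.4f}")
print(f"mean excess risk, EMA iterate:  {mean_ema:.4f}")
```

In this toy configuration the bias has essentially vanished by the end of training, so the comparison mostly reflects the variance error. One would expect the EMA iterate to show the lower mean excess risk, in line with claim (i), although this sketch is not a reproduction of the paper's experiments.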
