Why is parameter averaging beneficial in SGD? An objective smoothing perspective
Atsushi Nitanda, Ryuhei Kikuchi, Shugo Maeda, Denny Wu
TL;DR
The paper investigates why parameter averaging in SGD improves generalization. It views SGD through a smoothing lens: the stochastic gradient noise $\epsilon'$, scaled by the step size $\eta$, induces a smoothed objective $F(v) = \mathbb{E}[f(v - \eta\epsilon')]$. Under certain noise and smoothness conditions, the authors prove that averaged SGD can converge closer to the smoothed minimizer $v_*$ than plain SGD does, a comparison supported by upper and lower bounds on the SGD error $D_\infty$. Toy examples and experiments on CIFAR-10/100 (with ResNet-50, WRN-28-10, and Pyramid networks) show that averaging with a large step size yields notable performance gains, sometimes matching SAM, and that tail-averaging further benefits generalization. The practical strategy that follows is to use a larger step size so that SGD is driven into rough, flat regions, then average the iterates to extract the stable flat-region solution, which improves generalization on difficult datasets.
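To make the smoothed objective $F(v)$ concrete, the sketch below estimates it by Monte Carlo for a one-dimensional toy function. Everything here is an illustrative assumption rather than the paper's setup: the function `f`, the helper `smoothed`, the choice of standard Gaussian noise for $\epsilon'$, and the value of `eta` are all hypothetical. The toy objective has a wide basin at 0 and a narrow, deeper dip at 2; after smoothing, the narrow dip should largely wash out.

```python
import math
import random


def f(v):
    """Hypothetical objective: wide quadratic basin at 0 plus a narrow, deep dip at 2."""
    return 0.1 * v**2 - 1.5 * math.exp(-((v - 2.0) / 0.05) ** 2)


def smoothed(f, v, eta, n_samples=20000, seed=0):
    """Monte Carlo estimate of F(v) = E[f(v - eta * eps)].

    Assumption: eps is standard Gaussian. In the paper the noise is the
    stochastic gradient noise, which need not be Gaussian.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        total += f(v - eta * eps)
    return total / n_samples


# Compare the raw and smoothed objective at the two candidate minima.
print("f(0), f(2):", f(0.0), f(2.0))
print("F(0), F(2):", smoothed(f, 0.0, eta=0.5), smoothed(f, 2.0, eta=0.5))
```

In this toy setting the sharp dip is the better point for `f` but the worse one for the smoothed estimate, which is the sense in which smoothing "eliminates" sharp minima.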
Abstract
It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
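As a rough illustration of parameter averaging, the following sketch runs constant-step SGD with additive gradient noise on a one-dimensional quadratic and averages the iterates after a burn-in period (tail averaging). It is a minimal toy under stated assumptions, not the paper's algorithm or experimental setup: the function name `averaged_sgd`, the Gaussian gradient noise, and all parameter values are hypothetical. For a quadratic, Gaussian smoothing only adds a constant, so the smoothed and original minimizers coincide at 0.

```python
import random


def averaged_sgd(grad, v0, eta, n_steps, burn_in, noise_std=1.0, seed=0):
    """Constant-step SGD with additive Gaussian gradient noise (an assumption).

    Returns the last iterate, the average of the iterates after `burn_in`
    steps, and the list of those tail iterates.
    """
    rng = random.Random(seed)
    v = v0
    tail = []
    for t in range(n_steps):
        g = grad(v) + rng.gauss(0.0, noise_std)  # noisy gradient
        v = v - eta * g                          # plain SGD step
        if t >= burn_in:
            tail.append(v)
    avg = sum(tail) / len(tail)
    return v, avg, tail


# Toy objective f(v) = 0.5 * v**2, so grad(v) = v and the minimizer is 0.
last, avg, tail = averaged_sgd(lambda v: v, v0=5.0, eta=0.5,
                               n_steps=6000, burn_in=1000)
# Root-mean-square distance of the individual tail iterates from the minimizer.
rms = (sum(x * x for x in tail) / len(tail)) ** 0.5
print("last iterate:", last)
print("tail average:", avg)
print("RMS distance of tail iterates from 0:", rms)
```

With a large constant step size the individual iterates keep fluctuating around the minimizer, while their average sits much closer to it. This is the effect that averaging exploits.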
