Why is parameter averaging beneficial in SGD? An objective smoothing perspective

Atsushi Nitanda; Ryuhei Kikuchi; Shugo Maeda; Denny Wu

TL;DR

The paper investigates why averaging parameters in SGD improves generalization, framing SGD through a smoothing lens where the stochastic gradient noise induces a smoothed objective $F(v)=\mathbb{E}\left[ f(v-\eta \epsilon') \right]$. It proves that averaged SGD can converge closer to the smoothed minimizer $v_*$ than SGD under certain noise and smoothness conditions, supported by upper and lower bounds on the SGD error $D_\infty$. Through toy examples and empirical results on CIFAR-10/100 (with ResNet-50, WRN-28-10, and Pyramid networks), it demonstrates that large-step averaging yields notable performance gains, sometimes rivaling or matching SAM, and that tail-averaging further benefits generalization. The findings illuminate a practical strategy: employ larger step sizes to drive SGD into rough, flat regions and use averaging to extract the stable, flat-region solution, improving generalization on difficult datasets.
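To make the smoothed objective concrete, here is a minimal sketch that estimates $F(v)=\mathbb{E}[f(v-\eta\epsilon')]$ by Monte Carlo sampling. Everything in it is an illustrative assumption rather than the paper's setup: the helper `smoothed_objective`, the 1-D function `toy_objective`, and the choice of noise uniform on $[-1, 1]$.

```python
import random

def smoothed_objective(f, v, eta, n_samples=20000, seed=0):
    """Monte Carlo estimate of F(v) = E[f(v - eta * eps)], eps ~ Uniform[-1, 1].

    The uniform noise distribution is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.uniform(-1.0, 1.0)
        total += f(v - eta * eps)
    return total / n_samples

def toy_objective(v):
    # Hypothetical 1-D objective (not from the paper): a wide basin around
    # v = -1 with value 0, and a narrow but deeper dip around v = 1 with
    # value -0.5.
    return min(0.5 * (v + 1.0) ** 2, 50.0 * (v - 1.0) ** 2 - 0.5)

if __name__ == "__main__":
    # Compare the smoothed value at the wide basin and at the narrow dip
    # as the step size (and hence the smoothing strength) grows.
    for eta in (0.0, 0.5, 1.0):
        wide = smoothed_objective(toy_objective, -1.0, eta)
        narrow = smoothed_objective(toy_objective, 1.0, eta)
        print(f"eta={eta}: F(-1)={wide:.3f}, F(1)={narrow:.3f}")
```

In this toy case the narrow dip is the global minimum of $f$, but for a large enough $\eta$ the smoothed value there exceeds the smoothed value in the wide basin, which is the sense in which smoothing removes sharp minima.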

Abstract

It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
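As a rough illustration of averaged SGD, the sketch below runs noisy gradient steps on a 1-D quadratic and returns both the last iterate and the running average of all iterates. The function `run_sgd`, the quadratic objective, and the uniform gradient noise are assumptions chosen for brevity; the paper's precise sequence definitions and assumptions are not reproduced here.

```python
import random

def run_sgd(grad, v0, eta, n_steps, seed=0):
    """Run noisy SGD in 1-D; return (last iterate, average of all iterates)."""
    rng = random.Random(seed)
    v = v0
    running_sum = v0
    for _ in range(n_steps):
        noise = rng.uniform(-1.0, 1.0)  # assumed zero-mean gradient noise
        v = v - eta * (grad(v) + noise)
        running_sum += v
    return v, running_sum / (n_steps + 1)

if __name__ == "__main__":
    # Gradient of the hypothetical objective f(v) = 0.5 * v**2 (minimizer 0).
    grad = lambda v: v
    last, avg = run_sgd(grad, v0=2.0, eta=0.5, n_steps=2000)
    print(f"last iterate: {last:+.4f}, averaged iterate: {avg:+.4f}")
```

With a fixed, fairly large step size the last iterate keeps fluctuating around the minimizer, while the average cancels much of that noise. This simple quadratic case shows only the variance-reduction side of averaging, not the paper's full argument.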

Paper Structure (36 sections, 11 theorems, 80 equations, 11 figures, 4 tables)

INTRODUCTION
Contributions
PRELIMINARY
Stochastic gradient descent
Alternative view of SGD
CONVERGENCE ANALYSIS
Analysis of averaged SGD
Evaluation of SGD error $D_\infty$
Example
EXPERIMENTS
Related Literature and Discussion
Flat Minimum.
Markov Chain Interpretation of SGD.
Step size and Minibatch.
Edge of Stability.
...and 21 more sections

Key Result

Theorem 1

Under Assumptions (A1)-(A4), run averaged SGD for $T$ iterations with step size $\eta \leq \frac{1}{2L}$. Then $\overline{v}_T$ satisfies the following inequality: where we set $D_T := \sqrt{\frac{1}{T+1}\mathbb{E}\left[\sum_{t=0}^T \left\| v_t - v_* \right\|^2 \right]}$.
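The quantity $D_T$ can be estimated from recorded trajectories by replacing the expectation with an average over independent runs. The sketch below assumes each run is stored as a list of $T+1$ points; the function name `empirical_d_t` and the data layout are hypothetical.

```python
import math

def empirical_d_t(trajectories, v_star):
    """Estimate D_T = sqrt( (1/(T+1)) * E[ sum_t ||v_t - v_*||^2 ] ).

    trajectories: list of runs; each run is a list of T+1 points, each point
    a tuple of floats. The expectation is replaced by an average over runs.
    """
    n_runs = len(trajectories)
    horizon = len(trajectories[0])  # this is T + 1
    total = 0.0
    for run in trajectories:
        for v in run:
            # Squared Euclidean distance between v_t and v_*.
            total += sum((a - b) ** 2 for a, b in zip(v, v_star))
    return math.sqrt(total / (n_runs * horizon))

if __name__ == "__main__":
    runs = [
        [(3.0, 4.0), (0.0, 0.0)],
        [(0.0, 0.0), (0.0, 0.0)],
    ]
    print(empirical_d_t(runs, v_star=(0.0, 0.0)))
```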

Figures (11)

Figure 1: We run SGD and averaged SGD 500 times with the uniform stochastic gradient noise for two objective functions (top and bottom). Figure (a) depicts the objective function $f$ (green, $\eta=0$) and smoothed objectives $F$ (red and blue, $\eta>0$). Figures (b) and (c) plot convergent points by SGD and averaged SGD with histograms, respectively.
Figure 2: The figure plots the original objective (green), smoothed objectives (blue, darker is smoother), and convergent points obtained by averaged SGD, which is run 500 times for each step size $\eta \in \{0.1,~0.3,~0.5,~0.7,~0.9\}$.
Figure 3: We run SGD and averaged SGD for two problems, corresponding to the top and bottom figures. For each case, the left figure depicts the original objective $f$ (blue) and the smoothed objective $F$ (orange), and the right figure depicts convergent points of SGD (red) and averaged SGD (blue).
Figure 4: Test accuracies achieved by SGD and averaged SGD on CIFAR100 with ResNet-50 and WRN-28-10.
Figure 5: Sections of the train (red) and test (blue) loss landscapes across the parameters obtained by averaged SGD (distance=$0$) and SGD (distance=$1$) for ResNet-50 on the CIFAR100 dataset. SGD is run with a small step size after running averaged SGD with a large step size. The middle figure is a close-up view at the edge. The triangle and circle markers represent the parameters reached by SGD and averaged SGD, respectively. The right figure plots, in addition to the train and test losses, train loss functions smoothed with Gaussian noise (green, darker is smoother). The blank circles are the minimizers of the smoothed objectives.
...and 6 more figures

Theorems & Definitions (22)

Example 1
Theorem 1
proof : Proof sketch
Proposition 1
Proposition 2
Example 2
Lemma A
proof
Lemma B
proof
...and 12 more
