
Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization

Naoki Sato, Hideaki Iiduka

TL;DR

The paper analyzes how stochastic gradient descent (SGD) implicitly smooths nonconvex objectives through its stochastic noise, with degree of smoothing $\delta=\frac{\eta C}{\sqrt{b}}$, where $\eta$ is the learning rate, $b$ is the batch size, and $C^2$ bounds the variance of the stochastic gradient. It introduces implicit graduated optimization, which schedules $\eta$ and $b$ to progressively reduce smoothing, and extends Hazan et al.'s $\sigma$-nice framework to $\sigma_m$-nice functions, showing that practical losses such as cross-entropy and MSE satisfy these properties. It provides convergence guarantees for SGD-based implicit graduated optimization on $\sigma_m$-nice functions and supports the theory with experiments on ResNet and WideResNet models across CIFAR and ImageNet, illustrating how smoothing interacts with sharpness and generalization. The work offers a principled explanation for why decaying learning rates or increasing batch sizes can improve generalization, which can inform hyperparameter schedules in deep learning.
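
As a quick worked check of the formula for $\delta$ (a sketch that assumes $C$ stays the same across the settings being compared, which the summary does not state): halving the learning rate and quadrupling the batch size each halve the degree of smoothing.

$$
\delta\!\left(\tfrac{\eta}{2}, b\right) = \frac{(\eta/2)\,C}{\sqrt{b}} = \frac{1}{2}\cdot\frac{\eta C}{\sqrt{b}},
\qquad
\delta(\eta, 4b) = \frac{\eta C}{\sqrt{4b}} = \frac{1}{2}\cdot\frac{\eta C}{\sqrt{b}}.
$$

Under that assumption, a decaying learning rate and a growing batch size are two routes to the same reduction in smoothing.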

Abstract

The graduated optimization approach is a method for finding global optimal solutions for nonconvex functions by using a function smoothing operation with stochastic noise. We show that stochastic noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function, the degree of which is determined by the learning rate, batch size, and variance of the stochastic gradient. Using this finding, we propose and analyze a new graduated optimization algorithm that varies the degree of smoothing by varying the learning rate and batch size, and provide experimental results on image classification tasks with ResNets that support our theoretical findings. We further show that there is an interesting relationship between the degree of smoothing by SGD's stochastic noise, the well-studied "sharpness" indicator, and the generalization performance of the model.
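
To make the "function smoothing operation with stochastic noise" concrete, here is a minimal one-dimensional sketch. The paper's Definition 2.1 is not reproduced on this page, so the code assumes the form commonly used in graduated optimization, $\hat{f}_\delta(x) = \mathbb{E}_u[f(x - \delta u)]$ with $u$ uniform on $[-1, 1]$. The toy objective `f` and the helpers `smoothed` and `smoothed_exact` are hypothetical names introduced only for illustration.

```python
import math
import random


def f(x: float) -> float:
    """Toy nonconvex objective (hypothetical, not taken from the paper)."""
    return x * x + 2.0 * math.sin(5.0 * x)


def smoothed(func, x: float, delta: float, n_samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of E_u[func(x - delta * u)] with u ~ Uniform[-1, 1].

    Assumed form of the smoothed function; delta is the degree of smoothing.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = rng.uniform(-1.0, 1.0)
        total += func(x - delta * u)
    return total / n_samples


def smoothed_exact(x: float, delta: float) -> float:
    """Closed form of the same expectation for the toy objective f.

    E[(x - delta*u)^2] = x^2 + delta^2 / 3, and the sine term is damped
    by the factor sin(5*delta) / (5*delta).
    """
    if delta == 0.0:
        return f(x)
    damping = math.sin(5.0 * delta) / (5.0 * delta)
    return x * x + delta * delta / 3.0 + 2.0 * math.sin(5.0 * x) * damping


if __name__ == "__main__":
    for delta in (0.0, 0.5, 1.0):
        print(delta, smoothed(f, 0.5, delta), smoothed_exact(0.5, delta))
```

In this toy case the oscillating term is multiplied by a damping factor whose magnitude is bounded by $1/(5\delta)$, so larger $\delta$ suppresses the local wiggles, while $\delta = 0$ recovers the original function.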

Paper Structure

This paper contains 32 sections, 12 theorems, 62 equations, 15 figures, 2 tables, 2 algorithms.

Key Result

Lemma 2.1

Suppose that (A3)(ii) and (A4) hold for all $t \in \mathbb{N}$; then, $\mathbb{E}_{\xi_t} \left[ \| \nabla f_{\mathcal{S}_t}(\bm{x}_t) - \nabla f(\bm{x}_t) \|^2 \right] \leq \frac{C^2}{b}.$
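
The $1/b$ scaling in this bound can be checked numerically. The sketch below assumes, since (A3)(ii) and (A4) are not spelled out on this page, that they amount to a per-sample gradient variance of at most $C^2$ and a mini-batch gradient formed as the mean of $b$ independent samples. The scalar Gaussian noise model and the function name `minibatch_sq_error` are illustrative choices, not the paper's setup.

```python
import random


def minibatch_sq_error(b: int, C: float, n_trials: int = 4000, seed: int = 0) -> float:
    """Estimate E[(g_batch - g)^2] for a scalar gradient.

    Each per-sample gradient is modeled as g + noise, with noise ~ N(0, C^2)
    (an assumed noise model), and the mini-batch gradient is the mean of b
    independent per-sample gradients.
    """
    rng = random.Random(seed)
    true_grad = 1.0  # hypothetical full gradient value
    total = 0.0
    for _ in range(n_trials):
        batch_grad = sum(rng.gauss(true_grad, C) for _ in range(b)) / b
        total += (batch_grad - true_grad) ** 2
    return total / n_trials


if __name__ == "__main__":
    C = 2.0
    for b in (1, 4, 16, 64):
        print(b, minibatch_sq_error(b, C), C * C / b)
```

In this model the per-sample variance equals $C^2$ exactly, so the estimate should sit close to $C^2/b$; when the per-sample variance is strictly smaller, the lemma's inequality is strict.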

Figures (15)

  • Figure 1: (A) Conceptual diagram of implicit graduated optimization for $\sigma_m$-nice function. (B) Sharpness after 200 epochs of training ResNet18 on the CIFAR100 dataset versus degree of smoothing calculated from learning rate, batch size, and the estimated variance of the stochastic gradient. (C) Test accuracy after same training versus degree of smoothing. The color shading in the scatter plots represents the batch size: the larger the batch size, the darker the color of the plotted points.
  • Figure 2: Accuracy score for testing and loss function value for training versus the number of epochs in training ResNet34 on the ImageNet dataset. The solid line represents the mean value, and the shaded area represents the maximum and minimum over three runs. In method 1, the learning rate and batch size were fixed at 0.1 and 256, respectively. In method 2, the learning rate was decreased every 40 epochs as $\left[0.1, \frac{1}{10\sqrt{2}}, 0.05, \frac{1}{20\sqrt{2}}, 0.025\right]$ and the batch size was fixed at 256. In method 3, the learning rate was fixed at 0.1, and the batch size was increased as $\left[32, 64, 128, 256, 512\right]$. In method 4, the learning rate was decreased as $\left[0.1, \frac{\sqrt{3}}{20}, 0.075, \frac{3\sqrt{3}}{80}, 0.05625\right]$ and the batch size was increased as $\left[32, 48, 72, 108, 162\right]$.
  • Figure 3: (a) Sharpness around the approximate solution after 200 epochs of ResNet18 training on the CIFAR100 dataset versus batch size used. (b) Sharpness versus learning rate used. (c) Sharpness versus degree of smoothing calculated from learning rate, batch size, and estimated variance of the stochastic gradient. (d) Test accuracy after 200 epochs of training versus sharpness. (e) Test accuracy versus degree of smoothing. The solid line represents the mean value, and the shaded area represents the maximum and minimum over three runs. The color shading in the scatter plots represents the batch size; the larger the batch size, the darker the color of the plotted points. "lr" means learning rate. All graphs are drawn from the same set of experimental results. See Figure \ref{fig:999full} for a larger version of this graph.
  • Figure 4: Accuracy score for testing and loss function value for training versus the number of epochs (left) and the number of parameter updates (right) in training ResNet18 on the CIFAR100 dataset. The solid line represents the mean value, and the shaded area represents the maximum and minimum over three runs. In method 1, the learning rate and the batch size were fixed at 0.1 and 128, respectively. In method 2, the learning rate was decreased every 40 epochs as $\left[0.1, \frac{1}{10\sqrt{2}}, 0.05, \frac{1}{20\sqrt{2}}, 0.025\right]$ and the batch size was fixed at 128. In method 3, the learning rate was fixed at 0.1, and the batch size was increased as $\left[16, 32, 64, 128, 256\right]$. In method 4, the learning rate was decreased as $\left[0.1, \frac{\sqrt{3}}{20}, 0.075, \frac{3\sqrt{3}}{80}, 0.05625\right]$ and the batch size was increased as $\left[32, 48, 72, 108, 162\right]$.
  • Figure 5: Accuracy score for testing and loss function value for training versus the number of epochs (left) and the number of parameter updates (right) in training WideResNet-28-10 on the CIFAR100 dataset. The solid line represents the mean value, and the shaded area represents the maximum and minimum over three runs. In method 1, the learning rate and batch size were fixed at 0.1 and 128, respectively. In method 2, the learning rate was decreased every 40 epochs as $\left[0.1, \frac{1}{10\sqrt{2}}, 0.05, \frac{1}{20\sqrt{2}}, 0.025\right]$ and the batch size was fixed at 128. In method 3, the learning rate was fixed at 0.1, and the batch size was increased as $\left[8, 16, 32, 64, 128\right]$. In method 4, the learning rate was decreased as $\left[0.1, \frac{\sqrt{3}}{20}, 0.075, \frac{3\sqrt{3}}{80}, 0.05625\right]$ and the batch size was increased as $\left[8, 12, 18, 27, 40\right]$.
  • ...and 10 more figures
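
The schedules in the Figure 2 caption can be plugged into $\delta=\frac{\eta C}{\sqrt{b}}$ to see how each method changes the degree of smoothing from one stage to the next. The sketch below computes $\delta / C$ per stage and assumes $C$ is constant across stages (the paper estimates it from data, so this is a simplification). The names `SCHEDULES`, `smoothing_per_C`, and `stage_ratios` are hypothetical.

```python
import math

# Learning-rate and batch-size schedules copied from the Figure 2 caption.
SCHEDULES = {
    "method 2 (decaying learning rate)": {
        "lr": [0.1, 1 / (10 * math.sqrt(2)), 0.05, 1 / (20 * math.sqrt(2)), 0.025],
        "batch": [256, 256, 256, 256, 256],
    },
    "method 3 (increasing batch size)": {
        "lr": [0.1, 0.1, 0.1, 0.1, 0.1],
        "batch": [32, 64, 128, 256, 512],
    },
    "method 4 (both)": {
        "lr": [0.1, math.sqrt(3) / 20, 0.075, 3 * math.sqrt(3) / 80, 0.05625],
        "batch": [32, 48, 72, 108, 162],
    },
}


def smoothing_per_C(lr: float, batch: int) -> float:
    """Return delta / C = lr / sqrt(batch), treating C as an unknown constant."""
    return lr / math.sqrt(batch)


def stage_ratios(schedule: dict) -> list:
    """Ratio of delta between each pair of consecutive stages."""
    deltas = [smoothing_per_C(lr, b) for lr, b in zip(schedule["lr"], schedule["batch"])]
    return [later / earlier for earlier, later in zip(deltas, deltas[1:])]


if __name__ == "__main__":
    for name, schedule in SCHEDULES.items():
        print(name, [round(r, 4) for r in stage_ratios(schedule)])
```

By this calculation, which is derived from the caption values rather than quoted from the paper, all three schedules shrink $\delta$ by the same factor $1/\sqrt{2}$ at every stage, so the methods differ in how they reach a given degree of smoothing, not in how fast it decays.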

Theorems & Definitions (30)

  • Definition 2.1: Smoothed function
  • Lemma 2.1
  • Definition 4.1: $\sigma_m$-nice function
  • Proposition 4.1
  • Theorem 4.1: Convergence analysis of Algorithm \ref{alg:sgd2}
  • Proposition 4.2
  • Theorem 4.2: Convergence analysis of Algorithm \ref{alg:gnc2}
  • Lemma 5.1
  • proof
  • proof
  • ...and 20 more