Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization
Naoki Sato, Hideaki Iiduka
TL;DR
The paper analyzes how stochastic gradient descent (SGD) implicitly smooths nonconvex objectives through its stochastic noise, with degree of smoothing $\delta=\frac{\eta C}{\sqrt{b}}$, where $\eta$ is the learning rate, $b$ is the batch size, and $C$ is a constant tied to the variance of the stochastic gradient. It introduces implicit graduated optimization, which schedules $\eta$ and $b$ to progressively reduce the smoothing, and extends Hazan et al.'s $\sigma$-nice framework to $\sigma_m$-nice functions, showing that practical losses such as cross-entropy and MSE satisfy these properties. It provides convergence guarantees for SGD-based implicit graduated optimization on $\sigma_m$-nice functions and supports the theory with experiments on ResNet- and WideResNet-scale models across CIFAR and ImageNet, illustrating how smoothing interacts with sharpness and generalization. This offers a principled explanation for why decaying the learning rate or increasing the batch size can improve performance, and it can inform hyperparameter schedules in deep learning.
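To make the dependence of $\delta$ on $\eta$ and $b$ concrete, here is a minimal sketch that evaluates $\delta=\frac{\eta C}{\sqrt{b}}$ along a staged schedule. The function name `smoothing_degree`, the value $C=1$, and the particular schedule (halve the learning rate and double the batch size at each stage) are illustrative assumptions, not the schedule used in the paper.

```python
import math


def smoothing_degree(lr: float, batch_size: int, c: float = 1.0) -> float:
    """Degree of smoothing delta = lr * C / sqrt(b).

    `c` stands in for the variance-related constant C; its value here is
    an assumption chosen only for illustration.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return lr * c / math.sqrt(batch_size)


# Hypothetical 5-stage schedule: halve the learning rate and double the
# batch size at every stage, so delta shrinks from stage to stage.
schedule = [(0.1 * 0.5**m, 32 * 2**m) for m in range(5)]
deltas = [smoothing_degree(lr, b) for lr, b in schedule]

for (lr, b), d in zip(schedule, deltas):
    print(f"lr={lr:.5f}  b={b:4d}  delta={d:.6f}")
```

Either knob alone reduces $\delta$: decaying $\eta$ acts linearly, while increasing $b$ acts only through $1/\sqrt{b}$, so quadrupling the batch size has the same effect on $\delta$ as halving the learning rate.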
Abstract
The graduated optimization approach is a method for finding global optimal solutions for nonconvex functions by using a function smoothing operation with stochastic noise. We show that stochastic noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function, the degree of which is determined by the learning rate, batch size, and variance of the stochastic gradient. Using this finding, we propose and analyze a new graduated optimization algorithm that varies the degree of smoothing by varying the learning rate and batch size, and provide experimental results on image classification tasks with ResNets that support our theoretical findings. We further show that there is an interesting relationship between the degree of smoothing by SGD's stochastic noise, the well-studied "sharpness" indicator, and the generalization performance of the model.
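The smoothing operation mentioned above can be illustrated with a small Monte Carlo sketch that estimates $\mathbb{E}_u[f(x-\delta u)]$ on a one-dimensional toy function. The objective `f`, the choice of standard Gaussian noise for $u$, and the helper names are all assumptions made for illustration; the paper's exact noise distribution and objectives may differ.

```python
import numpy as np


def f(x):
    """Toy nonconvex objective (hypothetical, not taken from the paper)."""
    return x**2 + 2.0 * np.sin(5.0 * x)


def smoothed(func, xs, delta, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_u[func(x - delta * u)], u ~ N(0, 1).

    The same noise draws are reused at every x (common random numbers),
    so the estimated curve varies smoothly with x.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n_samples)
    return np.array([func(x - delta * u).mean() for x in xs])


def count_local_minima(ys):
    """Count interior grid points strictly lower than both neighbours."""
    return int(np.sum((ys[1:-1] < ys[:-2]) & (ys[1:-1] < ys[2:])))


xs = np.linspace(-3.0, 3.0, 601)
n_raw = count_local_minima(f(xs))
n_smooth = count_local_minima(smoothed(f, xs, delta=1.0))
print("local minima without smoothing:", n_raw)
print("local minima with delta = 1.0: ", n_smooth)
```

With a large $\delta$ the oscillating term is averaged out and the smoothed curve has fewer local minima than the original; graduated optimization starts from such a heavily smoothed function and reduces $\delta$ step by step, reusing each stage's solution as the next starting point.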
