Explicit and Implicit Graduated Optimization in Deep Neural Networks

Naoki Sato, Hideaki Iiduka

TL;DR

This work investigates explicit and implicit graduated optimization in deep neural networks. It proves that Rastrigin's function is a new $\sigma$-nice function and analyzes explicit graduated optimization with an optimal noise schedule on classical benchmarks, while showing limited gains on large DNNs. It then analyzes implicit graduated optimization with SGD and extends it to SGD with momentum, namely stochastic heavy ball (SHB) and normalized stochastic heavy ball (NSHB), with a convergence guarantee: $\mathcal{O}(1/\epsilon^{1/p})$ rounds suffice to reach an $\epsilon$-neighborhood of the global optimum of a new $\sigma$-nice function. Empirically, the implicit approach improves training dynamics on CIFAR100 and ImageNet, and a polynomial decay scheduler with exponent $p \in (0,1]$ yields the strongest performance. The result is a theoretically grounded strategy for scheduling hyperparameters in momentum-based SGD.
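The polynomial decay scheduler mentioned above can be illustrated with a minimal sketch. The formula below is one common form of polynomial decay and is an assumption here, not necessarily the exact schedule used in the paper; the function name `polynomial_decay` and the values of `lr_max`, `lr_min`, and `total_steps` are hypothetical.

```python
def polynomial_decay(step, total_steps, lr_max=0.1, lr_min=0.001, p=0.5):
    """Learning rate at `step` under an assumed polynomial decay of exponent p.

    Assumed form: lr = (lr_max - lr_min) * (1 - step / total_steps) ** p + lr_min.
    The exponent is restricted to (0, 1], the range named in the summary.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    # Clamp progress to [0, 1] so steps beyond the horizon stay at lr_min.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return (lr_max - lr_min) * (1.0 - frac) ** p + lr_min


# Hypothetical 200-step horizon; the schedule starts at lr_max and ends at lr_min.
schedule = [polynomial_decay(t, 200) for t in range(201)]
print(schedule[0], schedule[100], schedule[200])
```

With $p = 1$ this reduces to linear decay; smaller $p$ keeps the learning rate high for longer and drops it more sharply near the end.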

Abstract

Graduated optimization is a global optimization technique that minimizes a multimodal nonconvex function by smoothing the objective function with noise and gradually refining the solution. This paper experimentally evaluates the explicit graduated optimization algorithm with an optimal noise schedule derived in a previous study and discusses its limitations. The evaluation uses traditional benchmark functions and the empirical loss functions of modern neural network architectures. In addition, this paper extends the implicit graduated optimization algorithm, which is based on the fact that stochastic noise in the optimization process of SGD implicitly smooths the objective function, to SGD with momentum. It analyzes the convergence of the extended algorithm and demonstrates its effectiveness through experiments on image classification tasks with ResNet architectures.
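The smooth-then-refine idea can be sketched on Rastrigin's function. This is a toy illustration under stated assumptions, not the paper's algorithm: it uses Gaussian smoothing (chosen because the smoothed gradient then has a closed form), a simple hand-picked decreasing schedule in place of the optimal noise schedule, and plain gradient descent. The names `smoothed_grad`, `gradient_descent`, and `graduated_optimization`, as well as the step size and starting point, are hypothetical.

```python
import math


def rastrigin(x):
    """Standard Rastrigin function: 10*d + sum(x_i^2 - 10*cos(2*pi*x_i))."""
    return 10.0 * len(x) + sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x)


def smoothed_grad(x, delta):
    """Gradient of E[f(x + delta*u)] with u ~ N(0, I) (assumed Gaussian smoothing).

    Averaging over Gaussian noise damps the cosine term by exp(-2*pi^2*delta^2),
    so a large delta leaves an almost purely quadratic, convex landscape.
    delta = 0 recovers the gradient of the original function.
    """
    damp = math.exp(-2.0 * math.pi ** 2 * delta ** 2)
    return [2.0 * xi + 20.0 * math.pi * damp * math.sin(2.0 * math.pi * xi) for xi in x]


def gradient_descent(x, delta, lr=0.002, steps=2000):
    """Plain gradient descent on the delta-smoothed objective."""
    for _ in range(steps):
        g = smoothed_grad(x, delta)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def graduated_optimization(x0, deltas=(1.0, 0.5, 0.25, 0.125, 0.0)):
    """Minimize a sequence of progressively less smoothed objectives.

    Each round starts from the previous round's solution; the final round
    (delta = 0) optimizes the original function.
    """
    x = list(x0)
    for delta in deltas:
        x = gradient_descent(x, delta)
    return x


x0 = [3.3, -2.7]
x_plain = gradient_descent(list(x0), delta=0.0)  # no smoothing: stalls in a local minimum
x_grad = graduated_optimization(x0)              # smoothing first: reaches the origin
print("plain GD:  ", x_plain, rastrigin(x_plain))
print("graduated: ", x_grad, rastrigin(x_grad))
```

From this starting point, unsmoothed gradient descent settles in the local minimum nearest to it, while the graduated run first follows the nearly convex smoothed landscape toward the origin and then refines there as the smoothing is reduced.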

Paper Structure

This paper contains 17 sections, 6 theorems, 44 equations, 17 figures, 2 tables, and 7 algorithms.

Key Result

Theorem 1

Rastrigin's function is a new $\sigma$-nice function.
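
For reference, the standard $d$-dimensional form of Rastrigin's function is given below. This is the textbook definition with the usual constant $A = 10$; the exact parametrization and domain used in the paper are not stated in this summary, so treat them as assumptions.

$$
f(x) = A d + \sum_{i=1}^{d} \left( x_i^2 - A \cos(2\pi x_i) \right), \qquad A = 10 .
$$

The function has a unique global minimum $f(0) = 0$ at the origin and a regular grid of local minima near the integer points, which is what makes it a common multimodal benchmark.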

Figures (17)

  • Figure 1:
  • Figure 2: Rastrigin's function of two variables
  • Figure 3: Loss function value for training versus the number of epochs in training ResNet18 on the CIFAR100 dataset. The solid lines represent the mean value, and the shaded areas represent the maximum and minimum values over three runs.
  • Figure 4: Accuracy score in testing and loss function value in training versus the number of epochs in training ResNet18 on the CIFAR100 dataset with SGD. The solid lines represent the mean value, and the shaded areas represent the maximum and minimum values over three runs.
  • Figure 5: Accuracy score in testing and loss function value in training ResNet18 on the CIFAR100 dataset with Algorithm \ref{alg:gnc3} versus the number of epochs. The blue plot represents vanilla SHB, and the other five plots represent Algorithm \ref{alg:gnc3}. "lr" means the learning rate. The solid lines represent the mean value, and the shaded areas represent the maximum and minimum values over three runs.
  • ...and 12 more figures

Theorems & Definitions (16)

  • Definition 1: Smoothed function
  • Definition 2
  • Remark 1
  • Remark 2
  • Theorem 1
  • Remark 3
  • Theorem 2: Convergence analysis of Algorithm \ref{alg:gnc3}
  • Proof
  • Theorem 3: Convergence analysis of Algorithm \ref{alg:sgd}
  • Lemma 1
  • ...and 6 more