Learning-rate-free Momentum SGD with Reshuffling Converges in Nonsmooth Nonconvex Optimization

Xiaoyin Hu, Nachuan Xiao, Xin Liu, Kim-Chuan Toh

Abstract

In this paper, we propose a general framework for developing learning-rate-free momentum stochastic gradient descent (SGD) methods for minimizing nonsmooth nonconvex functions, especially for training nonsmooth neural networks. Our framework adaptively generates learning rates from the historical data of stochastic subgradients and iterates. Under mild conditions, we prove that our proposed framework enjoys global convergence to the stationary points of the objective function in the sense of the conservative field, hence providing convergence guarantees for training nonsmooth neural networks. Based on our proposed framework, we propose a novel learning-rate-free momentum SGD method (LFM). Preliminary numerical experiments reveal that LFM performs comparably to state-of-the-art learning-rate-free methods (whose convergence has not been established theoretically) across well-known neural network training benchmarks.
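To make the adaptive rule concrete, the following is a minimal sketch of a learning-rate-free momentum step in which the step size is built solely from the history of iterates and stochastic subgradients: the running maximum distance from the initial point divided by the accumulated squared subgradient norms, a DoG-style ratio combined with heavy-ball-style momentum. The function name lfm_sketch, the parameters rho and beta, and the exact update below are illustrative assumptions, not the paper's LFM algorithm.

    import numpy as np

    def lfm_sketch(grad, x0, rho=1e-4, beta=0.9, steps=1000):
        """Illustrative learning-rate-free momentum SGD loop.

        The learning rate is generated from the running history of
        iterate distances and stochastic (sub)gradient norms, in the
        spirit of DoG/DoWG; this is a sketch, not the authors' LFM update.
        """
        x = x0.copy()
        m = np.zeros_like(x)      # momentum buffer
        max_dist = rho            # largest distance from x0 observed so far
        grad_sq_sum = 0.0         # accumulated squared subgradient norms
        for _ in range(steps):
            g = grad(x)           # stochastic subgradient at the current iterate
            grad_sq_sum += np.dot(g, g)
            lr = max_dist / np.sqrt(grad_sq_sum + 1e-12)  # no tuned learning rate
            m = beta * m + (1.0 - beta) * g               # momentum averaging
            x = x - lr * m
            max_dist = max(max_dist, np.linalg.norm(x - x0))
        return x

For instance, calling lfm_sketch(np.sign, np.array([2.0, -3.0])) drives the iterates of the nonsmooth function $f(x) = \|x\|_1$ toward its stationary point at the origin without any hand-tuned learning rate.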


Paper Structure

This paper contains 22 sections, 21 theorems, 65 equations, 5 figures, 1 table, and 2 algorithms.

Key Result

Proposition 2.5

Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a locally Lipschitz continuous definable function. Then $f$ is a potential function that admits $\partial f$ as its conservative field.
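As a standard one-dimensional illustration (not taken from the paper), consider $f(x) = |x|$, which is definable and locally Lipschitz. Its Clarke subdifferential satisfies the chain rule along absolutely continuous curves that characterizes conservative fields:

$$\partial f(x) = \begin{cases} \{\operatorname{sign}(x)\}, & x \neq 0, \\ [-1, 1], & x = 0, \end{cases} \qquad \frac{\mathrm{d}}{\mathrm{d}t} f(\gamma(t)) = \langle v, \dot{\gamma}(t) \rangle \quad \text{for all } v \in \partial f(\gamma(t)) \text{ and a.e. } t.$$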

Figures (5)

  • Figure 1: Numerical results on the final-epoch test accuracy of LFM with different scaling parameters $\rho$ and momentum parameters $\beta$.
  • Figure 2: Numerical results on applying LFM, DoG, DoWG, and D-adapted SGD for training ResNet-50 over CIFAR datasets.
  • Figure 3: Numerical results on applying LFM, DoG, DoWG, and D-adapted SGD for training VGG-Net over CIFAR datasets.
  • Figure 4: Numerical results on applying LFM, DoG, DoWG, and D-adapted SGD for training MobileNet over CIFAR datasets.
  • Figure 5: Numerical results on applying LFM, DoG, DoWG, and D-adapted SGD for training ResNet-50 over the ImageNet dataset.

Theorems & Definitions (42)

  • Definition 2.1 (Clarke, 1990)
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Proposition 2.5 (Theorem 5.8 in Davis et al., 2020)
  • Proposition 2.6 (Proposition 1 in Pauwels, 2023)
  • Proposition 2.7
  • Definition 2.8
  • Definition 2.9
  • ...and 32 more