Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees

Nachuan Xiao; Xiaoyin Hu; Xin Liu; Kim-Chuan Toh

TL;DR

This paper addresses the challenge of guaranteeing convergence for Adam-family optimization methods when the objective is nonsmooth and nonconvex, as encountered in training nonsmooth neural networks. It introduces a two-timescale stochastic framework based on conservative field theory and establishes convergence to $\mathcal{D}_f$-stationary points and, under random initialization, to Clarke stationary points, even with heavy-tailed evaluation noise. It also extends the framework to stochastic subgradient methods with gradient clipping, which converge when the heavy-tailed noise is only assumed to be integrable, and shows that popular Adam-family methods (Adam, AdaBelief, AMSGrad, NAdam, Yogi) fit the framework when run with diminishing stepsizes. Numerical experiments on vision and NLP tasks corroborate the theory, showing performance competitive with the standard PyTorch implementations and improved robustness under heavy-tailed noise. Overall, the work provides practical convergence guarantees for nonsmooth neural network optimization and broadens the applicability of Adam-family methods in realistic, noisy settings.
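To make the idea of an Adam-family update with diminishing stepsizes concrete, here is a minimal sketch on the nonsmooth function f(x) = |x - 1|. It is an illustration only, not the paper's framework: the function names, the stepsize schedules `eta` and `theta`, and the choice of 0 as the subgradient at the kink are all assumptions made for this example.

```python
import math


def subgrad_abs_shift(x, target=1.0):
    """Return one subgradient of f(x) = |x - target|.

    At the kink x == target any value in [-1, 1] is valid; 0.0 is an
    arbitrary choice made for this sketch.
    """
    if x > target:
        return 1.0
    if x < target:
        return -1.0
    return 0.0


def adam_style_minimize(x0, steps=2000, eps=1e-8):
    """Run an Adam-like iteration with diminishing stepsizes (assumed schedules)."""
    x, m, v = x0, 0.0, 0.0
    for k in range(steps):
        eta = 0.5 / (k + 1) ** 0.75    # stepsize for the variable x (assumption)
        theta = 1.0 / (k + 1) ** 0.5   # averaging weight for m and v (assumption)
        g = subgrad_abs_shift(x)
        m = (1.0 - theta) * m + theta * g        # first-moment (momentum) estimate
        v = (1.0 - theta) * v + theta * g * g    # second-moment estimate
        x -= eta * m / (math.sqrt(v) + eps)      # coordinate-wise normalized step
    return x


if __name__ == "__main__":
    print(adam_style_minimize(3.0))
```

Both schedules shrink to zero, with `eta` shrinking faster than `theta`, so the moment estimates adapt more quickly than the iterate moves. Which schedules the paper's analysis actually requires should be taken from the paper itself.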

Abstract

In this paper, we present a comprehensive study on the convergence properties of Adam-family methods for nonsmooth optimization, especially in the training of nonsmooth neural networks. We introduce a novel framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions. Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks. Furthermore, we develop stochastic subgradient methods that incorporate gradient clipping techniques for training nonsmooth neural networks with heavy-tailed noise. Through our framework, we show that our proposed methods converge even when the evaluation noise is only assumed to be integrable. Extensive numerical experiments demonstrate the high efficiency and robustness of our proposed methods.
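As an illustration of the gradient clipping mentioned above, the following sketch shows one common variant, clipping by Euclidean norm, applied before a plain subgradient step. The function names and the choice of norm clipping are assumptions for this example; the paper's clipping operator may be defined differently (for instance coordinate-wise).

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)          # already within the threshold: unchanged copy
    scale = threshold / norm
    return [g * scale for g in grad]


def clipped_subgradient_step(x, noisy_grad, stepsize, threshold):
    """One subgradient step that uses the clipped noisy subgradient."""
    clipped = clip_by_norm(noisy_grad, threshold)
    return [xi - stepsize * gi for xi, gi in zip(x, clipped)]


if __name__ == "__main__":
    # A single very large noisy subgradient moves the iterate by at most
    # stepsize * threshold in norm.
    print(clipped_subgradient_step([0.0, 0.0], [300.0, 400.0], 0.1, 1.0))
```

Clipping bounds the influence of any single noisy evaluation, which is the intuition for why such methods can tolerate heavy-tailed noise.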

Paper Structure (26 sections, 25 theorems, 120 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Challenges from Training Nonsmooth Neural Networks
Challenges from Heavy-tailed Evaluation Noises
Contributions
Organization
Preliminary
Basic Notations
Probability Theory
Nonsmooth Analysis
Clarke Subdifferential
Conservative Field
Differential Inclusion and Stochastic Subgradient Methods
A General Framework for Convergence Properties
Convergence to $\mathcal{D}_f$-stationary Points
Convergence to $\partial f$-stationary Points with Random Initialization
...and 11 more sections

Key Result

Proposition 1

Suppose $\{\eta_k\}$ and $\{\theta_k\}$ are two diminishing positive sequences of real numbers that satisfy [...]. Let $\lambda_0 := 0$, $\lambda_i := \sum_{k = 0}^{i-1} \eta_k$, and $\Lambda(t) := \sup \{k \geq 0: t\geq \lambda_k\}$. Then for any $T > 0$ and any uniformly bounded martingale difference sequence $\{\xi_k\}$, it holds almost surely that [...]
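The quantities $\lambda_i$ and $\Lambda(t)$ defined in the proposition can be computed directly for a finite prefix of stepsizes. The sketch below follows the two definitions as stated; the function names are hypothetical, and $\Lambda(t)$ is evaluated only over the indices that have been computed.

```python
from bisect import bisect_right
from itertools import accumulate


def time_grid(etas):
    """Return [lambda_0, ..., lambda_n], where lambda_0 = 0 and
    lambda_i is the sum of eta_k for k = 0, ..., i - 1."""
    return [0.0] + list(accumulate(etas))


def big_lambda(t, lambdas):
    """Return the largest index k with lambda_k <= t.

    Because the stepsizes are positive, lambdas is strictly increasing,
    so a binary search finds the index. The result is restricted to the
    computed prefix of the sequence.
    """
    if t < 0:
        raise ValueError("t must be nonnegative")
    return bisect_right(lambdas, t) - 1


if __name__ == "__main__":
    lambdas = time_grid([1.0, 0.5, 0.25])
    print(lambdas)                   # [0.0, 1.0, 1.5, 1.75]
    print(big_lambda(1.2, lambdas))  # 1
```

Here $\lambda_i$ acts as the accumulated "time" after $i$ iterations, and $\Lambda(t)$ maps a time $t$ back to the last iteration reached by then.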

Figures (5)

Figure 1: Test results on the CIFAR-10 data set with ResNet50. Here "acc." abbreviates "accuracy".
Figure 2: Test results on the CIFAR-100 data set with ResNet50. Here "acc." abbreviates "accuracy".
Figure 3: Test results on MNIST data set with LeNet.
Figure 4: Test results on CIFAR data sets with ResNet50.
Figure 5: Numerical results on NLP tasks.

Theorems & Definitions (47)

Definition 1
Definition 2
Definition 3
Definition 4
Proposition 1
Definition 5 (Clarke, 1990)
Definition 6 (Clarke, 1990)
Definition 7
Definition 8
Definition 9: Aumann’s integral
...and 37 more