Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees
Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh
TL;DR
This paper addresses the challenge of guaranteeing convergence for Adam-family optimization methods when the objective is nonsmooth and nonconvex, as encountered in training nonsmooth neural networks. It introduces a two-timescale stochastic framework based on conservative field theory and establishes convergence to $\mathcal{D}_f$-stationary points and, under random initialization, to Clarke stationary points, even with heavy-tailed evaluation noise. It also extends the framework to stochastic subgradient methods with gradient clipping, which converge when the heavy-tailed noise is only integrable, and shows that popular Adam-family methods (Adam, AdaBelief, AMSGrad, NAdam, Yogi) fit the framework when run with diminishing stepsizes. Numerical experiments on vision and NLP tasks corroborate the theory, showing performance competitive with the standard PyTorch implementations and improved robustness under heavy-tailed noise. Overall, the work provides practical convergence guarantees for nonsmooth neural network optimization and broadens the applicability of Adam-family methods in realistic, noisy settings.
Abstract
In this paper, we present a comprehensive study on the convergence properties of Adam-family methods for nonsmooth optimization, especially in the training of nonsmooth neural networks. We introduce a novel framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions. Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks. Furthermore, we develop stochastic subgradient methods that incorporate gradient clipping techniques for training nonsmooth neural networks with heavy-tailed noise. Through our framework, we show that our proposed methods converge even when the evaluation noise is only assumed to be integrable. Extensive numerical experiments demonstrate the high efficiency and robustness of our proposed methods.
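To make the ingredients named in the abstract concrete, the following is a minimal sketch of an Adam-style iteration that uses diminishing stepsizes and clipped stochastic subgradients on a nonsmooth objective. It is not the algorithm or the parameter conditions analyzed in the paper. The test objective $f(x) = \sum_i |x_i|$, the function names (`clip`, `l1_subgradient`, `adam_like`), the schedules `eta` and `theta`, the Pareto-based noise model, and the clipping threshold are all assumptions chosen for illustration.

```python
import math
import random


def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm <= c:
        return list(g)
    return [gi * (c / norm) for gi in g]


def l1_subgradient(x):
    """One Clarke subgradient of f(x) = sum(|x_i|): the sign vector (0 at 0)."""
    return [(xi > 0) - (xi < 0) for xi in x]


def adam_like(x0, steps=5000, eta0=0.1, tau=5.0, eps=1e-8, clip_c=5.0, seed=0):
    """Illustrative Adam-style loop; every schedule here is an assumption."""
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (momentum) estimate
    v = [0.0] * len(x)  # second-moment estimate
    for k in range(steps):
        eta = eta0 / math.sqrt(k + 1)   # diminishing stepsize for x
        theta = min(1.0, tau * eta)     # diminishing averaging weight for m, v
        # Symmetric Pareto noise with shape 1.5: finite mean, infinite variance.
        noise = [rng.choice((-1.0, 1.0)) * 0.5 * (rng.paretovariate(1.5) - 1.0)
                 for _ in x]
        g = [si + ni for si, ni in zip(l1_subgradient(x), noise)]
        g = clip(g, clip_c)             # gradient clipping tames the heavy tails
        m = [mi + theta * (gi - mi) for mi, gi in zip(m, g)]
        v = [vi + theta * (gi * gi - vi) for vi, gi in zip(v, g)]
        x = [xi - eta * mi / (math.sqrt(vi) + eps)
             for xi, mi, vi in zip(x, m, v)]
    return x


if __name__ == "__main__":
    x_final = adam_like([3.0, -2.0])
    print("final objective:", sum(abs(xi) for xi in x_final))
```

Because `m` and `v` are averaged with the same weights in this sketch, each coordinate of `m` is bounded in magnitude by the square root of the matching coordinate of `v`, so every coordinate moves by at most `eta` per iteration. That property belongs to this illustrative choice of schedules only.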
