A theoretical and empirical study of new adaptive algorithms with additional momentum steps and shifted updates for stochastic non-convex optimization

Cristian Daniel Alecsa

TL;DR

This work addresses stochastic non-convex optimization by introducing adaptive momentum methods with shifted updates, and it links accelerated gradient schemes to AMSGrad-type adaptive momentum methods through two inertial steps. The convergence theory includes finite-time bounds for $\mathbb{E}[\min_{n\le N}\|\nabla F(\vb z_n)\|^2]$ and asymptotic descent properties, under standard smoothness and stochasticity assumptions. The author complements the theory with neural-network experiments on MNIST and CIFAR-10, reporting competitive performance and stability relative to AMSGrad and other Adam-family optimizers, and supplies a PyTorch-ready formulation (AAMMSU) based on the Sutskever scheme. Overall, the paper contributes a novel AMSGrad-like algorithm with shifted updates, a theoretical framework tying momentum to acceleration, and practical guidance for implementation and for future extensions to non-smooth or more general stochastic problems.

Abstract

It is known that adaptive optimization algorithms represent the key pillar behind the rise of the Machine Learning field. In the Optimization literature, numerous studies have been devoted to accelerated gradient methods, but only recently have adaptive iterative techniques been analyzed from a theoretical point of view. In the present paper we introduce new adaptive algorithms endowed with momentum terms for stochastic non-convex optimization problems. Our purpose is to show a deep connection between accelerated methods endowed with different inertial steps and AMSGrad-type momentum methods. Our methodology is based on the framework of stochastic and possibly non-convex objective mappings, along with some assumptions that are often used in the investigation of adaptive algorithms. In addition to discussing the finite-time horizon analysis in relation to a certain final iteration and the almost sure convergence to stationary points, we also look at the worst-case iteration complexity. This is followed by an estimate for the expectation of the squared Euclidean norm of the gradient. Various computational simulations for the training of neural networks are used to support the theoretical analysis. For future research, we emphasize that there are multiple possible extensions to our work, among which we mention the investigation of non-smooth objective functions and the theoretical analysis of a more general formulation that encompasses our adaptive optimizers in a stochastic framework.
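
For orientation, the sketch below shows the baseline AMSGrad update that the abstract's "AMSGrad-type momentum methods" build on. It is not the paper's AAMMSU algorithm, whose additional inertial steps and shifted updates are not reproduced in this summary. The toy objective, the deterministic gradients, the omission of bias correction, and all names (`grad_F`, `amsgrad`, `best_sq_grad`) are assumptions made for illustration. The tracked quantity $\min_{n\le N}\|\nabla F(x_n)\|^2$ mirrors the kind of measure the finite-time bounds control, here without the expectation because no noise is injected.

```python
import math

def grad_F(x):
    # Gradient of a toy non-convex objective (an assumption, not from the paper):
    # F(x) = sum(x_i**4 / 4 - x_i**2 / 2), with stationary points at 0 and +/-1.
    return [xi ** 3 - xi for xi in x]

def amsgrad(x0, steps=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain AMSGrad on a list of floats; returns the last iterate and the
    smallest squared gradient norm seen along the trajectory."""
    x = list(x0)
    m = [0.0] * len(x)      # first-moment (momentum) estimate
    v = [0.0] * len(x)      # second-moment estimate
    v_hat = [0.0] * len(x)  # running maximum of v, the AMSGrad modification
    best_sq_grad = float("inf")
    for _ in range(steps):
        g = grad_F(x)
        best_sq_grad = min(best_sq_grad, sum(gi * gi for gi in g))
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            v_hat[i] = max(v_hat[i], v[i])  # keeps the effective step size non-increasing
            x[i] -= lr * m[i] / (math.sqrt(v_hat[i]) + eps)
    return x, best_sq_grad

x_final, best_sq_grad = amsgrad([2.0, -0.5])
```

In a stochastic setting one would replace `grad_F` with a mini-batch gradient estimate; the element-wise maximum in `v_hat` is the only difference from an Adam-style update without bias correction.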

Paper Structure

This paper contains 28 sections, 8 theorems, 142 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Let $\vb u \in\mathbb{R}^d$ be an arbitrarily given vector. Then, it follows that

Figures (6)

  • Figure 1: Comparison of optimizers
  • Figure 2: Heatmaps for LR-MNIST and AAMMSU
  • Figure 3: Batch size - learning rate evolution for LR-MNIST
  • Figure 4: Convergence profiles for CNN-CIFAR10
  • Figure 5: Batch size - learning rate evolution for CNN-CIFAR10
  • ...and 1 more figure

Theorems & Definitions (16)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 4
  • Remark 5
  • Theorem 6
  • Remark 7
  • Proposition 8
  • Remark 9
  • Example 10
  • ...and 6 more