Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis

Konstantinos Oikonomidis, Jan Quan, Panagiotis Patrinos

TL;DR

The paper studies nonlinearly preconditioned gradient methods for smooth nonconvex optimization using sigmoid-like reference functions, bridging gradient clipping and adaptive preconditioning through an anisotropic descent framework. It introduces a heavy-ball type momentum variant (m-NPGM) and a stochastic version, proving convergence under generalized smoothness via the anisotropic descent inequality and a generalized PL condition, with additional results under preconditioned Lipschitz continuity. Linear convergence up to a constant is established under 2-subhomogeneous reference functions, and stochastic analysis yields expected descent guarantees under various noise regimes. Empirical evaluations on neural networks and matrix factorization corroborate competitive performance, illustrating stability and robustness beyond traditional Lipschitz-smooth setups.

Abstract

We study nonlinearly preconditioned gradient methods for smooth nonconvex optimization problems, focusing on sigmoid preconditioners that inherently perform a form of gradient clipping akin to the widely used gradient clipping technique. Building upon this idea, we introduce a novel heavy ball-type algorithm and provide convergence guarantees under a generalized smoothness condition that is less restrictive than traditional Lipschitz smoothness, thus covering a broader class of functions. Additionally, we develop a stochastic variant of the base method and study its convergence properties under different noise assumptions. We compare the proposed algorithms with baseline methods on diverse tasks from machine learning including neural network training.
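To make the clipping behaviour of a sigmoid preconditioner concrete, here is a minimal sketch of one possible heavy ball-type preconditioned step. It is an illustration under stated assumptions, not the paper's algorithm: the map $y \mapsto y / \sqrt{1 + \|y\|^2}$ is one common sigmoid-like choice, and the update form, the function names, and the parameters `gamma`, `lam`, and `beta` are hypothetical and not taken from the text above.

```python
import math


def sigmoid_precondition(g, lam=1.0):
    """Map a gradient g (list of floats) to lam*g / sqrt(1 + ||lam*g||^2).

    Assumed sigmoid-like preconditioner: the output norm is always below 1,
    so large gradients are clipped smoothly while small ones pass through
    almost unchanged.
    """
    scaled = [lam * gi for gi in g]
    norm_sq = sum(s * s for s in scaled)
    denom = math.sqrt(1.0 + norm_sq)
    return [s / denom for s in scaled]


def heavy_ball_npgm(grad, x0, gamma=0.1, lam=1.0, beta=0.3, iters=500):
    """Hypothetical heavy ball-type loop around the preconditioned step.

    Assumed update (not confirmed by the source):
        x_next = x - gamma * P(grad(x)) + beta * (x - x_prev)
    where P is sigmoid_precondition.
    """
    x_prev = list(x0)
    x = list(x0)
    for _ in range(iters):
        d = sigmoid_precondition(grad(x), lam)
        x_next = [
            xi - gamma * di + beta * (xi - pi)
            for xi, di, pi in zip(x, d, x_prev)
        ]
        x_prev, x = x, x_next
    return x


def quartic_grad(x):
    """Gradient of f(x) = sum(x_i^4), which is not globally Lipschitz smooth."""
    return [4.0 * xi ** 3 for xi in x]


# Start far from the minimizer, where the raw gradient is very large.
x_final = heavy_ball_npgm(quartic_grad, [10.0, -5.0])
```

Because the preconditioned direction has norm below 1, each gradient contribution moves the iterate by less than `gamma`, regardless of how steep the objective is at the starting point. This is the sense in which such a preconditioner acts like gradient clipping.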

Paper Structure

This paper contains 31 sections, 14 theorems, 105 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2.2

Let the anisotropic smoothness assumption hold and let $\{x^k\}_{k \in \mathbb{N}_0}$ be the sequence of iterates generated by the momentum algorithm (m-NPGM) with $\beta \in [0, 0.5)$ and $\gamma = \tfrac{\alpha}{L}$, where $\alpha \leq 1$. Then, we have the following rate:

Figures (4)

  • Figure 1: Results for training an MLP on MNIST and ResNet-18 on CIFAR-10. The top row shows the training loss and the bottom row the test accuracy. (left) MNIST MLP; (middle) CIFAR-10 ResNet-18 without momentum; (right) CIFAR-10 ResNet-18 with momentum.
  • Figure 2: Results for the matrix factorization problem. The left panel corresponds to $r=10$, the middle panel to $r=20$, and the right panel to $r=30$. Our method, iHGDm, significantly outperforms the other methods.
  • Figure 3: Results for the stochastic implementation of the phase retrieval problem.
  • Figure 4: Results for training ResNet-34 on CIFAR-100.

Theorems & Definitions (35)

  • Definition 1.1: anisotropic descent inequality
  • Remark 1.4: connection between $(L_0, L_1)$- and anisotropic smoothness
  • Remark 2.1
  • Theorem 2.2
  • Definition 2.3
  • Theorem 2.4
  • Proposition 2.6
  • Theorem 2.7
  • Theorem 3.1
  • Proposition 3.2
  • ...and 25 more