Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

Matteo Tucat; Anirbit Mukherjee; Procheta Sen; Mingfei Sun; Omar Rivasplata

TL;DR

This paper introduces $\delta$-Regularized-GClip ($\delta$-GClip), an adaptive gradient method that regularizes gradient clipping by imposing a lower bound on the effective step size. This enables provable convergence to global minima for overparameterized, wide neural networks trained on the squared loss. The key theoretical contribution is a convergence guarantee based on the $\mu$-PL$^*$ condition: the training loss decays geometrically and the iterates stay within a finite neighborhood of the initialization, provided the network width is sufficiently large and the step-size schedule satisfies $h(\mathbf{w}_t) \in [\eta\delta, \eta]$. Empirically, $\delta$-GClip is competitive with state-of-the-art optimizers such as Adam and SGD for ResNet-18 on CIFAR-10, a VAE on Fashion-MNIST, Vision Transformers, and BERT fine-tuning, and learning-rate scheduling further improves its performance. The work demonstrates that a theoretically grounded variant of gradient clipping can match or exceed heuristic adaptive methods across diverse architectures. It also motivates future extensions to the cross-entropy loss and non-ReLU activations, as well as closer investigation of whether the PL$^*$ condition holds for large-scale transformer models.
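As a rough illustration of the step-size bound $h(\mathbf{w}_t) \in [\eta\delta, \eta]$, the following minimal sketch assumes the step size takes the form $h(\mathbf{w}) = \eta \min\{1, \max\{\delta, \gamma / \|\nabla \mathcal{L}(\mathbf{w})\|\}\}$, where $\gamma$ is a clipping threshold. This form, the name `gamma`, the helper names, and the toy one-parameter least-squares problem are all assumptions made for illustration; they are not taken from this summary.

```python
def delta_gclip_step_size(grad_norm, eta, gamma, delta):
    """Assumed step size: eta * min(1, max(delta, gamma / ||grad||)).

    The result always lies in [eta * delta, eta].
    """
    if grad_norm == 0.0:
        # gamma / 0 is treated as +inf, so the min selects 1.
        return eta
    return eta * min(1.0, max(delta, gamma / grad_norm))


def loss_and_grad(w, xs, ys):
    """Squared loss 0.5 * mean((w * x - y)^2) of a scalar linear model."""
    n = len(xs)
    residuals = [w * x - y for x, y in zip(xs, ys)]
    loss = 0.5 * sum(r * r for r in residuals) / n
    grad = sum(r * x for r, x in zip(residuals, xs)) / n
    return loss, grad


def train(w0, xs, ys, eta=0.1, gamma=1.0, delta=0.01, steps=200):
    """Full-batch gradient descent with the assumed delta-regularized step."""
    w = w0
    losses, step_sizes = [], []
    for _ in range(steps):
        loss, grad = loss_and_grad(w, xs, ys)
        h = delta_gclip_step_size(abs(grad), eta, gamma, delta)
        losses.append(loss)
        step_sizes.append(h)
        w -= h * grad
    return w, losses, step_sizes


# Toy data generated by y = 2 * x, so the global minimizer is w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w_final, losses, step_sizes = train(w0=10.0, xs=xs, ys=ys)
```

In this sketch, large gradients are clipped as in standard gradient clipping, but the factor never drops below $\delta$, so the step size cannot shrink to zero however large the gradient becomes.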

Abstract

We present and analyze a novel regularized form of the gradient clipping algorithm, proving that it converges to global minima of the loss surface of deep neural networks under the squared loss, provided that the layers are of sufficient width. The algorithm presented here, dubbed $\delta$-GClip, introduces a modification to gradient clipping that leads to a first-of-its-kind example of a step-size schedule for gradient descent that provably minimizes the training loss of deep neural nets. We also present empirical evidence that our theoretically founded $\delta$-GClip algorithm is competitive with state-of-the-art deep learning heuristics on various neural architectures, including modern transformer-based architectures. The modification we make to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality that was recently proven to hold for sufficiently wide neural networks of any depth within a neighbourhood of the initialization.

Paper Structure (26 sections, 9 theorems, 47 equations, 6 figures)

Introduction
Notation
The Main Results
Theory for $\delta$-Regularized-GClip
Experimental Evidence for The Performance of $\delta$-Regularized-GClip
Experiments with a ResNet and a VAE
ResNet-18 on CIFAR-10.
Experiments Without Learning Rate Scheduling.
Experiments With Learning Rate Scheduling.
VAE on Fashion-MNIST.
Evidence for The Performance of $\delta$-Regularized-GClip on Transformers
Related Works
Literature Review of Theory for Adam.
Review of Theory for Adaptive Gradient Methods Training Neural Nets.
Literature Review of Gradient Clipping.
...and 11 more sections

Key Result

Theorem 2.1

Suppose an overparametrized neural network $f$ is being trained using the squared loss $\mathcal{L}(\mathbf{w})$, as specified in the paper's setup definition. Then $\exists\, \lambda_0 > 0$ such that, for any sufficiently small $\eta, \mu, \delta > 0$, if the minimum width of the network layers satisfies a sufficiently large lower bound, then one can initialize the weights such that, with high probability over the initialization, the above loss is $\mu$-$\mathrm{PL}^*$
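To see why a step size bounded in $[\eta\delta, \eta]$ is compatible with geometric decay of the loss, the following worked sketch may help. It rests on assumptions not stated in this summary: that $\mathcal{L}$ is $\beta$-smooth, that $\eta \le 1/\beta$, and that the $\mu$-$\mathrm{PL}^*$ condition is normalized as $\|\nabla \mathcal{L}(\mathbf{w})\|^2 \ge \mu\, \mathcal{L}(\mathbf{w})$ along the iterates. The paper's own constants and normalization may differ, and the argument that the iterates remain in the region where the condition holds is not reproduced here.

```latex
% Update rule: w_{t+1} = w_t - h_t \nabla L(w_t), with h_t \in [\eta\delta, \eta].
\begin{align*}
\mathcal{L}(\mathbf{w}_{t+1})
  &\le \mathcal{L}(\mathbf{w}_t)
     - h_t \Bigl(1 - \tfrac{\beta h_t}{2}\Bigr)
       \|\nabla \mathcal{L}(\mathbf{w}_t)\|^2
     && \text{($\beta$-smoothness)} \\
  &\le \mathcal{L}(\mathbf{w}_t)
     - \tfrac{h_t}{2}\, \|\nabla \mathcal{L}(\mathbf{w}_t)\|^2
     && \text{(since } h_t \le \eta \le 1/\beta\text{)} \\
  &\le \mathcal{L}(\mathbf{w}_t)
     - \tfrac{\eta\delta}{2}\, \mu\, \mathcal{L}(\mathbf{w}_t)
     && \text{(since } h_t \ge \eta\delta \text{ and } \mu\text{-PL}^*\text{)} \\
  &= \Bigl(1 - \tfrac{\mu\eta\delta}{2}\Bigr)\, \mathcal{L}(\mathbf{w}_t),
\end{align*}
% Unrolling the recursion over t steps:
\[
\mathcal{L}(\mathbf{w}_t)
  \le \Bigl(1 - \tfrac{\mu\eta\delta}{2}\Bigr)^{t} \mathcal{L}(\mathbf{w}_0).
\]
```

Under these assumptions the lower bound $\eta\delta$ is what keeps the contraction factor uniformly below 1; with unregularized clipping the step size can become arbitrarily small when gradients are large, and no such uniform rate follows from this argument.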

Figures (6)

Figure 1: $\delta$-Regularized-GClip ($\delta$-GClip) is competitive against SOTA heuristics for training ResNet-18 on CIFAR-10 without learning-rate scheduling.
Figure 2: $\delta$-Regularized-GClip ($\delta$-GClip) outperforms other optimizers for training ResNet-18 on CIFAR-10 with learning-rate scheduling.
Figure 3: $\delta$-Regularized-GClip ($\delta$-GClip) matches the best heuristics for training a ResNet-18 on CIFAR-10 with learning-rate scheduling, but no weight decay.
Figure 4: $\delta$-Regularized-GClip ($\delta$-GClip) is competitive against SOTA heuristics for training a VAE on the Fashion-MNIST dataset with learning-rate scheduling.
Figure 5: $\delta$-GClip can be seen to be competitive against the SOTA heuristic of using Adam for training a Vision Transformer on the CIFAR-10 dataset.
...and 1 more figures

Theorems & Definitions (23)

Definition 1: GClip
Definition 2: $\delta$-Regularized-GClip
Definition 3: $\mu$-PL* Condition
Definition 4
Theorem 2.1: $\delta$-Regularized-GClip Provably Trains Wide and Deep Neural Nets
Remark
Definition 5: Stochastic $\delta$-Regularized-GClip
Theorem 2.2: Convergence of Stochastic $\delta$-Regularized-GClip
Theorem 4.1
Lemma 5.1
...and 13 more
