To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

Noah Marshall; Ke Liang Xiao; Atish Agarwala; Elliot Paquette

TL;DR

A theoretical analysis of the learning dynamics of clipped SGD in the limit of large intrinsic dimension (a model- and dataset-dependent notion of dimensionality) yields a deterministic equation that describes the evolution of the loss. This equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data.

Abstract

The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss and demonstrate that this equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data. We show that with Gaussian noise, clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits when the clipping threshold is tuned. We propose a simple heuristic for near-optimal scheduling of the clipping threshold which requires tuning only one hyperparameter. We conclude with a discussion of the links between high-dimensional clipping and neural network training.
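
The setting described in the abstract (streaming SGD on a least squares problem, with the stochastic gradient clipped before each step) can be illustrated with a short simulation. This is a minimal sketch under assumptions that are not taken from the paper: isotropic Gaussian data, a step size of `eta / d`, a clipping threshold of `c * sqrt(d)`, and the hypothetical function names `clip` and `clipped_streaming_sgd`. It is not the authors' experimental code.

```python
import numpy as np


def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)


def clipped_streaming_sgd(d=200, steps=4000, eta=0.2, c=1.0,
                          noise="gaussian", seed=0):
    """Run clipped streaming SGD on y = <x, theta_star> + eps.

    Assumptions (illustrative only): x ~ N(0, I_d), step size eta / d,
    and clipping threshold c * sqrt(d).
    Returns the excess risk 0.5 * ||theta - theta_star||^2 after each step.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)
    theta = np.zeros(d)
    threshold = c * np.sqrt(d)
    risks = np.empty(steps)
    for k in range(steps):
        x = rng.standard_normal(d)  # one fresh sample per step (streaming)
        if noise == "gaussian":
            eps = rng.standard_normal()
        else:
            eps = rng.standard_t(df=1)  # Student-t with 1 DOF, i.e. Cauchy
        y = x @ theta_star + eps
        grad = (x @ theta - y) * x  # gradient of 0.5 * (<x, theta> - y)^2
        theta = theta - (eta / d) * clip(grad, threshold)
        risks[k] = 0.5 * np.sum((theta - theta_star) ** 2)
    return risks


if __name__ == "__main__":
    for noise in ("gaussian", "cauchy"):
        risks = clipped_streaming_sgd(noise=noise)
        print(noise, "first risk:", risks[0], "last risk:", risks[-1])
```

Because each clipped update has norm at most `eta * c / sqrt(d)`, the iterates stay finite even under Cauchy noise, where the unclipped gradient has no finite variance.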

Paper Structure (46 sections, 15 theorems, 189 equations, 9 figures)

Introduction
Related work:
Problem setup
Clipped homogenized SGD
Remark:
Extracting deterministic dynamics:
Stability analysis
When does clipped SGD outperform unclipped SGD?
Isotropic data
Anisotropic data
A heuristic for the optimal clipping schedule
Conclusion
Full formulation of Theorem \ref{thm:main_thm_gauss} with non-Gaussian data
Proof of main theorems
Proof of Theorem \ref{thm:main_thm_gauss}
...and 31 more sections

Key Result

Theorem 1

Suppose that Assumptions ass:linear_plus_noise_targets, ass:data_covariance and ass:clipping_and_stepsie_params hold. Suppose that $\boldsymbol{\Theta}_t$ and $\boldsymbol{\theta}_k$ are independent realizations of C-HSGD and C-SGD with equal, deterministic initial conditions. Let $\overline{c} = \sup_t c(t)$ and … with probability $1-e^{-u}$ and provided the right-hand side is less than $1$. The stochastic proce…

Figures (9)

Figure 1: Comparison of C-SGD, C-HSGD and their deterministic equivalent (ODE) with Gaussian noise or Cauchy noise (a member of the Student-t family). The solution to the unclipped ODE is shown for reference in the Gaussian case. Unclipped SGD does not converge under Cauchy noise. We also compare clipped SGD with CIFAR10 as well as Wikitext2 data. In all cases, the deterministic equivalent ODE closely matches the actual path of clipped SGD. Experiment details are available in Appendix \ref{app:exp_details}.
Figure 2: The CSC (\ref{eq:clip_stability_criterion}) and CCC (\ref{eq:clip_comparison_criterion}) across various noise distributions: Gaussian (Gau), Rademacher-like (Rad), uniform on $[-M,M]$ (Uni), and symmetrized exponential (Exp) noise. The CSC is computed with $R = 3,\ \sigma = 9,\ p = 0.7$; the CCC figure uses $R = 3,\ \sigma = 5,\ p = 0.2$ (where $p$ is a parameter for Rademacher-like noise). Parameters are chosen to illustrate different behaviours.
Figure 3: The maximum over $c$ of the CCC (\ref{eq:clip_comparison_criterion}) for various values of the risk. Notice that the maximum value of the CCC for both uniform and Gaussian noise is $1$, corresponding to unclipped SGD. Plots are computed with $\sigma = 7,\ p = 0.5$, where $p$ is a parameter in the Rademacher-like noise.
Figure 4: Results of clipped versus unclipped SGD under the setting of Theorem \ref{thm:anisotropic_when_clipping_better}. We compare the optimal max-CCC (\ref{eq:clip_comparison_criterion}) schedule to the heuristic schedule in Equation \ref{eq:approx_opt_clip}. Notice that clipping cannot improve SGD in the setting with Gaussian noise, while it noticeably improves performance with Rademacher-like noise. Moreover, the heuristic schedule and the optimal schedule perform nearly identically. The unclipped learning rate is held constant at $\eta = 0.4$, with $\sigma = 0.8$. We compare Gaussian and Rademacher-like noise with $p = 0.2$. SGD is presented with $80\%$ confidence intervals over $100$ runs.
Figure 5: The CSC and CCC where the noise is Student-t distributed. We hold the variance fixed and vary the degrees-of-freedom (DOF) parameter. Notice how, for high DOF, the thresholds resemble Gaussian behaviour (compare to Figure 2 in the main paper). Meanwhile, for small DOF, both the CSC and CCC are high, suggesting clipping is particularly effective. This reflects the heavy-tailed behaviour of the small-DOF Student-t distribution.
...and 4 more figures

Theorems & Definitions (25)

Definition 1
Definition 2: Intrinsic Dimension
Definition 3: Clipped Homogenized SGD
Theorem 1
Example 1: Isotropic data
Theorem 2
Theorem 3
Theorem 4
proof
Theorem 5
...and 15 more
