
To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

Noah Marshall, Ke Liang Xiao, Atish Agarwala, Elliot Paquette

TL;DR

A theoretical analysis of the learning dynamics in the limit of large intrinsic dimension (a model- and dataset-dependent notion of dimensionality) yields a deterministic equation that describes the evolution of the loss. This equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data.

Abstract

The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss and demonstrate that this equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data. We show that with Gaussian noise, clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits when the clipping threshold is tuned. We propose a simple heuristic for near-optimal scheduling of the clipping threshold which requires the tuning of only one hyperparameter. We conclude with a discussion about the links between high-dimensional clipping and neural network training.
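
To make the setting concrete, the following is a minimal sketch of clipped streaming SGD on a least squares problem. It is an illustration under stated assumptions, not the authors' experimental code: the isotropic Gaussian data, the $\eta/d$ step-size scaling, the $c\sqrt{d}$ threshold scaling, and all names (`clip`, `clipped_streaming_sgd`, `theta_star`) are hypothetical choices made here for the example.

```python
import numpy as np

def clip(g, c):
    """Norm clipping: rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def clipped_streaming_sgd(d=200, steps=4000, eta=0.5, c=1.0,
                          noise="gaussian", seed=0):
    """Run one pass of clipped SGD, drawing a fresh sample at every step.

    Returns the distance-based risk 0.5 * ||theta - theta_star||^2 per step.
    Assumptions (not from the paper): isotropic data, step size eta / d,
    clipping threshold c * sqrt(d).
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)  # hypothetical ground truth
    theta = np.zeros(d)
    risks = np.empty(steps)
    for k in range(steps):
        x = rng.standard_normal(d)            # streaming: new data point each step
        if noise == "gaussian":
            eps = rng.standard_normal()
        else:
            eps = rng.standard_t(df=1.0)      # Student-t with 1 DOF, i.e. Cauchy
        y = x @ theta_star + eps              # linear target plus label noise
        grad = (x @ theta - y) * x            # gradient of 0.5 * (x @ theta - y)^2
        theta -= (eta / d) * clip(grad, c * np.sqrt(d))
        risks[k] = 0.5 * np.sum((theta - theta_star) ** 2)
    return risks

if __name__ == "__main__":
    for kind in ("gaussian", "cauchy"):
        r = clipped_streaming_sgd(noise=kind)
        print(kind, "mean risk over last 500 steps:", r[-500:].mean())
```

Because every update has norm at most $\eta c / \sqrt{d}$, the iterates stay finite even under Cauchy noise, where the unclipped gradient has no finite variance. Passing a very large `c` effectively disables clipping, which gives an unclipped baseline for comparison.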

Paper Structure (46 sections, 15 theorems, 189 equations, 9 figures)


Key Result

Theorem 1

Suppose that Assumptions ass:linear_plus_noise_targets, ass:data_covariance and ass:clipping_and_stepsie_params hold. Suppose that $\boldsymbol{\Theta}_t$ and $\it_k$ are independent realizations of C-HSGD and C-SGD with equal, deterministic initial conditions. Let $\overline{c} = \sup_t c(t)$ and [...] with probability $1-e^{-u}$, provided the right-hand side is less than $1$. The stochastic proce[...]

Figures (9)

  • Figure 1: Comparison of C-SGD, C-HSGD and their deterministic equivalent (ODE) with Gaussian or Cauchy noise, both from the Student-t family. The solution to the unclipped ODE is shown for reference in the Gaussian case. Unclipped SGD does not converge under Cauchy noise. We also compare clipped SGD with CIFAR10 as well as Wikitext2 data. In all cases, the deterministic equivalent ODE closely matches the actual path of clipped SGD. Experiment details are available in Appendix \ref{app:exp_details}.
  • Figure 2: The CSC (\ref{eq:clip_stability_criterion}) and CCC (\ref{eq:clip_comparison_criterion}) across various noise distributions: Gaussian (Gau), Rademacher-like (Rad), uniform on $[-M,M]$ (Uni), and symmetrized exponential (Exp) noise. The CSC is computed with $R = 3,\ \sigma = 9,\ p = 0.7$; the CCC figure uses $R = 3,\ \sigma = 5,\ p = 0.2$ (where $p$ is a parameter for Rademacher-like noise). Parameters are chosen to illustrate different behaviours.
  • Figure 3: The maximum over $c$ of the CCC (\ref{eq:clip_comparison_criterion}) for various values of the risk. Notice that the maximum value of the CCC for both uniform and Gaussian noise is $1$, corresponding to unclipped SGD. Plots are computed with $\sigma = 7,\ p = 0.5$, where $p$ is a parameter in the Rademacher-like noise.
  • Figure 4: Results of clipped versus unclipped SGD under the setting of Theorem \ref{thm:anisotropic_when_clipping_better}. We compare the optimal max-CCC (\ref{eq:clip_comparison_criterion}) schedule to the heuristic schedule in Equation \ref{eq:approx_opt_clip}. Notice that clipping cannot improve SGD in the setting with Gaussian noise, while it noticeably improves performance with Rademacher-like noise. Moreover, the heuristic schedule and the optimal schedule perform nearly identically. The unclipped learning rate is held constant at $\eta = 0.4$, with $\sigma = 0.8$. We compare Gaussian and Rademacher-like noise with $p = 0.2$. SGD is presented with $80\%$ confidence intervals over $100$ runs.
  • Figure 5: The CSC and CCC where the noise is Student-t distributed. We hold the variance fixed and vary the degrees-of-freedom (DOF) parameter. Notice how for high DOF, the thresholds resemble Gaussian behaviour (compare to Figure 2 in the main paper). Meanwhile, for small DOF, both the CSC and CCC are high, suggesting clipping is particularly effective. This reflects the heavy-tailed behaviour of the small-DOF Student-t distribution.
  • ...and 4 more figures

Theorems & Definitions (25)

  • Definition 1
  • Definition 2: Intrinsic Dimension
  • Definition 3: Clipped Homogenized SGD
  • Theorem 1
  • Example 1: Isotropic data
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • proof
  • Theorem 5
  • ...and 15 more