Curvature in the Looking-Glass: Optimal Methods to Exploit Curvature of Expectation in the Loss Landscape

Jed A. Duersch, Tommie A. Catanach, Alexander Safonov, Jeremy Wendt

TL;DR

This work presents a new conceptual framework to understand how curvature of expected changes in loss emerges in architectures with many rectified linear units, and derives the optimal modification to quasi-Newton steps that incorporate both glass and Hessian terms.

Abstract

Harnessing the local topography of the loss landscape is a central challenge in advanced optimization tasks. By accounting for the effect of potential parameter changes, we can alter the model more efficiently. Contrary to standard assumptions, we find that the Hessian does not always approximate loss curvature well, particularly near gradient discontinuities, which commonly arise in deep learning architectures. We present a new conceptual framework to understand how curvature of expected changes in loss emerges in architectures with many rectified linear units. Each ReLU creates a parameter boundary that, when crossed, induces a pseudorandom gradient perturbation. Our derivations show how these discontinuities combine to form a glass-like structure, similar to amorphous solids that contain microscopic domains of strong, but random, atomic alignment. By estimating the density of the resulting gradient variations, we can bound how the loss may change with parameter movement. Our analysis includes the optimal kernel and sample distribution for approximating glass density from ordinary gradient evaluations. We also derive the optimal modification to quasi-Newton steps that incorporate both glass and Hessian terms, as well as certain exactness properties that are possible with Nesterov-accelerated gradient updates. Our algorithm, Alice, tests these techniques to determine which curvature terms are most impactful for training a given architecture and dataset. Additional safeguards enforce stable exploitation through step bounds that expand on the functionality of Adam. These theoretical and experimental tools lay groundwork to improve future efforts (e.g., pruning and quantization) by providing new insight into the loss landscape.
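The claim that the Hessian can miss curvature near gradient discontinuities can be illustrated with a toy model. The sketch below is not the paper's Alice algorithm or its glass-density estimator. It assumes a hypothetical one-dimensional function $f(x) = \sum_i a_i \max(x - b_i, 0)$ with random weights $a_i$ and random boundary locations $b_i$, and the names `mean_sq_grad_change`, `n_units`, and `n_trials` are illustrative. The second derivative of this function is zero almost everywhere, yet the gradient still changes as boundaries are crossed. Since the expected number of crossed boundaries grows linearly with the step size in this toy, the fitted exponent of the mean squared gradient change should come out near 1, below the $p = 2$ that a Hessian term produces.

```python
import numpy as np

def mean_sq_grad_change(delta, n_units=2000, n_trials=1000, seed=0):
    """Estimate E[(f'(x0 + delta) - f'(x0))^2] for f(x) = sum_i a_i * max(x - b_i, 0).

    Toy setup (an assumption, not taken from the paper): weights a_i are
    standard normal and boundaries b_i are uniform on [-1, 1].
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_trials, n_units))      # random output weights
    b = rng.uniform(-1.0, 1.0, (n_trials, n_units))   # ReLU boundary locations
    x0 = rng.uniform(-0.5, 0.5, (n_trials, 1))        # random base points
    # f'(x) = sum_i a_i * [x > b_i] is piecewise constant, so f'' = 0 almost everywhere.
    g0 = np.sum(a * (x0 > b), axis=1)
    g1 = np.sum(a * (x0 + delta > b), axis=1)
    return float(np.mean((g1 - g0) ** 2))

deltas = np.array([0.01, 0.02, 0.04, 0.08])
variations = np.array([mean_sq_grad_change(d) for d in deltas])

# Slope of the log-log fit estimates the exponent p in variation ~ delta**p.
p, _ = np.polyfit(np.log(deltas), np.log(variations), 1)
print(f"fitted exponent p = {p:.2f}")
```

Reusing the same seed for every step size keeps the random functions identical across the sweep, so the fitted slope reflects the step size rather than resampling noise.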

Paper Structure

This paper contains 32 sections, 6 theorems, 37 equations, 6 figures, and 3 algorithms.

Key Result

Theorem 1

Consider a network with many ReLUs. Let a single unit be $z = \max(y,0) \in \mathbb{R}$. To examine the effect of a small parameter perturbation, $\vPa = \vPaCnt + \vPaPrt$, on the gradient variations defined by \ref{eq:grd_var}, we hold network inputs fixed and assume each ReLU input is locally linearly ... Suppose a subset of ReLUs have pre-activation values that are uniformly distributed at random in an ...

Figures (6)

  • Figure 1: Exponential dependence of gradient variations on perturbations from \ref{eq:power_law}. A Hessian term yields $p = 2$. No low-order Taylor series can give $p < 2$.
  • Figure 2: Illustration of gradient glass in 2D (top). Gray lines are domain boundaries due to changing a ReLU state. Blue arrows show the resulting gradient perturbations. We can compute the curvature of expected changes (bottom) to the gradient and loss from the density of these variations.
  • Figure 3: ReLU extrapolation. Hessian fitting cannot account for gradient discontinuities, whereas curvature matching gradient changes provides a better fit on a local interval.
  • Figure 4: Alice training with various curvature terms: a) full QN steps, b) NAQ steps (Theorem 6).
  • Figure 5: Training effect of increased exploitation bound.
  • ...and 1 more figure
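
The idea behind the Figure 3 caption can be sketched for a single ReLU. This is an illustration only, not the paper's fitting procedure: the base point `x0`, the half-width `h`, and the secant rule used for the curvature `c` are all assumptions chosen for the example. The Hessian-based quadratic model sees zero second derivative at `x0` and so extrapolates linearly, while a curvature chosen to match the gradient change across the interval bends toward the kink.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Derivative of max(x, 0) away from the kink at x = 0.
    return (np.asarray(x) > 0.0).astype(float)

x0, h = -0.1, 0.5                        # hypothetical base point and interval half-width
xs = np.linspace(x0 - h, x0 + h, 2001)   # local interval that contains the kink
dx = xs - x0

# Hessian (Taylor) model: the second derivative of a ReLU is 0 at x0,
# so the quadratic term vanishes and the model is linear.
hessian_model = relu(x0) + relu_grad(x0) * dx

# Secant curvature (an assumed rule): match the gradient change across the interval.
c = (relu_grad(x0 + h) - relu_grad(x0 - h)) / (2.0 * h)
secant_model = relu(x0) + relu_grad(x0) * dx + 0.5 * c * dx**2

mse_hessian = float(np.mean((hessian_model - relu(xs)) ** 2))
mse_secant = float(np.mean((secant_model - relu(xs)) ** 2))
print(f"Hessian model MSE: {mse_hessian:.4f}, secant model MSE: {mse_secant:.4f}")
```

With these particular values the secant model has the lower mean squared error over the interval; other choices of `x0` and `h` may change the margin.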

Theorems & Definitions (6)

  • Theorem 1: Glass from ReLUs
  • Theorem 2: Optimal Kernel for Estimating Diagonal
  • Theorem 3: Optimal Perturbation Density
  • Theorem 4: Curvature of Expectation in Glass Loss
  • Theorem 5: Optimal Modification to Quasi-Newton
  • Theorem 6: Exact Nesterov Accelerated Quasi-Newton