Neglected Hessian component explains mysteries in Sharpness regularization

Yann N. Dauphin, Atish Agarwala, Hossein Mobahi

TL;DR

The paper argues that a neglected component of the Hessian, the Nonlinear Modeling Error (NME), is crucial for understanding how second-order information affects generalization in deep learning. It decomposes the Hessian as $\nabla_{\boldsymbol{\theta}}^2 \mathcal{L} = \mathbf{J}^{\top} \mathbf{H}_{\mathbf{z}} \mathbf{J} + \nabla_{\mathbf{z}}\mathcal{L} \cdot \nabla^2_{\boldsymbol{\theta}} \mathbf{z}$, where the first term is the Gauss-Newton (GN) matrix, associated with feature exploitation, and the second term is the NME, associated with feature exploration. Empirically, the second derivative of the activation function strongly shapes the NME: gradient penalties are effective for GELU-like activations but often harmful for ReLU unless augmented with explicit second-derivative information. Conversely, Hessian penalties that directly regularize the NME can harm generalization, while Gauss-Newton penalties (which ignore the NME in the loss but incorporate it in the updates) can improve performance. The work also draws connections to SAM, showing that it implicitly samples NME information and is less sensitive to the choice of activation, and it suggests design principles for activation functions and second-order methods that better leverage second-order information in deep networks.
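The decomposition above can be checked numerically on a very small example. The sketch below is an illustration only, not the paper's experimental setup: it assumes a hypothetical two-parameter model $z = w_2 \tanh(w_1 x)$ with squared-error loss on a single datapoint, builds the GN and NME terms by hand, and compares their sum against a finite-difference Hessian. The names `X`, `Y`, `gauss_newton`, `nme`, and `decomposition_error` are invented for this example.

```python
import math

# Hypothetical toy setup (not from the paper): one input X, one target Y,
# model z = w2 * tanh(w1 * X), loss L = 0.5 * (z - Y)^2, theta = (w1, w2).
X, Y = 1.5, 0.3


def loss(w1, w2):
    return 0.5 * (w2 * math.tanh(w1 * X) - Y) ** 2


def gauss_newton(w1, w2):
    """J^T H_z J, with H_z = 1 for squared error and J = dz/dtheta."""
    t = math.tanh(w1 * X)
    jac = [w2 * (1.0 - t * t) * X, t]
    return [[jac[a] * jac[b] for b in range(2)] for a in range(2)]


def nme(w1, w2):
    """(dL/dz) * d^2 z / dtheta^2, the Nonlinear Modeling Error term."""
    t = math.tanh(w1 * X)
    residual = w2 * t - Y          # dL/dz
    d1 = 1.0 - t * t               # tanh'
    d2 = -2.0 * t * d1             # tanh''
    d2z = [[w2 * d2 * X * X, d1 * X], [d1 * X, 0.0]]
    return [[residual * d2z[a][b] for b in range(2)] for a in range(2)]


def finite_difference_hessian(w1, w2, h=1e-4):
    """Central-difference Hessian of the loss, used only as a reference."""
    theta = [w1, w2]

    def shifted(a, sa, b, sb):
        p = list(theta)
        p[a] += sa * h
        p[b] += sb * h
        return loss(p[0], p[1])

    return [[(shifted(a, 1, b, 1) - shifted(a, 1, b, -1)
              - shifted(a, -1, b, 1) + shifted(a, -1, b, -1)) / (4.0 * h * h)
             for b in range(2)] for a in range(2)]


def decomposition_error(w1, w2):
    """Largest entrywise gap between GN + NME and the reference Hessian."""
    gn = gauss_newton(w1, w2)
    ne = nme(w1, w2)
    ref = finite_difference_hessian(w1, w2)
    return max(abs(gn[a][b] + ne[a][b] - ref[a][b])
               for a in range(2) for b in range(2))


if __name__ == "__main__":
    print("GN :", gauss_newton(0.7, -1.2))
    print("NME:", nme(0.7, -1.2))
    print("max |GN + NME - H_fd|:", decomposition_error(0.7, -1.2))
```

In this toy model the NME is proportional to the residual $z - y$, so it is zero wherever the model fits the datapoint exactly, which is one way to see why the term is often dropped near interpolation.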

Abstract

Recent work has shown that methods like SAM, which either explicitly or implicitly penalize second-order information, can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition of the Hessian can be quantitatively interpreted as separating feature exploitation from feature exploration. The feature exploration, which can be described by the Nonlinear Modeling Error matrix (NME), is commonly neglected in the literature since it vanishes at interpolation. Our work shows that the NME is in fact important, as it can explain why gradient penalties are sensitive to the choice of activation function. Using this insight, we design interventions to improve performance. We also provide evidence that challenges the long-held equivalence of weight noise and gradient penalties. This equivalence relies on the assumption that the NME can be ignored, which we find does not hold for modern networks since they involve significant feature learning. We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
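As a worked illustration of where the NME enters these regularizers, consider two standard identities. They are a sketch under assumptions not stated in the abstract: isotropic Gaussian weight noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, a second-order Taylor truncation, and a gradient penalty of the form $\rho \lVert \nabla_{\boldsymbol{\theta}} \mathcal{L} \rVert$. The notation follows the Hessian decomposition in the TL;DR.

$$\mathbb{E}_{\boldsymbol{\epsilon}}\left[\mathcal{L}(\boldsymbol{\theta} + \boldsymbol{\epsilon})\right] \approx \mathcal{L}(\boldsymbol{\theta}) + \frac{\sigma^2}{2} \operatorname{tr}\left(\nabla_{\boldsymbol{\theta}}^2 \mathcal{L}\right) = \mathcal{L}(\boldsymbol{\theta}) + \frac{\sigma^2}{2} \left[ \operatorname{tr}\left(\mathbf{J}^{\top} \mathbf{H}_{\mathbf{z}} \mathbf{J}\right) + \operatorname{tr}\left(\nabla_{\mathbf{z}}\mathcal{L} \cdot \nabla^2_{\boldsymbol{\theta}} \mathbf{z}\right) \right]$$

$$\nabla_{\boldsymbol{\theta}} \left( \rho \lVert \nabla_{\boldsymbol{\theta}} \mathcal{L} \rVert \right) = \rho \, \frac{\nabla_{\boldsymbol{\theta}}^2 \mathcal{L} \, \nabla_{\boldsymbol{\theta}} \mathcal{L}}{\lVert \nabla_{\boldsymbol{\theta}} \mathcal{L} \rVert} = \rho \, \frac{\left(\mathbf{J}^{\top} \mathbf{H}_{\mathbf{z}} \mathbf{J} + \nabla_{\mathbf{z}}\mathcal{L} \cdot \nabla^2_{\boldsymbol{\theta}} \mathbf{z}\right) \nabla_{\boldsymbol{\theta}} \mathcal{L}}{\lVert \nabla_{\boldsymbol{\theta}} \mathcal{L} \rVert}$$

In the first identity, dropping the second trace leaves a Gauss-Newton trace penalty, while keeping it also regularizes the NME. In the second, the update direction of a gradient penalty involves a Hessian-vector product, so it depends on the NME even though the penalty itself contains only first derivatives.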

Paper Structure (23 sections, 49 equations, 6 figures)

Figures (6)

  • Figure 1: Loss (left) and Nonlinear Modeling Error matrix (NME) norm (right) as a function of $2$ parameters in the same hidden layer of an MLP (MSE loss, one datapoint). For ReLU activation, the model is piecewise multilinear, and piecewise linear in parameters of the same layer, so the loss is piecewise quadratic in those parameters (left). Little NME information is accessible pointwise; the main features are the boundaries of the piecewise linear regions (blue, right). For $\beta$-GELU, the NME magnitude is high only within distance $1/\beta$ of those boundaries. The NME therefore encodes information about the utility of switching between piecewise multilinear regions.
  • Figure 2: Test accuracy vs. $\rho$ for ReLU and GELU networks trained with a gradient penalty ($p=1$, averaged over $5$ seeds). Without regularization, performance is similar in both cases; with regularization, test accuracy increases for GELU until $\rho = 0.1$ and decreases for ReLU over a similar range.
  • Figure 3: Accuracy vs. $\beta$ for SGD and SGD with gradient penalty ($\rho=0.1$) using $\beta$-GELU activations (average of $5$ seeds). Accuracy decreases with larger $\beta$ with the gradient penalty but not without it. Since our theory suggests that the sparsity of the NME increases with $\beta$, this is evidence that the NME has a significant impact on gradient penalties.
  • Figure 4: Test accuracy as $\rho$ increases for Augmented ReLU and Diminished GELU (average of $5$ seeds). The addition or removal of information from the NME controls the effectiveness of the gradient penalty.
  • Figure 5: Test Accuracy as $\sigma^2$ increases across different datasets and activation functions averaged over $5$ seeds. Large $\sigma^2$ reveals a stark contrast between the Gauss-Newton trace penalty (excluding NME) and methods incorporating it, highlighting the NME's influence.
  • ...and 1 more figure
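The dependence on $\beta$ described in the captions for Figures 1 and 3 can be illustrated through the second derivative of the activation. The sketch below assumes the definition $\beta\text{-GELU}(x) = x\,\Phi(\beta x)$, with $\Phi$ the standard Gaussian CDF; this definition is an assumption here, since it is not spelled out in the text above. Under it, the second derivative is $\beta\,\varphi(\beta x)\,(2 - \beta^2 x^2)$, where $\varphi$ is the Gaussian density. It is large near $x = 0$ and decays rapidly once $|x|$ exceeds a few multiples of $1/\beta$. All function names are invented for this example.

```python
import math


def gaussian_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def gaussian_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def beta_gelu(x, beta):
    # Assumed definition: x * Phi(beta * x); beta = 1 gives the usual GELU.
    return x * gaussian_cdf(beta * x)


def beta_gelu_second_derivative(x, beta):
    # d^2/dx^2 [x * Phi(beta * x)] = beta * pdf(beta * x) * (2 - (beta * x)^2)
    u = beta * x
    return beta * gaussian_pdf(u) * (2.0 - u * u)


def curvature_table(betas=(1.0, 4.0, 16.0), xs=(0.0, 0.25, 1.0)):
    """Rows of (beta, [second derivative at each x in xs])."""
    return [(b, [beta_gelu_second_derivative(x, b) for x in xs])
            for b in betas]


if __name__ == "__main__":
    # Each row shows the curvature at pre-activations 0, 0.25 and 1.
    for beta, row in curvature_table():
        print(f"beta={beta:5.1f}", ["%.3e" % v for v in row])
```

As $\beta$ grows, the curvature at the kink increases while the curvature at any fixed nonzero pre-activation shrinks toward zero, which is consistent with the captions' description of the NME becoming sparser and concentrating near the region boundaries.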