Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach

Hongyang R. Zhang, Dongyue Li, Haotian Ju

TL;DR

The paper tackles the challenge of improving generalization for over-parameterized neural networks by explicitly regularizing the Hessian trace to favor flat loss surfaces. It introduces Noise Stability Optimization (NSO), a two-point noise injection scheme that yields an approximately unbiased estimate of the Hessian trace and can be combined with traditional regularizers. A data-dependent PAC-Bayes generalization bound ties the Hessian trace and weight-radius to generalization, and the authors provide a proof sketch anchored in a Gaussian posterior and Taylor expansion. Empirically, NSO reduces the Hessian's trace and largest eigenvalue, improves test accuracy on multiple tasks, and remains effective when applied to pretraining (e.g., CLIP) and chain-of-thought fine-tuning, outperforming several sharpness-minimization baselines under matched compute. The work also analyzes convergence rates and a matrix sensing case, offering both theoretical guarantees and practical guidance for deploying Hessian-focused regularization in diverse settings.

Abstract

The training of over-parameterized neural networks has received much study in recent literature. An important consideration is the regularization of over-parameterized networks due to their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss, leading to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing the noise injection via adding noise to the weight matrices before backpropagation presents limited empirical improvements. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance of the first-order Taylor's expansion term on the Hessian. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm can outperform prior approaches on sharpness-reduced training, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, the trace of the Hessian reduces by 15.8%, and the largest eigenvalue is reduced by 9.7% with our approach. We also find that the regularization of the Hessian can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization in pretraining multimodal CLIP models and chain-of-thought fine-tuning.
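
The claim that the two-point estimate removes the first-order Taylor term can be checked with a short expansion. The following is a sketch under stated assumptions, not the authors' proof: $W$ is treated as a flattened weight vector, $f$ is assumed smooth enough for a second-order expansion with a cubic remainder, and $U$ is drawn from an isotropic Gaussian $\mathcal{N}(0, \sigma^2 \mathrm{Id})$.

```latex
\begin{align*}
f(W \pm U) &= f(W) \pm \langle U, \nabla f(W) \rangle
  + \tfrac{1}{2} U^\top \nabla^2 f(W)\, U + O(\|U\|^3), \\
\tfrac{1}{2}\big( f(W + U) + f(W - U) \big) &= f(W)
  + \tfrac{1}{2} U^\top \nabla^2 f(W)\, U + O(\|U\|^3), \\
\mathbb{E}_{U}\big[ \tfrac{1}{2} U^\top \nabla^2 f(W)\, U \big]
  &= \tfrac{\sigma^2}{2} \mathrm{Tr}\big[ \nabla^2 f(W) \big].
\end{align*}
```

The linear term $\langle U, \nabla f(W) \rangle$ has zero mean under one-sided injection but still contributes variance to each sample. Averaging the $+U$ and $-U$ evaluations cancels it exactly for every draw of $U$, leaving the Hessian term as the leading perturbation.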

Paper Structure (41 sections, 16 theorems, 160 equations, 6 figures, 6 tables, 1 algorithm)

Key Result

Theorem 2.1

Assume that the loss function $\ell$ is bounded between $0$ and $C$ for a fixed constant $C > 0$ on the data distribution $\mathcal{D}$. Suppose $\ell(f_W(\cdot), \cdot)$ is twice-differentiable in $W$ and the Hessian matrix $\nabla^2[\ell(f_W(\cdot), \cdot)]$ is Lipschitz continuous within the hypo and the $\ell_2$-norm of $W$ is at most $r$ for any $W \in \mathcal{H}$. Then, for any $W$ in $\mat

Figures (6)

  • Figure 1: An illustration of one update step in our algorithm. At each iteration $i$, we sample a random variable $U_i$ from a zero-mean distribution $\mathcal{P}$ (e.g., an isotropic Gaussian with variance $\sigma^2$), where $\sigma$ is a hyper-parameter that controls the strength of the noise injection (and hence of the regularization). We query the gradient of $f$ at $W_i + U_i$ and at $W_i - U_i$, and take the average of the two. This results in a two-point noise injection scheme whose computation cost is the same as that of sharpness-aware minimization (Foret et al., 2020) and twice the cost of running SGD. In practice, we can also implement an extension of this algorithm that samples multiple $U$s. For details, see Algorithm 1.
  • Figure 2: Illustration of the approximation quality of equation \ref{eq_tay}. We report all measurements based on the network weights at the last epoch of fine-tuning. The perturbation gap (i.e., $F(W) - f(W)$ in equation \ref{eq_tay}) and $\frac{\sigma^2}{2} \mathrm{Tr}[\nabla^2 f(W)]$ are of the same order. Recall that $\sigma$ is the standard deviation of the Gaussian noise injected into the weight matrices; it determines the strength of the noise injection and therefore the strength of the regularization on the Hessian trace.
  • Figure 3: Comparison between SGD, WP-SGD, and NSO for fine-tuning ResNet-34 and BERT-Base, respectively, on an image and a text classification dataset. We evaluate the test loss, the trace of the Hessian, and the generalization gap for the trained model at each epoch. For WP-SGD and NSO, we sample noise from an isotropic Gaussian with standard deviation $\sigma=0.01$ in both settings.
  • Figure 4: Results of varying the learning rate and the number of epochs for running our approach and WP-SGD. We report the test loss from the last epoch and average the results over five random seeds.
  • Figure 5: Results of varying the batch size for our approach and SAM, run on two image classification datasets (indoor scene recognition and Aircraft detection). We report the test loss and the trace of the Hessian using the model from the last epoch of training. The results are averaged over five random seeds. The regularization provided by noise injection can be combined with distance-based regularization and data augmentation to reduce the test loss and the Hessian trace.
  • ...and 1 more figure
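
The update step described in the Figure 1 caption and the approximation described in the Figure 2 caption can be sketched on a toy problem. The code below is a minimal NumPy illustration, not the authors' implementation: the function names (`two_point_grad`, `nso_step`, `perturbation_gap`), the quadratic objective, and all hyper-parameter values are assumptions chosen for the example.

```python
import numpy as np

def two_point_grad(grad_f, W, sigma, rng, k=1):
    """Average the gradients at W + U and W - U over k Gaussian draws of U."""
    g = np.zeros_like(W)
    for _ in range(k):
        U = rng.normal(0.0, sigma, size=W.shape)  # isotropic Gaussian noise
        g += 0.5 * (grad_f(W + U) + grad_f(W - U))
    return g / k

def nso_step(grad_f, W, lr, sigma, rng, k=1):
    """One gradient step that uses the two-point noise-injected gradient."""
    return W - lr * two_point_grad(grad_f, W, sigma, rng, k)

def perturbation_gap(f, W, sigma, rng, n_samples=20000):
    """Monte Carlo estimate of E_U[(f(W+U) + f(W-U)) / 2] - f(W)."""
    total = 0.0
    for _ in range(n_samples):
        U = rng.normal(0.0, sigma, size=W.shape)
        total += 0.5 * (f(W + U) + f(W - U)) - f(W)
    return total / n_samples

# Hypothetical toy objective: f(W) = 0.5 * W^T A W, whose Hessian is A.
A = np.diag([1.0, 4.0, 9.0])

def f(W):
    return 0.5 * W @ A @ W

def grad_f(W):
    return A @ W

rng = np.random.default_rng(0)
W0 = np.array([1.0, -2.0, 0.5])
sigma = 0.1

W1 = nso_step(grad_f, W0, lr=0.05, sigma=sigma, rng=rng, k=2)
gap = perturbation_gap(f, W0, sigma, rng)
predicted = 0.5 * sigma**2 * np.trace(A)  # (sigma^2 / 2) * Tr[Hessian]

print("updated weights:", W1)
print("measured gap:", gap, "predicted:", predicted)
```

On a quadratic objective the gradient is linear in $W$, so the $+U$ and $-U$ terms cancel and the two-point step coincides with a plain gradient step. The regularization effect only shows up when the Hessian varies with $W$, as it does for neural networks. The measured gap should land close to $\frac{\sigma^2}{2} \mathrm{Tr}[\nabla^2 f(W)]$, up to Monte Carlo error.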

Theorems & Definitions (35)

  • Theorem 2.1
  • Remark 2.2
  • Remark 3.1: Noise variance scheduling as $k$ increases
  • Proposition 4.2
  • Theorem 4.3
  • Remark 4.4
  • Proposition 5.1
  • Theorem A.1
  • Proposition A.2
  • Claim A.3
  • ...and 25 more