Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization

Qiaozhe Zhang, Jun Sun, Ruijie Zhang, Yingzhuang Liu

TL;DR

This work introduces Rényi sharpness, defined as the negative Rényi entropy $-H_{\alpha}(\mathbf{H})$ of the normalized Hessian spectrum, to capture spectrum spread and its relation to generalization. It establishes reparameterization-invariant bounds that connect population risk to Rényi sharpness, and provides a practical estimator using stochastic Lanczos quadrature for large Hessians. The authors demonstrate a strong Kendall correlation between Rényi sharpness and generalization across multiple architectures and datasets, outperforming traditional sharpness metrics. They further propose RSAM, a computationally efficient regularizer that encourages lower Rényi sharpness during training and achieves improvements over SAM-based methods on several benchmarks. Overall, the paper combines theory and applied methodology to link Hessian spectral structure with generalization, offering a scalable path to improved training through Rényi sharpness regularization.
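The summary mentions stochastic Lanczos quadrature (SLQ) as the estimator for large Hessians. The following is a minimal sketch of generic SLQ, not the authors' implementation. It assumes a positive semidefinite Hessian available only through a matrix-vector product, an order $\alpha \neq 1$, and normalization of the spectrum by its trace. The names `lanczos`, `slq_trace`, and `renyi_sharpness_slq` are hypothetical.

```python
import numpy as np

def lanczos(matvec, v0, num_steps):
    """Lanczos with full reorthogonalization; returns the tridiagonal matrix T."""
    basis = [v0 / np.linalg.norm(v0)]
    alphas, betas = [], []
    for k in range(num_steps):
        w = matvec(basis[-1])
        alphas.append(np.dot(basis[-1], w))
        # Orthogonalize against every previous basis vector (fine for small demos).
        for u in basis:
            w = w - np.dot(u, w) * u
        b = np.linalg.norm(w)
        if b < 1e-10 or k == num_steps - 1:
            break  # Krylov space exhausted or step budget reached
        betas.append(b)
        basis.append(w / b)
    return np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)

def slq_trace(matvec, dim, f, num_probes=8, num_steps=20, seed=0):
    """Estimate trace(f(H)) by averaging Gauss quadrature over Rademacher probes."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        theta, Y = np.linalg.eigh(lanczos(matvec, v, num_steps))
        weights = Y[0, :] ** 2  # squared first components of T's eigenvectors
        total += dim * np.sum(weights * f(theta))  # ||v||^2 equals dim
    return total / num_probes

def renyi_sharpness_slq(matvec, dim, alpha=2.0, **kw):
    """Negative Renyi entropy of the trace-normalized spectrum (alpha != 1)."""
    tr_h = slq_trace(matvec, dim, lambda t: t, **kw)
    tr_h_alpha = slq_trace(matvec, dim, lambda t: np.clip(t, 0.0, None) ** alpha, **kw)
    # sum_i p_i^alpha = trace(H^alpha) / trace(H)^alpha
    entropy = (np.log(tr_h_alpha) - alpha * np.log(tr_h)) / (1.0 - alpha)
    return -entropy

# Toy diagonal "Hessian" (illustrative values only), accessed through a matvec.
d = np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.25])
estimate = renyi_sharpness_slq(lambda v: d * v, dim=d.size, alpha=2.0,
                               num_probes=4, num_steps=d.size)
p = d / d.sum()
exact = np.log(np.sum(p ** 2))  # negative order-2 Renyi entropy
print(estimate, exact)
```

For a diagonal matrix the Rademacher probes introduce no variance, so the estimate can be compared directly with the exact value. For a real network, `matvec` would be a Hessian-vector product from automatic differentiation, and the estimate would be stochastic.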

Abstract

Sharpness (of the loss minima) is a common measure for investigating the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better the generalization might be. Unfortunately, the correlation between many existing sharpness measures and generalization is usually not strong, and sometimes even weak. To close the gap between intuition and reality, we propose a novel sharpness measure, i.e., Rényi sharpness, which is defined as the negative Rényi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we observe that uniform (identical) eigenvalues of the loss Hessian are most desirable (while keeping the sum constant) for achieving good generalization; 2) we employ the Rényi entropy to concisely characterize the extent of the spread of the eigenvalues of the loss Hessian. Normally, the larger the spread, the smaller the (Rényi) entropy. To rigorously establish the relationship between generalization and (Rényi) sharpness, we provide several generalization bounds in terms of Rényi sharpness, by taking advantage of the reparametrization invariance property of Rényi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (specifically, the Kendall rank correlation) between Rényi sharpness and generalization. Moreover, we propose to use a variant of Rényi sharpness as a regularizer during training, i.e., Rényi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worth noting that the test accuracy gain of our proposed RSAM method could be as high as nearly 2.5%, compared against the classical SAM method.
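To make the two main ideas concrete, here is a small sketch that computes the negative Rényi entropy of an explicitly given spectrum. It assumes the standard Rényi entropy $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$ applied to eigenvalue magnitudes normalized to sum to one; the paper's exact normalization may differ. The function name `renyi_sharpness` and the toy spectra are illustrative only.

```python
import numpy as np

def renyi_sharpness(eigenvalues, alpha=2.0):
    """Negative Renyi entropy of order alpha (> 0) of a normalized spectrum.

    Assumption: the spectrum is normalized by taking absolute values and
    dividing by their sum.
    """
    lam = np.abs(np.asarray(eigenvalues, dtype=float))
    p = lam / lam.sum()
    if np.isclose(alpha, 1.0):
        nz = p[p > 0]
        entropy = -np.sum(nz * np.log(nz))  # Shannon entropy as the alpha -> 1 limit
    else:
        entropy = np.log(np.sum(p ** alpha)) / (1.0 - alpha)
    return -entropy

uniform = np.ones(4)                         # identical eigenvalues
spread = np.array([10.0, 1.0, 0.1, 0.01])    # widely spread eigenvalues

print(renyi_sharpness(uniform))  # equals -log(4), the minimum for 4 eigenvalues
print(renyi_sharpness(spread))   # larger value: more spread, lower entropy
```

Because the spectrum is normalized, rescaling all eigenvalues by a common positive factor leaves the value unchanged, so the measure reflects only how evenly the eigenvalues are distributed.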

Paper Structure

This paper contains 47 sections, 20 theorems, 78 equations, 14 figures, 6 tables, 2 algorithms.

Key Result

Proposition 2.2

Consider an $L$-layer feedforward neural network with positively homogeneous activation function $\sigma$ (i.e., $\sigma(c \mathbf{x}) = c \sigma(\mathbf{x})$ for all $c > 0$), and parameters $\{\mathbf{W}_1, \ldots, \mathbf{W}_L\}$. Let the network output be $f(\mathbf{x}) = \mathbf{W}_L \cdot \sigm

Figures (14)

  • Figure 1: Hessian spectra [a,b,c]. Two zero-dominant profiles are observed: (a) multi-cluster and (b,c) uniform. Optimal $\alpha$ vs. Hessian spectral type [d]. Statistics summarizing whether the empirically optimal $\alpha$ matches the predicted choice under each Hessian spectral type.
  • Figure 2: ResNet18 on CIFAR10. The subplots from layer 1 through all layers correspond to the Rényi sharpness measure. Rényi sharpness is more strongly correlated with generalization than the other measures.
  • Figure 3: Kendall correlations on various tasks. Signed coefficients are mapped to 0–1 (blue = positive, green = negative). Rényi sharpness shows a stronger correlation with generalization than the other sharpness measures.
  • Figure 4: Spectrum of ResNet18 on CIFAR10.
  • Figure 5: Spectrum of ResNet34 on CIFAR10.
  • ...and 9 more figures
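
The Kendall correlations reported in the figures are rank correlations between a sharpness measure and generalization. As a minimal sketch of how such a coefficient can be computed, the function below implements Kendall's tau-a without tie correction. The input values are hypothetical toy numbers, not results from the paper, and the paper may use a different variant of the coefficient.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # the pair is ordered the same way in x and y
        elif sign < 0:
            discordant += 1   # the pair is ordered oppositely in x and y
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical toy values for five trained models.
sharpness = [0.10, 0.25, 0.30, 0.55, 0.70]
generalization_gap = [0.02, 0.05, 0.04, 0.09, 0.12]

tau = kendall_tau(sharpness, generalization_gap)
print(tau)  # 9 concordant pairs and 1 discordant pair out of 10 give 0.8
```

A value near 1 means the sharpness measure ranks the models almost exactly as the generalization gap does, which is the sense in which a measure is said to correlate strongly with generalization.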

Theorems & Definitions (22)

  • Definition 2.1: Rényi Sharpness
  • Proposition 2.2: Reparameterization Invariance of Rényi Sharpness
  • Proposition 3.1: informally
  • Theorem 3.2: informally
  • Theorem 3.3: informally
  • Theorem B.1
  • Theorem B.2: Hoeffding's inequality
  • Corollary B.3
  • Proposition C.1
  • Theorem C.2
  • ...and 12 more