Flatness After All?

Neta Shoham, Liron Mor-Yosef, Haim Avron

TL;DR

This work tackles the generalization–flatness paradox in deep networks by introducing a soft-rank measure of Hessian flatness, defined as $\operatorname{rank}_\lambda(\mathbf{H}(\boldsymbol{\theta}^*)) = \operatorname{Tr}(\mathbf{H}(\boldsymbol{\theta}^*)(\mathbf{H}(\boldsymbol{\theta}^*) + \lambda \mathbf{I})^{-1})$, and showing that it captures the asymptotic expected generalization gap for calibrated exponential-family neural networks. For non-calibrated models, it connects the measure to the Takeuchi Information Criterion (TIC), showing that the generalization gap decomposes into a TIC-like term and the soft rank, and that the gap can be estimated robustly from training data when the model is not overly confident. The paper provides theoretical results linking calibration, Tikhonov regularization, and information matrices, and shows that the soft rank of the Fisher Information Matrix (FIM) and related concave trace penalties outperform traditional Hessian norms as predictors of generalization. It also develops efficient estimation techniques (e.g., using KFAC or diagonal FIM approximations) and validates them on MNIST, CIFAR-10, and SVHN, highlighting the practical impact for model selection and transfer learning. Overall, the soft-rank framework offers a principled, scalable, and calibration-aware approach to quantifying and predicting generalization in modern neural networks.
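To make the soft-rank definition concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names `soft_rank` and `soft_rank_diag`, the synthetic low-rank matrix, and the value of $\lambda$ are all assumptions chosen for illustration. For a symmetric positive semidefinite matrix with eigenvalues $\sigma_i$, the trace formula equals $\sum_i \sigma_i / (\sigma_i + \lambda)$, which is what the sketch computes. It also evaluates the same formula on the diagonal alone, as a stand-in for a diagonal FIM approximation.

```python
import numpy as np


def soft_rank(H, lam):
    """Soft rank Tr(H (H + lam*I)^{-1}) of a symmetric PSD matrix H.

    Computed from eigenvalues as sum_i s_i / (s_i + lam), assuming lam > 0.
    """
    eigvals = np.linalg.eigvalsh(H)
    # Clip tiny negative eigenvalues caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    return float(np.sum(eigvals / (eigvals + lam)))


def soft_rank_diag(H, lam):
    """Same formula applied to the diagonal of H only (illustrative
    stand-in for a diagonal approximation; hypothetical helper)."""
    d = np.clip(np.diag(H), 0.0, None)
    return float(np.sum(d / (d + lam)))


# Synthetic example (assumption): a 50x50 PSD matrix of rank 5.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
H = A @ A.T
lam = 1e-3

exact = soft_rank(H, lam)
# Direct evaluation of the trace formula, as a cross-check.
via_trace = float(np.trace(H @ np.linalg.inv(H + lam * np.eye(50))))
diag_approx = soft_rank_diag(H, lam)

print("soft rank (eigenvalues):", exact)
print("soft rank (trace form): ", via_trace)
print("diagonal approximation: ", diag_approx)
```

In this toy setting the soft rank stays below the true rank of 5 while the diagonal version is larger. For PSD matrices that ordering always holds, because $x \mapsto x/(x+\lambda)$ is concave and the diagonal is majorized by the eigenvalues. This matches the summary's remark that the diagonal approximation overestimates.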

Abstract

Recent literature on generalization in deep learning has examined the relationship between the curvature of the loss function at minima and generalization, mainly in the context of overparameterized neural networks. A key observation is that "flat" minima tend to generalize better than "sharp" minima. While this idea is supported by empirical evidence, it has also been shown that deep networks can generalize even with arbitrary sharpness, as measured by either the trace or the spectral norm of the Hessian. In this paper, we argue that generalization could be assessed by measuring flatness using a soft rank measure of the Hessian. We show that when an exponential family neural network model is exactly calibrated, and its prediction error and its confidence on the prediction are not correlated with the first and the second derivative of the network's output, our measure accurately captures the asymptotic expected generalization gap. For non-calibrated models, we connect a soft rank based flatness measure to the well-known Takeuchi Information Criterion and show that it still provides reliable estimates of generalization gaps for models that are not overly confident. Experimental results indicate that our approach offers a robust estimate of the generalization gap compared to baselines.

Paper Structure

This paper contains 26 sections, 13 theorems, 103 equations, 8 figures.

Key Result

Proposition 2.1

If $\mathbf{\boldsymbol{\theta}}^\star$ is a local minimizer,

Figures (8)

  • Figure 1: Gap estimation with $\mathbf{C}(\hat{\mathbf{\boldsymbol{\theta}}})$ and $\mathbf{F}(\hat{\mathbf{\boldsymbol{\theta}}})$ computed on the test data. The leftmost plot shows the regularized TIC; the middle plot shows the gap estimator based on Thomas et al. (2020); and the rightmost plot shows our approach.
  • Figure 2: Predicting generalization gap factors from training data. Left: Gap versus $\frac{\operatorname{Tr}(\mathbf{C})}{\operatorname{Tr}(\mathbf{F})}$ shows poor correlation due to overfitting; Right: Gap versus $\operatorname{rank}_\mathbf{\Lambda}(\mathbf{F})$ demonstrates strong correlation despite overfitting (orange: training, blue: test).
  • Figure 3: Cheap generalization gap estimation. Left: the trace ratio suggested by Thomas et al. (2020) severely underestimates the gap because it lacks the rank factor. Middle: substituting a diagonal approximation of $\mathbf{F}(\hat{\mathbf{\boldsymbol{\theta}}})$ into the soft rank-based gap estimate clearly overestimates the gap, as the theory predicts. Right: the soft rank-based metric with a KFAC approximation of $\mathbf{F}(\hat{\mathbf{\boldsymbol{\theta}}})$ overestimates the gap by a smaller margin.
  • Figure 4: Kendall $\tau$ correlations between training-set-complexity measures and the generalization gap
  • Figure 5: (Left) The ratio is measured on the test set. We see the relationship between model calibration (x‐axis: better calibration toward the left) and performance metrics, showing improved performance with better calibration. (Right) The ratio is measured on the training set. We see the impact of overfitting (x‐axis: increased overfitting toward the left) on performance, highlighting degradation, especially for the soft-rank metric.
  • ...and 3 more figures

Theorems & Definitions (30)

  • Proposition 2.1
  • proof
  • Definition 2.2: Calibrated Neural Network Model
  • Proposition 2.3
  • proof
  • Proposition 4.1
  • proof
  • Proposition 4.2
  • proof
  • Proposition 4.3
  • ...and 20 more