Flatness After All?

Neta Shoham; Liron Mor-Yosef; Haim Avron

TL;DR

This work tackles the generalization–flatness paradox in deep networks by introducing a soft-rank measure of Hessian flatness, defined as $\operatorname{rank}_\lambda(\mathbf{H}(\boldsymbol{\theta}^*)) = \operatorname{Tr}(\mathbf{H}(\boldsymbol{\theta}^*)(\mathbf{H}(\boldsymbol{\theta}^*) + \lambda \mathbf{I})^{-1})$, and showing that it precisely captures asymptotic generalization gaps for calibrated exponential-family neural networks. It further connects non-calibrated models to the Takeuchi Information Criterion (TIC), demonstrating that the generalization gap can be decomposed into a TIC-like term and the soft rank, and that it can be robustly estimated from training data when the model is not overly confident. The paper provides theoretical results linking calibration, Tikhonov regularization, and information matrices, and shows that the soft rank of the Fisher Information Matrix (FIM) and related concave trace penalties outperform traditional Hessian norms as predictors of generalization. It also develops efficient estimation techniques (e.g., using KFAC or diagonal FIM approximations) and validates them on MNIST, CIFAR-10, and SVHN, highlighting the practical impact for model selection and transfer learning. Overall, the soft-rank framework offers a principled, scalable, and calibration-aware approach to quantifying and predicting generalization in modern neural networks.

Abstract

Recent literature on generalization in deep learning has examined the relationship between the curvature of the loss function at minima and generalization, mainly in the context of overparameterized neural networks. A key observation is that "flat" minima tend to generalize better than "sharp" minima. While this idea is supported by empirical evidence, it has also been shown that deep networks can generalize even with arbitrary sharpness, as measured by either the trace or the spectral norm of the Hessian. In this paper, we argue that generalization could be assessed by measuring flatness using a soft rank measure of the Hessian. We show that when an exponential family neural network model is exactly calibrated, and its prediction error and its confidence in the prediction are not correlated with the first and second derivatives of the network's output, our measure accurately captures the asymptotic expected generalization gap. For non-calibrated models, we connect a soft-rank-based flatness measure to the well-known Takeuchi Information Criterion and show that it still provides reliable estimates of generalization gaps for models that are not overly confident. Experimental results indicate that our approach offers a robust estimate of the generalization gap compared to baselines.
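
To make the soft rank from the TL;DR concrete, the sketch below evaluates $\operatorname{Tr}(\mathbf{H}(\mathbf{H} + \lambda \mathbf{I})^{-1})$ through the eigenvalues of $\mathbf{H}$, using the standard identity $\sum_i \sigma_i / (\sigma_i + \lambda)$ for a symmetric positive semidefinite matrix. It is an illustration only, not the paper's estimator: the function name `soft_rank`, the example spectrum, and the choice $\lambda = 1$ are assumptions made here, and the positive semidefinite requirement is an assumption that holds for a Fisher Information Matrix but not for every Hessian.

```python
from typing import Sequence


def soft_rank(eigenvalues: Sequence[float], lam: float) -> float:
    """Soft rank Tr(H (H + lam*I)^{-1}) of a symmetric PSD matrix H.

    Computed from the eigenvalues s_i of H as sum_i s_i / (s_i + lam).
    For a diagonal approximation of H, the eigenvalues are simply the
    diagonal entries.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if any(s < 0 for s in eigenvalues):
        raise ValueError("eigenvalues must be non-negative (PSD assumption)")
    return sum(s / (s + lam) for s in eigenvalues)


# Hypothetical spectrum: two sharp directions, two nearly flat, one exactly flat.
spectrum = [100.0, 10.0, 0.01, 0.001, 0.0]
lam = 1.0  # hypothetical regularization strength

print("trace     :", sum(spectrum))            # dominated by the largest eigenvalue
print("soft rank :", soft_rank(spectrum, lam)) # roughly counts directions with s_i >> lam
```

With these numbers the trace is about 110, while the soft rank is about 1.91: each eigenvalue contributes at most 1, so the measure behaves like a smooth count of directions whose curvature exceeds $\lambda$ instead of growing with the magnitude of the sharpest direction.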
