Sharp Minima Can Generalize: A Loss Landscape Perspective On Data

Raymond Fan, Bryce Sandlund, Lin Myat Ko

TL;DR

This paper investigates how dataset size shapes generalization in deep learning by measuring basin volumes in the loss landscape. It challenges the notion that flat minima alone explain generalization: minima that generalize well already exist in small-dataset landscapes, but they are sharp, with volumes too small for training to find them. Adding data reshapes the landscape so that these minima become relatively large, even though absolute volumes often shrink with more data. The authors extend Monte Carlo basin-volume estimation to minima obtained from larger datasets, report a power-law relationship between volume and dataset size, and show that poisoned data reduce volumes. They also connect volume dynamics to phenomena such as sharpness-aware minimization and grokking, arguing for a data-centric view in which the loss landscape changes as data are added, which in turn changes the minima gradient descent is likely to find. The findings imply that strong generalization may require algorithms that actively seek out the sharp but simple minima revealed by larger datasets, rather than pursuing flatness alone.

Abstract

The volume hypothesis suggests deep learning is effective because it is likely to find flat minima due to their large volumes, and flat minima generalize well. This picture does not explain the role of large datasets in generalization. Measuring minima volumes under varying amounts of training data reveals sharp minima which generalize well exist, but are unlikely to be found due to their small volumes. Increasing data changes the loss landscape, such that previously small generalizing minima become (relatively) large.

Paper Structure

This paper contains 35 sections, 15 equations, 31 figures, 2 tables.

Figures (31)

  • Figure 1: Top: Models trained on larger fractions of MNIST achieve higher test accuracy. Bottom: 2D slices of the training loss landscape, containing models obtained by training on A: 100% of MNIST (60,000 examples), B: 10% of MNIST, and C: 1% of MNIST. Only model A is a viable minimum on the landscape from all training data (right). Yet in the 1% training data landscape (left, the same data we train C on), all three models appear to be viable minima. Training on this landscape, however, never yields minima like A or B. The volume hypothesis suggests this is because the volume associated with model C's minimum is much larger than that of A or B.
  • Figure 2: Left: Cartoon of a scenario where the minimum found on a dataset is larger than all other minima. Previous experiments observed this case when comparing to minima from a poisoned dataset (Huang et al., 2020; Scherlis et al., 2025). Right: The minimum found at a given dataset size (e.g., green dot) is larger than minima found at other sizes (blue and yellow curves). Note that the hypothesis does not predict how absolute minima volume scales with dataset size. If there are simple scaling relations like the lines shown here, one could imagine special algorithms that target well-generalizing minima via their scaling properties.
  • Figure 3: Left: Illustration of how generalization can be described by volume dynamics even if the minimum found is smaller than other candidate minima. The found minimum belongs to a popular class of minima with similar test loss, such that the volume of the class is the largest overall. Right: Minima found at a given dataset size (e.g., green dot) are smaller than minima obtained at larger dataset sizes (blue and yellow curves). The volume-data scaling shown for individual minima is inspired by our experimental results.
  • Figure 4: Left: Monte Carlo basin volume estimation measures distances to the basin boundary along random directions, yielding a star-convex estimate (red) that underestimates the true volume (blue). Right: Random directions can align with flat axes in a small basin, making it appear larger than other minima. This issue depends on the shape of the minima and is exacerbated by high dimensionality.
  • Figure 5: Multiplying one layer by a factor $\alpha$ and the following layer by $\beta = 1/\alpha$ leaves a neural network unchanged. This scale invariance is problematic for flatness measures, as they may not return the same values for the minima on the left as on the right, despite the models being identical. Note this invariance also implies minima volumes are infinite. Interestingly, for the two plots above, the star-convex basin volume is actually identical, which suggests it may be more robust than other flatness measures. (See Appendix \ref{app:analytical_basin_volume}.)
  • ...and 26 more figures
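
The star-convex Monte Carlo estimate described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the basin is taken to be the set of parameters with loss at or below a hypothetical `threshold`, the boundary along each random direction is located by bisection, and the volume is the unit-ball volume times the mean of $r^n$ over directions. All function names and the toy quadratic loss are made up for this example.

```python
import math
import numpy as np

def log_unit_ball_volume(n):
    # log of pi^(n/2) / Gamma(n/2 + 1), the volume of the unit n-ball
    return 0.5 * n * math.log(math.pi) - math.lgamma(0.5 * n + 1.0)

def boundary_radius(loss_fn, center, direction, threshold, r_max=1e6, iters=40):
    # Assumes loss_fn(center) <= threshold. Doubles the radius until the loss
    # exceeds the threshold, then bisects to locate a crossing on that ray.
    lo, hi = 0.0, 1.0
    while loss_fn(center + hi * direction) <= threshold:
        lo, hi = hi, 2.0 * hi
        if hi > r_max:
            return r_max  # treat the ray as unbounded beyond r_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if loss_fn(center + mid * direction) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

def log_star_convex_volume(loss_fn, center, threshold, num_dirs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = center.size
    # Normalized Gaussian samples are uniform on the unit sphere
    dirs = rng.standard_normal((num_dirs, n))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = [boundary_radius(loss_fn, center, u, threshold) for u in dirs]
    # Volume = V_n * mean(r^n); evaluate the mean in log space for stability
    a = n * np.log(radii)
    log_mean = a.max() + np.log(np.mean(np.exp(a - a.max())))
    return log_unit_ball_volume(n) + log_mean

# Toy check on a quadratic loss 0.5 * sum(h * x^2), whose basin at the given
# threshold is an ellipsoid with semi-axes sqrt(2 * threshold / h).
curv = np.array([2.0, 0.5])
loss = lambda x: 0.5 * np.sum(curv * x ** 2)
est = log_star_convex_volume(loss, np.zeros(2), threshold=1.0, num_dirs=4000)
exact = log_unit_ball_volume(2) + 0.5 * np.sum(np.log(2.0 / curv))
print(f"estimated log-volume {est:.4f}, exact log-volume {exact:.4f}")
```

Because the mean of $r^n$ is dominated by the largest sampled radii when $n$ is large, a few directions that happen to align with flat axes can determine the whole estimate, which is one way to read the high-dimensional caveat in the caption.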
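
The scale invariance in the Figure 5 caption can be checked numerically. The sketch below assumes a bias-free two-layer ReLU network and $\alpha > 0$ (ReLU is positively homogeneous, so the rescaling cancels); the architecture, shapes, and names are illustrative choices, not taken from the paper. It also shows why a naive sharpness probe is not invariant: the same small perturbation of the second layer changes the output $\alpha$ times more in the rescaled network.

```python
import numpy as np

def two_layer_relu(x, w1, w2):
    # Bias-free network: f(x) = w2 @ relu(w1 @ x)
    return w2 @ np.maximum(w1 @ x, 0.0)

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((3, 8))
x = rng.standard_normal(4)
alpha = 5.0

# Rescale the first layer by alpha and the second by 1/alpha
out = two_layer_relu(x, w1, w2)
out_rescaled = two_layer_relu(x, alpha * w1, w2 / alpha)

# Naive sharpness probe: output change under the same perturbation of w2
delta = 1e-3 * rng.standard_normal(w2.shape)
change = np.linalg.norm(two_layer_relu(x, w1, w2 + delta) - out)
change_rescaled = np.linalg.norm(
    two_layer_relu(x, alpha * w1, w2 / alpha + delta) - out_rescaled
)

print("same function:", np.allclose(out, out_rescaled))
print("sensitivity ratio:", change_rescaled / change)
```

The two parameter settings compute the same function, yet the probe reports different sensitivities, which is the sense in which such flatness measures disagree on identical models.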