Sharp Minima Can Generalize: A Loss Landscape Perspective On Data
Raymond Fan, Bryce Sandlund, Lin Myat Ko
TL;DR
This paper investigates how dataset size shapes generalization in deep learning by measuring the volumes of basins in the loss landscape. It challenges the notion that flatness alone explains generalization: minima that generalize well can be sharp, small-volume basins in the loss landscape of a small dataset, and adding data reshapes the landscape so that these minima become relatively large, even though overall volumes often shrink with more data. The authors extend Monte Carlo basin-volume estimation to minima obtained from larger datasets, report a power-law relationship between volume and dataset size, and show that poisoned data reduce volumes. They also connect volume dynamics to phenomena such as sharpness-aware minimization and grokking, arguing for a data-centric view in which the loss landscape changes as data are added, and with it the minima that gradient descent is likely to find. The findings imply that strong generalization may require algorithms that actively seek out sharp but simple minima revealed by larger datasets, rather than pursuing flatness alone.
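To make the power-law claim concrete, here is a minimal sketch of how such a relationship could be checked, assuming it takes the form V ∝ N^k for basin volume V and dataset size N. The function name `fit_power_law` and the numbers are hypothetical and synthetic. They are not results from the paper, and the authors' fitting procedure may differ. The fit works on log-volumes because raw volumes in high-dimensional parameter spaces would overflow or underflow floating point.

```python
import numpy as np

def fit_power_law(dataset_sizes, log_volumes):
    """Least-squares fit of log V = log c + k * log N.

    Returns (k, log_c), where k is the power-law exponent.
    """
    log_n = np.log(np.asarray(dataset_sizes, dtype=float))
    log_v = np.asarray(log_volumes, dtype=float)
    # A power law is a straight line in log-log space, so a degree-1 fit suffices.
    k, log_c = np.polyfit(log_n, log_v, deg=1)
    return k, log_c

# Synthetic, made-up data for illustration only (assumed exponent of -2).
sizes = np.array([100, 300, 1000, 3000, 10000])
log_vols = 5.0 - 2.0 * np.log(sizes)

k_hat, log_c_hat = fit_power_law(sizes, log_vols)
print(f"estimated exponent k = {k_hat:.3f}")
```

A negative fitted exponent would correspond to volumes shrinking as data are added. With real measurements, the scatter of the points around the fitted line indicates how well a power law describes them.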
Abstract
The volume hypothesis suggests that deep learning is effective because it is likely to find flat minima, owing to their large volumes, and that flat minima generalize well. This picture does not explain the role of large datasets in generalization. Measuring minima volumes under varying amounts of training data reveals that sharp minima which generalize well do exist, but are unlikely to be found because of their small volumes. Increasing the amount of data changes the loss landscape such that previously small generalizing minima become (relatively) large.
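As an illustration of what measuring a minimum's volume can mean, the following is a minimal Monte Carlo sketch on a toy quadratic loss. It treats the basin as the star-shaped region around the minimum where the loss stays below a threshold, samples random unit directions, finds the radius at which each ray crosses the threshold, and averages r^n in log space. This is a generic estimator under stated assumptions, not necessarily the authors' procedure. The helper names, the threshold, and the toy Hessians are all hypothetical.

```python
import math
import numpy as np

def radius_along(loss_fn, center, direction, threshold, r_max=1e3, iters=60):
    """Bisect for the largest r with loss(center + r * direction) <= threshold.

    Assumes the loss is non-decreasing along the ray (true for the convex toy loss).
    """
    lo, hi = 0.0, r_max
    if loss_fn(center + hi * direction) <= threshold:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if loss_fn(center + mid * direction) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

def log_basin_volume(loss_fn, center, threshold, n_samples=2000, seed=0):
    """Estimate log-volume of {theta : loss(theta) <= threshold} around center."""
    rng = np.random.default_rng(seed)
    n = center.size
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    dirs = rng.standard_normal((n_samples, n))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    log_rn = np.array([
        n * math.log(max(radius_along(loss_fn, center, u, threshold), 1e-300))
        for u in dirs
    ])
    # Volume = vol(unit ball) * E[r^n]; compute the mean stably in log space.
    m = log_rn.max()
    log_mean_rn = m + math.log(np.mean(np.exp(log_rn - m)))
    log_unit_ball = (n / 2) * math.log(math.pi) - math.lgamma(n / 2 + 1)
    return log_unit_ball + log_mean_rn

def quadratic_loss(hessian_diag):
    """Toy loss 0.5 * sum(h_i * theta_i^2) with its minimum at the origin."""
    return lambda theta: 0.5 * float(np.sum(hessian_diag * theta**2))

center = np.zeros(2)
flat = log_basin_volume(quadratic_loss(np.array([1.0, 4.0])), center, threshold=0.5)
sharp = log_basin_volume(quadratic_loss(np.array([10.0, 40.0])), center, threshold=0.5)
print(f"flat log-volume: {flat:.3f}, sharp log-volume: {sharp:.3f}")
```

For the flatter toy basin the region is an ellipse with semi-axes 1 and 0.5, so the estimate can be checked against the exact log-area, log(π/2). The sharper basin has higher curvature in every direction and therefore a smaller volume at the same threshold.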
