Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
Ting Han, Linara Adilova, Henning Petzka, Jens Kleesiek, Michael Kamp
TL;DR
The paper investigates whether neural collapse (NC) or loss landscape flatness causally underpins generalization. Using grokking to temporally separate memorization from generalization, it shows that NC can emerge without being necessary for generalization, while relative flatness consistently aligns with the onset of generalization. The authors also demonstrate that regularizing models away from flat solutions delays generalization (grokking-like behavior) across diverse architectures and tasks, whereas NC can be suppressed without harming generalization. Theoretically, NC implies relative flatness under classical assumptions, which ties the two phenomena together, while representativeness of learned features remains essential for generalization. Overall, the work positions relative flatness as a more fundamental property for generalization than NC and suggests grokking as a powerful probe into the geometry of learning.
Abstract
Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet, the causal role of either phenomenon remains unclear: Are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics. We find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization, resembling grokking, even in architectures and datasets where it does not typically occur. Furthermore, we show theoretically that neural collapse leads to relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.
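To make the notion of class-wise clustered representations concrete, the following is a minimal sketch of one way to quantify within-class variability in a feature matrix. It is an illustrative proxy, not the metric used in the paper: the function name `class_variability_ratio`, the trace-based ratio, and the synthetic data are all assumptions introduced here.

```python
import numpy as np


def class_variability_ratio(features, labels):
    """Ratio of within-class to between-class scatter (trace form).

    Values near 0 mean each class is tightly clustered around its mean,
    which is one signature of neural collapse. Hypothetical helper:
    a simplified proxy, not the paper's measurement. Assumes at least
    two classes with distinct means (otherwise the denominator is 0).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)

    within = 0.0
    between = 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        # Scatter of samples around their own class mean.
        within += ((class_feats - class_mean) ** 2).sum()
        # Scatter of the class mean around the global mean, weighted by class size.
        between += len(class_feats) * ((class_mean - global_mean) ** 2).sum()
    return within / between


# Synthetic stand-in for penultimate-layer features (assumed data, 3 classes).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
class_means = np.eye(3)[labels]
loose = class_means + 0.5 * rng.standard_normal(class_means.shape)
tight = class_means + 0.01 * rng.standard_normal(class_means.shape)

print("loose clusters:", class_variability_ratio(loose, labels))
print("tight clusters:", class_variability_ratio(tight, labels))
```

Tracking a quantity like this over training epochs, alongside test accuracy, is one way to check whether clustering appears before, with, or after generalization. A flatness measure would need to be tracked separately and is not sketched here.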
