Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay
Leyan Pan, Xinyuan Cao
TL;DR
The paper analyzes Neural Collapse (NC), a geometric arrangement in which last-layer features within a class collapse to a single point and the features of different classes spread out as a simplex equiangular tight frame (ETF), through the lens of last-layer Batch Normalization (BN) and Weight Decay (WD). Working in a layer-peeled model with near-optimal cross-entropy loss, it derives explicit bounds on the intra-class and inter-class cosine similarities that quantify NC proximity, and shows that BN and sufficiently large WD strengthen the NC guarantees. The theory is complemented by experiments on synthetic and real datasets (MNIST, CIFAR-10/100, ImageNet32), which show that BN combined with higher WD values yields stronger NC proximity, especially as training loss decreases and last-layer feature norms shrink. Overall, the work provides a new optimization-agnostic perspective on how BN and WD shape feature geometry, with implications for understanding generalization and the role of normalization in deep networks.
Abstract
Neural Collapse (NC) is a geometric structure recently observed at the terminal phase of training deep neural networks, in which last-layer feature vectors for the same class "collapse" to a single point, while features of different classes become equally separated. We demonstrate that batch normalization (BN) and weight decay (WD) critically influence the emergence of NC. In the near-optimal loss regime, we establish an asymptotic lower bound on the emergence of NC that depends only on the WD value, training loss, and the presence of last-layer BN. Our experiments substantiate the theoretical insights by showing that models demonstrate a stronger presence of NC with BN, appropriate WD values, lower loss, and lower last-layer feature norm. Our findings offer a novel perspective in studying the role of BN and WD in shaping neural network features.
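To make the two quantities concrete, the following is a minimal sketch of how average intra-class and inter-class cosine similarity could be measured on a set of last-layer features. It is an illustration under stated assumptions, not the authors' code: the function names `cosine`, `nc_cosine_metrics`, and `simplex_etf` are hypothetical, the features are used as raw vectors (this section does not say whether they are centered first), and the toy data is an idealized, perfectly collapsed configuration rather than output from a trained network.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nc_cosine_metrics(features, labels):
    """Average cosine similarity over same-class pairs and over
    different-class pairs of feature vectors (hypothetical helper)."""
    intra, inter = [], []
    for i, j in combinations(range(len(features)), 2):
        c = cosine(features[i], features[j])
        if labels[i] == labels[j]:
            intra.append(c)
        else:
            inter.append(c)
    return sum(intra) / len(intra), sum(inter) / len(inter)


def simplex_etf(num_classes):
    """Rows of I - (1/C) * 11^T: C vectors whose pairwise cosine
    similarity is -1/(C-1), i.e. a simplex ETF up to scaling."""
    C = num_classes
    return [[(1.0 if i == j else 0.0) - 1.0 / C for j in range(C)]
            for i in range(C)]


# Idealized collapse: every sample equals its class vector exactly.
C, n_per_class = 4, 3
class_vectors = simplex_etf(C)
features = [list(class_vectors[k]) for k in range(C) for _ in range(n_per_class)]
labels = [k for k in range(C) for _ in range(n_per_class)]

intra, inter = nc_cosine_metrics(features, labels)
print(f"intra-class cosine: {intra:.4f}")
print(f"inter-class cosine: {inter:.4f}  (simplex ETF value: {-1 / (C - 1):.4f})")
```

In this idealized case the intra-class value is 1 and the inter-class value is -1/(C-1), the pairwise cosine of a simplex ETF with C classes. On real features, the distance of the two averages from these values gives one way to read "NC proximity".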
