Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks
Ke Chen, Chugang Yi, Haizhao Yang
TL;DR
This work analyzes why SGD with Weight Decay generalizes well by revealing a WD-induced low-rank bias in two-layer ReLU networks. It proves that, under mild training conditions, the coefficient matrix $V$ becomes close to a rank-$2$ form, with even tighter rank control under small batch gradients. By leveraging this bias, the authors derive improved generalization bounds, replacing the previous $\text{O}\left(\sqrt{\frac{mn \ln m \ln N}{N}}\right)$ rate with a tighter $\text{O}\left(\sqrt{\frac{(m+n) \ln m \ln N}{N}}\right)$ for rank-$2$ networks. Empirical results on California Housing and MNIST corroborate that WD drives $V$ toward low rank and can enhance generalization, supporting a theoretical mechanism for SGD's strong empirical performance.
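To get a feel for how much the rank-$2$ rate tightens the bound, the two rates can be compared numerically. The sketch below drops all hidden constants and assumes that $m$ and $n$ are the dimensions of $V$ and that $N$ is the number of training samples; the summary above does not define these symbols, so the interpretation and the concrete values are illustrative only. The helper names `bound_rate` and `improvement_factor` are hypothetical.

```python
import math


def bound_rate(size_term, m, N):
    """Rate sqrt(size_term * ln(m) * ln(N) / N), with constants ignored."""
    return math.sqrt(size_term * math.log(m) * math.log(N) / N)


def improvement_factor(m, n):
    """Ratio of the m*n rate to the (m+n) rate; the log factors cancel."""
    return math.sqrt(m * n / (m + n))


# Hypothetical sizes chosen only for illustration.
m, n, N = 512, 784, 60000

full_rank_rate = bound_rate(m * n, m, N)   # previous rate, size term m*n
low_rank_rate = bound_rate(m + n, m, N)    # rank-2 rate, size term m+n

print("previous rate:", full_rank_rate)
print("rank-2 rate:  ", low_rank_rate)
print("ratio:        ", full_rank_rate / low_rank_rate)
```

Because the logarithmic factors are shared, the ratio of the two rates reduces to $\sqrt{mn/(m+n)}$, so the gain grows as both dimensions grow.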
Abstract
We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is approximately a rank-two matrix. Empirically, we demonstrate that WD is a necessary condition for inducing this low-rank bias across both regression and classification tasks. Our work differs from previous studies as our theoretical analysis does not rely on common assumptions regarding the training data distribution, optimality of weight matrices, or specific training procedures. Furthermore, by leveraging the low-rank bias, we derive improved generalization error bounds and provide numerical evidence showing that better generalization can be achieved. Thus, our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
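One way to check numerically whether a trained weight matrix is "approximately rank two" is to measure how much of its Frobenius norm lies outside its best rank-$2$ approximation, which by the Eckart-Young theorem is determined by the trailing singular values. The following is a minimal sketch of such a diagnostic using NumPy; it is not the authors' evaluation code, the function name `rank_k_relative_residual` is hypothetical, and the matrices are synthetic stand-ins rather than trained weights.

```python
import numpy as np


def rank_k_relative_residual(V, k=2):
    """Relative Frobenius distance from V to its best rank-k approximation.

    Equals sqrt(sum of squared singular values beyond the k-th)
    divided by the Frobenius norm of V.
    """
    s = np.linalg.svd(V, compute_uv=False)  # singular values, descending
    total = np.sqrt(np.sum(s ** 2))
    if total == 0.0:
        return 0.0
    return float(np.sqrt(np.sum(s[k:] ** 2)) / total)


rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed shapes, not taken from the paper).
V_low = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))  # exactly rank 2
V_noisy = V_low + 1e-3 * rng.standard_normal((64, 32))               # near rank 2
V_full = rng.standard_normal((64, 32))                               # generic full rank

for name, V in [("rank-2", V_low), ("near rank-2", V_noisy), ("full rank", V_full)]:
    print(name, rank_k_relative_residual(V, k=2))
```

A residual near zero indicates that the matrix is close to rank two, while a generic dense matrix gives a residual close to one. Tracking this quantity over training with and without WD would be one way to observe the bias described above.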
