Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations
Alec S. Xu, Can Yaras, Matthew Asato, Qing Qu, Laura Balzano
TL;DR
This work reveals that training dynamics in nonlinear MLPs with smooth activations concentrate in invariant, low-dimensional subspaces, with a rigorous theory showing that, in two-layer networks trained by gradient descent, weight updates predominantly occur in a fixed subspace whose form is determined at initialization. Empirical evidence extends these findings beyond the theory, showing similar low-rank dynamics in deeper networks and under SGD/Adam with unwhitened data. Leveraging this insight, the authors construct a low-rank MLP parameterization that, when initialized in the appropriate subspaces, achieves near-equivalent classification performance to fully parameterized networks on datasets like Fashion MNIST and CIFAR-10. The results offer a principled explanation for observed low-dimensional training behavior and point toward practical low-rank training and fine-tuning approaches that preserve performance while reducing parameter counts and compute. Overall, the paper advances understanding of nonlinear training dynamics and provides a concrete path to effective low-rank representations in MLPs.
Abstract
Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. %compared to deep linear settings. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the classification performance of fully-parameterized counterparts on a variety of classification tasks.
