Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations

Alec S. Xu; Can Yaras; Matthew Asato; Qing Qu; Laura Balzano

TL;DR

This work shows that training dynamics in nonlinear MLPs with smooth activations concentrate in invariant, low-dimensional subspaces. For two-layer networks trained by gradient descent, a rigorous theory shows that weight updates occur predominantly in a fixed subspace whose form is determined at initialization. Empirical evidence extends these findings beyond the theory, showing similar low-rank dynamics in deeper networks and under SGD/Adam with unwhitened data. Leveraging this insight, the authors construct a low-rank MLP parameterization that, when initialized in the appropriate subspaces, achieves classification performance nearly equivalent to that of fully parameterized networks on datasets such as Fashion MNIST and CIFAR-10. The results offer a principled explanation for observed low-dimensional training behavior and point toward practical low-rank training and fine-tuning approaches that preserve performance while reducing parameter counts and compute. Overall, the paper advances understanding of nonlinear training dynamics and provides a concrete path to effective low-rank representations in MLPs.
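To make the parameter saving concrete, here is a minimal NumPy sketch of a generic rank-r factorized first layer. It is not the authors' construction: the factorization W1 ≈ U Vᵀ, the SiLU activation, the function names, and all sizes are assumptions chosen for illustration, and the random factors stand in for the subspace-aware initialization described above.

```python
import numpy as np

def silu(z):
    """SiLU activation, one example of a smooth nonlinearity."""
    return z / (1.0 + np.exp(-z))

def dense_param_count(m, d):
    """Parameters in a full m x d first-layer weight matrix."""
    return m * d

def factorized_param_count(m, d, r):
    """Parameters in U (m x r) plus V (d x r)."""
    return r * (m + d)

def factorized_forward(U, V, W2, X):
    """Two-layer forward pass with W1 replaced by U @ V.T (hypothetical form).

    V.T @ X is computed first so the cost scales with r rather than m * d.
    """
    return W2 @ silu(U @ (V.T @ X))

# Hypothetical sizes: hidden width m, input dim d, classes K, rank r, batch n.
m, d, K, r, n = 256, 128, 10, 20, 5
rng = np.random.default_rng(0)
U = rng.standard_normal((m, r)) / np.sqrt(r)   # random stand-in initialization
V = rng.standard_normal((d, r)) / np.sqrt(d)
W2 = rng.standard_normal((K, m)) / np.sqrt(m)
X = rng.standard_normal((d, n))

out = factorized_forward(U, V, W2, X)
print(out.shape, dense_param_count(m, d), factorized_param_count(m, d, r))
```

With these illustrative sizes the factorized layer stores r(m + d) values instead of md. Whether accuracy is preserved depends on how U and V are initialized, which is the point the paper addresses.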

Abstract

Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the classification performance of fully-parameterized counterparts on a variety of classification tasks.
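One way to probe the phenomenon described in the abstract is to train a small two-layer network with full-batch gradient descent and inspect the singular values of the accumulated first-layer update. The sketch below is a toy experiment, not the paper's protocol: the SiLU activation, the linear-teacher targets, the sizes, the learning rate, and the choice to train both layers are all assumptions. Only the whitened inputs and the scaled-orthogonal initialization $\bm W_1^\top(0) \bm W_1(0) = \epsilon^2 \bm I_d$ are taken from the summary on this page.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def silu_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 + z * (1.0 - s))

def train_and_measure(d=20, K=2, m=32, n=200, eps=1e-2, lr=0.1,
                      steps=500, seed=0):
    """Train f(X) = W2 silu(W1 X) on 0.5 * ||f(X) - Y||_F^2 with plain GD.

    Returns W1 at initialization, W1 after training, and the singular
    values of their difference (sorted in descending order).
    """
    rng = np.random.default_rng(seed)

    # Whitened inputs: X is d x n with X @ X.T = I_d.
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    X = Q.T

    # Assumed targets from a random linear teacher (not from the paper).
    A = rng.standard_normal((K, d)) / np.sqrt(d)
    Y = A @ X

    # Scaled-orthogonal initialization: W1.T @ W1 = eps^2 * I_d.
    U, _ = np.linalg.qr(rng.standard_normal((m, d)))
    W1 = eps * U
    W2 = rng.standard_normal((K, m)) / np.sqrt(m)
    W1_init = W1.copy()

    for _ in range(steps):
        Z = W1 @ X                      # pre-activations, m x n
        H = silu(Z)                     # hidden features, m x n
        R = W2 @ H - Y                  # residual, K x n
        grad_W2 = R @ H.T
        grad_W1 = ((W2.T @ R) * silu_grad(Z)) @ X.T
        W1 = W1 - lr * grad_W1
        W2 = W2 - lr * grad_W2

    sv = np.linalg.svd(W1 - W1_init, compute_uv=False)
    return W1_init, W1, sv

W1_init, W1_final, sv = train_and_measure()
energy = np.cumsum(sv**2) / np.sum(sv**2)
print("fraction of update energy in the leading singular values:", energy[:6])
```

If the low-rank picture carries over to this toy setting, the printed cumulative energy should saturate after a handful of singular values. The sketch makes no claim about the exact numbers.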

Paper Structure (83 sections, 14 theorems, 142 equations, 15 figures)

Introduction
Related work.
Problem Setup
Notation.
Data.
Network architecture.
Training.
Case Study: Smooth Activations Encourage Lower-Rank Training Dynamics
Analysis on Two-Layer Nonlinear Networks
Definitions and Assumptions
Main Results
Discussion on Theorem 3.2.
Experimental verification.
Proof Sketch of Theorem 3.2
Approximate low-rank gradient at initialization.
...and 68 more sections

Key Result

Theorem 3.2

Recall $p := d - 2K$, where $d$ is the data dimension and $K$ is the label dimension. Let $\bm L_{1, 1}(t)$ and $\bm R_{1, 1}(t)$ denote the top-$K$ left and right singular subspaces of $\nabla_{\bm W_1} \mathcal{L}\left( \bm W_1(t) \right)$, and define the singular subspace alignment to initialization [...]. Suppose $\bm W_1(0)$ satisfies $\bm W_1^\top(0) \bm W_1(0) = \epsilon^2 \bm I_d$ with $\epsilon \le$ ...
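The quantities named in the theorem can be computed numerically. The sketch below extracts the top-$K$ left and right singular subspaces of a matrix (standing in for the gradient with respect to $\bm W_1$) and evaluates principal angles between two subspaces using the standard characterization: the cosines of the angles are the singular values of $Q_A^\top Q_B$ for orthonormal bases $Q_A$, $Q_B$. The function names are hypothetical, and the paper's specific alignment measure, which is cut off above, is not reproduced here.

```python
import numpy as np

def top_k_singular_subspaces(G, k):
    """Orthonormal bases for the top-k left and right singular subspaces of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k], Vt[:k, :].T

def principal_angles(A, B):
    """Principal angles (radians, ascending) between span(A) and span(B).

    A and B are matrices whose columns span the subspaces; both are
    assumed to have full column rank.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against round-off pushing cosines slightly above 1.
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Usage: compare the top-K left singular subspaces of two random matrices
# (illustrative stand-ins for gradients at two different times).
rng = np.random.default_rng(0)
G0 = rng.standard_normal((32, 20))
G1 = G0 + 0.01 * rng.standard_normal((32, 20))
L0, R0 = top_k_singular_subspaces(G0, 2)
L1, R1 = top_k_singular_subspaces(G1, 2)
print(principal_angles(L0, L1))
```

Angles near zero indicate that the two subspaces nearly coincide, while angles near $\pi/2$ indicate near-orthogonality.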

Figures (15)

Figure 1: Low-rank updates in MLPs with smooth activation functions. Each plot shows how the singular values of the first layer (out of four layers total) evolve throughout training in MLPs with $\operatorname{ELU}$, $\operatorname{GELU}$, and $\operatorname{SiLU}$ activation functions, which are all smooth. We trained each MLP on synthetic data and squared-error loss using gradient descent. Specific experimental details and additional plots of the deeper layer singular values are in the paper's additional-simulations subsection.
Figure 2: The middle singular subspace of the first-layer weight matrix in the $\operatorname{ELU}$ network evolves noticeably slower than that in the $\operatorname{ReLU}$ network, and the corresponding singular values remain closer to their initialization.
Figure 3: Under our exact theoretical setting, $\widetilde{\bm W}_{1, 1}(t)$ accounts for almost all of the change in $\widetilde{\bm W}_1(t)$, and thus $\bm W_1(t)$.
Figure 4: For every $l \in [L - 1]$ in deep MLPs with smooth activations, $\widetilde{\bm W}_{l, 1}(t)$ accounts for almost all of the change in $\widetilde{\bm W}_l(t)$.
Figure 5: Training deep MLPs with SGD plus momentum (top row), or with Adam (bottom row), on unwhitened input data using cross-entropy loss approximately maintains the previously observed low-rank training dynamics.
...and 10 more figures

Theorems & Definitions (27)

Definition 3.1: Principal angles between subspaces
Theorem 3.2: Simplified
Proposition B.1
Lemma D.1
proof
Lemma D.2
proof
Lemma D.3
proof
Lemma D.4
...and 17 more
