InRank: Incremental Low-Rank Learning

Jiawei Zhao; Yifei Zhang; Beidi Chen; Florian Schäfer; Anima Anandkumar

TL;DR

This paper generalizes Greedy Low-Rank Learning (GLRL) by removing its impractical infinitesimal-initialization assumption: it shows that cumulative weight updates follow an incremental low-rank trajectory under arbitrary orthogonal initialization in a three-layer linear network. It then introduces Incremental Low-Rank Learning (InRank), a practical scheme that parameterizes the updates as low-rank matrices and increases their rank during training according to an explained-ratio criterion, so that the intrinsic rank of the network is discovered automatically. Empirically, InRank achieves performance comparable to full-rank GPT-2 training on WikiText-103 while using at most 33% of the total ranks, and an InRank-Efficient variant reduces training time by up to 37% and model size by up to 36%, with modest memory gains. The authors also report that the underlying theory holds empirically across a broad range of networks (e.g., transformers) and standard optimizers (e.g., SGD, Adam).

Abstract

The theory of greedy low-rank learning (GLRL) aims to explain the impressive generalization capabilities of deep learning. It proves that stochastic gradient-based training implicitly regularizes neural networks towards low-rank solutions through a gradual increase of the rank during training. However, there is a gap between theory and practice, since GLRL requires an infinitesimal initialization of the weights, which is impractical because it is a saddle point. In this work, we remove the assumption of infinitesimal initialization by focusing on cumulative weight updates. We prove that the cumulative weight updates follow an incremental low-rank trajectory for arbitrary orthogonal initialization of weights in a three-layer linear network. Empirically, we demonstrate that our theory holds on a broad range of neural networks (e.g., transformers) and standard training algorithms (e.g., SGD, Adam). However, existing training algorithms do not exploit the low-rank property to improve computational efficiency, as the networks are not parameterized in low-rank form. To remedy this, we design a new training algorithm, Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices while incrementally augmenting their ranks during training. We evaluate InRank on GPT-2, and our results indicate that InRank achieves prediction performance comparable to its full-rank counterpart while requiring at most 33% of the total ranks throughout training. We also propose an efficient version of InRank that achieves a reduction of 37% in total training time and 36% in model size when training GPT-medium on WikiText-103 from scratch.

Paper Structure (28 sections, 3 theorems, 14 equations, 21 figures, 5 tables, 3 algorithms)

Introduction
Related Work
Preliminary: Greedy Low-Rank Learning
Cumulative Weight Updates follow Low-Rank Learning Trajectory
Incremental Learning
Evaluation
Automatic Rank Determination
InRank-Efficient
Comparison
Selection of Threshold $\alpha$
Conclusion
Social Impact
Proof
Clarification on Greedy Low-Rank Learning
InRank-Efficient
...and 13 more sections

Key Result

Theorem 1

If $W^1,...,W^L$ are initialized to be infinitesimal, then the product matrix $A_{\theta}$ follows a greedy low-rank learning trajectory, such that gradient descent first searches over a rank-1 subspace of $A_{\theta}$, and then greedily increases the rank by one whenever it fails to reach a local mi

Figures (21)

Figure 1: Incremental Low-Rank Learning from iteration $t_1$ to $t_2$. $U$ and $V$ represent any factorized layer. Density plots indicate the strength of each singular vector (normalized by the total strengths). Solid areas represent how much information in the spectrum is explained by the current rank $r_t$ at iteration $t$. From iteration $t_1$ to $t_2$, InRank adds $r_{2} - r_{1}$ additional ranks to ensure the ratio of the explained information is greater than a certain threshold $\alpha$.
Figure 2: $u_f(t)$ follows a low-rank learning trajectory regardless of $s$ and $u_0$. We generate a set of $s$ with $s_i = a \times i$, $i=1,...,10$, while varying $a$ from $0.1$ to $1.0$. We also generate a set of $u_0$ with $u_0 \sim \mathcal{N}(0, b^2)$. Darker colors indicate singular vectors with higher strengths.
Figure 3: The evolutions of all singular vectors of cumulative weight updates $D_t$ over training under different initializations. They are evaluated on the training of a 2-layer perceptron on Fashion MNIST. Darker colors indicate singular vectors with higher strengths.
Figure 4: Identifying intrinsic rank in GPT-small on WikiText-103. The cross marker signifies the rank determined by InRank. The rank varies from 25 to 400.
Figure 5: The rank evolution in various MLP layers when applying InRank on GPT-small model.
...and 16 more figures
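
The threshold rule described in the Figure 1 caption can be sketched as follows. This is an illustration under assumptions, not the paper's exact formula: the explained ratio is taken here to be the sum of the top-$r$ singular values divided by the sum of all singular values, and the function names `explained_ratio` and `smallest_rank_for_threshold` are hypothetical.

```python
import numpy as np


def explained_ratio(singular_values, rank):
    """Share of total singular-value mass covered by the top `rank` values.

    Assumption: strengths are plain singular values normalized by their
    sum; a squared (energy) variant would be an equally plausible choice.
    """
    s = np.asarray(singular_values, dtype=float)
    total = s.sum()
    if total == 0.0:
        return 1.0
    return float(s[:rank].sum() / total)


def smallest_rank_for_threshold(matrix, alpha):
    """Smallest rank whose explained ratio is at least `alpha`."""
    # np.linalg.svd returns singular values in descending order.
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0
    cumulative = np.cumsum(s) / total
    # First index where the cumulative ratio reaches alpha, as a rank.
    rank = int(np.searchsorted(cumulative, alpha, side="left")) + 1
    return min(rank, len(s))


example = np.diag([6.0, 3.0, 1.0])
rank_low = smallest_rank_for_threshold(example, 0.5)
rank_mid = smallest_rank_for_threshold(example, 0.8)
rank_high = smallest_rank_for_threshold(example, 0.95)
```

In a training loop, a rule of this kind would be evaluated periodically on the current factorized update, and the rank would be raised from $r_1$ to $r_2$ whenever the current rank no longer meets the threshold $\alpha$.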

Theorems & Definitions (4)

Theorem 1: Greedy Low-Rank Learning, informal
Definition 1: Cumulative Weight Updates
Theorem 2
Theorem 3
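
As a small illustration of Definition 1, the sketch below assumes that the cumulative weight update at iteration $t$ is $D_t = W_t - W_0$, i.e. the sum of all gradient steps taken so far; this form is an assumption, since the definition itself is not reproduced on this page. The toy least-squares problem, the variable names, and the hyperparameters are hypothetical, and the example only shows how $D_t$ and its spectrum can be computed, not the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with a rank-2 target map (hypothetical setup).
X = rng.normal(size=(64, 8))
W_true = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))
Y = X @ W_true

# Initialization that is deliberately not infinitesimal.
W0 = rng.normal(size=(8, 8))
W = W0.copy()
summed_steps = np.zeros_like(W0)

lr = 0.05
for _ in range(200):
    grad = X.T @ (X @ W - Y) / len(X)  # gradient of the mean squared error
    step = -lr * grad
    W += step
    summed_steps += step

# Cumulative weight update (assumed form): current weights minus initialization.
D = W - W0

# Singular values of D, in descending order, i.e. the "strengths"
# that the figures above visualize over training.
spectrum = np.linalg.svd(D, compute_uv=False)
```

Tracking `spectrum` at several iterations, rather than only at the end, is how one would inspect whether the update matrix stays approximately low-rank during training.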
