Stabilizing Native Low-Rank LLM Pretraining

Paul Janson; Edouard Oyallon; Eugene Belilovsky

TL;DR

This work tackles the instability of training LLMs from scratch with exclusively low-rank weight parameterizations by introducing Spectron, which enforces spectral-norm constraints through adaptive spectral renormalization and gradient orthogonalization. By bounding updates via $\|\,\Delta W\|_2 \le \eta$ with $\rho = \eta/(\|A\|_2+\|B\|_2+1)$ and efficient norm estimation, Spectron enables stable end-to-end native low-rank pretraining without auxiliary full-rank components. Empirically, factorized transformers trained with Spectron match or surpass dense models under equal compute and demonstrate favorable scaling, with compute-optimal exponents $N_{opt} \propto C^{0.479}$ and $D_{opt} \propto C^{0.521}$, implying smaller, more data-driven configurations and substantial inference efficiency gains. The results suggest that native low-rank pretraining can democratize large-scale language modeling by reducing memory and compute requirements while preserving performance, and they provide a principled foundation for future extensions to multimodal architectures and communication-efficient training.
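To make the update bound concrete, here is a minimal NumPy sketch under stated assumptions; it is not the authors' implementation. It assumes the weight is factorized as $W = AB^\top$, that each factor moves by an orthogonalized gradient of spectral norm at most 1 scaled by $\rho$, and that norms are estimated by power iteration. Under those assumptions, $\|\Delta W\|_2 \le \rho(\|A\|_2 + \|B\|_2) + \rho^2 \le \eta$ whenever $\rho \le 1$. The function names are hypothetical, and the cubic Newton-Schulz iteration is a simple stand-in for whatever orthogonalization routine the paper uses.

```python
import numpy as np

def spectral_norm(M, iters=100, seed=0):
    """Estimate the largest singular value of M by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = M @ v
        u /= np.linalg.norm(u) + 1e-12
        v = M.T @ u
        v /= np.linalg.norm(v) + 1e-12
    return float(np.linalg.norm(M @ v))

def orthogonalize(G, iters=30):
    """Push the singular values of G toward 1 with cubic Newton-Schulz.

    Frobenius scaling puts all singular values in (0, 1]; the map
    s -> 1.5*s - 0.5*s**3 then keeps them in (0, 1] at every step.
    """
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def spectron_like_step(A, B, grad_A, grad_B, eta=0.05):
    """One illustrative update of the factors of W = A @ B.T (assumed layout)."""
    # Step scale from the TL;DR: rho = eta / (||A||_2 + ||B||_2 + 1).
    rho = eta / (spectral_norm(A) + spectral_norm(B) + 1.0)
    A_new = A - rho * orthogonalize(grad_A)
    B_new = B - rho * orthogonalize(grad_B)
    return A_new, B_new

# Small demo with random factors and stand-in gradients.
rng = np.random.default_rng(1)
A, B = rng.standard_normal((64, 8)), rng.standard_normal((48, 8))
gA, gB = rng.standard_normal((64, 8)), rng.standard_normal((48, 8))
A2, B2 = spectron_like_step(A, B, gA, gB, eta=0.05)
delta_W = A2 @ B2.T - A @ B.T
print("||dW||_2 =", np.linalg.norm(delta_W, 2), "(target bound eta = 0.05)")
```

The extra $+1$ in the denominator is what absorbs the second-order term $\rho^2$ from the product of the two factor updates, so the bound holds without tracking that cross term separately.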

Abstract

Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of a dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices, without the auxiliary full-rank guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant cause. To address this, we introduce Spectron (Spectral renormalization with orthogonalization), which dynamically bounds the resulting weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models.
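As a quick worked example of the power-law behavior, the sketch below plugs in the compute-optimal exponents quoted in the TL;DR ($N_{opt} \propto C^{0.479}$, $D_{opt} \propto C^{0.521}$) to show how the optimal parameter and token counts grow when the compute budget is multiplied. The function name is hypothetical, and the proportionality constants are not given here, so only growth ratios are computed.

```python
def compute_optimal_growth(compute_multiplier, n_exp=0.479, d_exp=0.521):
    """Return the growth factors of N_opt and D_opt when compute is
    multiplied by compute_multiplier, using the quoted exponents."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

n_growth, d_growth = compute_optimal_growth(10.0)
print(f"10x compute -> N_opt x{n_growth:.2f}, D_opt x{d_growth:.2f}")
# prints: 10x compute -> N_opt x3.01, D_opt x3.32
```

Because the two exponents sum to 1, the product of the two growth factors equals the compute multiplier, and the slightly larger data exponent means extra compute goes a little more toward tokens than toward parameters.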

Paper Structure (27 sections, 22 equations, 13 figures, 5 tables, 3 algorithms)

Introduction
Related Works
Background and Problem Formulation
Background
The Spectral Instability Problem in Low-Rank Training
Spectron: Spectral Renormalization and Orthogonalization
Experiments
Baselines
Comparison to Low-Rank Training Baselines
Comparison to Dense Model Training
Towards Compute-Optimal Low-Rank Pretraining
Conclusion
Algorithms
Ablations
Effect of Orthogonalization and Spectral Renormalization
...and 12 more sections

Figures (13)

Figure 1: Natively Low-Rank Training Achieves Dense-Level Performance. Validation loss curves comparing a 780M dense Transformer \cite{vaswani2017attention} (red) against our 454M low-rank factorized Transformer (blue) across $3.5 \times 10^6$ training TFLOPs on FineWeb \cite{penedo2024fineweb}. Our method, Spectron, enables stable end-to-end factorized training that matches dense performance at equal compute, yielding an inference-optimal model with substantially fewer parameters.
Figure 2: Low-Rank Parameterization Destabilizes Spectral Norm Dynamics. Weight update spectral norm ($\left\lVert \Delta W \right\rVert_{2}$) comparison between low-rank (green) and dense (gray) AdamW \cite{kingma2015adam} training on the layer 4 attention output projection of a Transformer \cite{vaswani2017attention}. Dense training maintains stable, bounded spectral norms, while low-rank factorization exhibits 10 to 30$\times$ higher spectral norm magnitudes, revealing that the factorized updates (Equation \ref{eqn:chain_rule}) fundamentally cause spectral instability.
Figure 3: Spectral Norm Constraints Stabilize Low-Rank Training. Comparison of (a) weight update spectral norm $\left\lVert \Delta W \right\rVert_{2}$, (b) activation RMS change $\left| \Delta y \right|_{\mathrm{rms}}$, and (c) weight spectral norm $\left\lVert W \right\rVert_{2}$ across 8000 training steps for the layer 4 attention output projection of a 94M-parameter Factorized Transformer \cite{vaswani2017attention}. AdamW \cite{kingma2015adam} (green, left axis) exhibits explosive growth in all metrics, with unconstrained spectral norm dynamics. Muon \cite{jordan2024muon} (red, right axis) achieves moderate control through gradient orthogonalization \cite{bernstein2024old}. Our method, Spectron (blue, right axis), maintains bounded spectral norms throughout training by adaptively constraining factor updates, demonstrating stable optimization. Note that the AdamW curves use a different y-axis scale (left) from Muon and Spectron (right) for visualization purposes.
Figure 4: Spectrally Normalized Low-Rank Training Outperforms Baselines. Validation loss on the FineWeb \cite{penedo2024fineweb} held-out set during Factorized Transformer-M (297M) pretraining, comparing Spectron (blue), self-guided training (red), and naive AdamW (green). Our approach achieves both faster initial convergence and superior final performance (Table \ref{tab:low-rank-comparison}), outperforming self-guided training despite the latter's auxiliary dense full-rank weights, while adding under 1% computational overhead, compared with the 25% additional FLOPs of self-guided training.
Figure 5: Low-Rank Factorization Matches Dense Performance with Longer Training. Validation loss comparison between Dense Transformer-L (780M parameters) and our Low-Rank Factorized Transformer-L (454M parameters) trained for equal FLOPs by matching training steps. Despite a $\sim 42\%$ parameter reduction, our factorized model (blue) converges to the same final validation loss as the dense baseline (red), demonstrating that compute-equivalent training yields an inference-optimal model.
...and 8 more figures
