Randomized Gradient Subspaces for Efficient Large Language Model Training
Sahar Rajabi, Nayeema Nonta, Samanvay Vajpayee, Sirisha Rambhatla
TL;DR
This work tackles the memory bottleneck in large language model training by analyzing the gradient subspace dynamics and introducing two randomized, subspace-aware methods, GrassWalk and GrassJump. Through empirical analysis on LLaMA pretraining, it shows that a core subspace captures most gradient energy early but loses dominance over time and in deeper layers, with the gradient space exhibiting near-flat curvature. The authors propose a Grassmannian-based framework that random-walks or random-jumps on subspaces, coupled with optimizer adaptations (AO) and information-recovery scaling (RS), to preserve learning signals while achieving state-of-the-art memory efficiency. Empirical results on LLaMA-1B and 7B demonstrate strong memory savings and faster convergence, suggesting that randomized strategies, when aligned with gradient dynamics, can be principled and effective for scalable LLM training.
Abstract
Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent work mitigates this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of the gradient space and its underlying subspaces. We find that while a small subspace captures most of the gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit this subspace structure and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.
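To make the idea of gradient energy in a core subspace versus the residual bulk concrete, the following is a minimal sketch, assuming NumPy and a synthetic gradient matrix. It is not GrassWalk or GrassJump: the function names, the matrix sizes, and the choice of a top singular subspace as the "core" are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def top_subspace(grad, rank):
    """Orthonormal basis (m x rank) of the top-`rank` left singular subspace."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def random_subspace(m, rank, rng):
    """Orthonormal basis of a random rank-dimensional subspace of R^m,
    obtained from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((m, rank)))
    return q


def energy_fraction(grad, basis):
    """Fraction of the squared Frobenius norm of `grad` that survives
    projection onto the subspace spanned by the columns of `basis`."""
    projected = basis.T @ grad  # low-dimensional (rank x n) representation
    return float(np.sum(projected**2) / np.sum(grad**2))


rng = np.random.default_rng(0)
m, n, rank = 64, 128, 8  # hypothetical layer shape and subspace rank

# Synthetic "gradient": a strong low-rank component plus dense noise,
# standing in for a core subspace and a residual bulk.
low_rank = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
grad = low_rank + 0.5 * rng.standard_normal((m, n))

core = top_subspace(grad, rank)
rand = random_subspace(m, rank, rng)

core_frac = energy_fraction(grad, core)   # energy kept by the core subspace
rand_frac = energy_fraction(grad, rand)   # energy kept by a random subspace
residual_frac = 1.0 - core_frac           # energy left in the residual bulk
```

In this toy setting the core subspace keeps most, but not all, of the energy, and an optimizer that stores its states only for the `rank x n` projected gradient would need less memory than one tracking the full `m x n` matrix. How the subspace is chosen and updated during training is exactly what the paper's methods address, and this sketch does not attempt to reproduce that.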
