SubTrack++: Gradient Subspace Tracking for Scalable LLM Training
Sahar Rajabi, Nayeema Nonta, Sirisha Rambhatla
TL;DR
Training large language models is constrained by memory and time. SubTrack++ combines Grassmannian gradient subspace tracking with a projection-aware optimizer and recovery scaling, enabling full-parameter training at the same memory footprint as existing state-of-the-art methods while substantially reducing training time. Theoretical convergence guarantees are accompanied by empirical evidence of state-of-the-art convergence across multiple model scales, with wall-time reductions of up to 65% in pre-training and 36% in fine-tuning. This approach offers a practical, scalable path toward democratizing access to large models without sacrificing performance.
Abstract
Training large language models (LLMs) is highly resource-intensive because of their massive parameter counts and the overhead of optimizer states. Recent work has aimed to reduce memory consumption, but such efforts often entail trade-offs among memory efficiency, training time, and model performance. True democratization of LLMs requires simultaneous progress across all three dimensions. To this end, we propose SubTrack++, which combines Grassmannian gradient subspace tracking with projection-aware optimizers, enabling Adam's internal statistics to adapt to subspace changes. In addition, recovery scaling, a technique that restores information lost through low-rank projections, further improves model performance. By exploiting Grassmannian geometry, our method achieves state-of-the-art (SOTA) convergence, reducing pre-training wall-time by up to 65% and fine-tuning time by 36% compared to existing SOTA methods, while maintaining the same memory footprint.
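To make the three ingredients named in the abstract concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a gradient matrix `G` of shape m x n and an orthonormal basis `S` of shape m x r for the tracked subspace. The function names, the column-wise form of recovery scaling, and the rotation of only the first moment are illustrative assumptions. The Grassmannian update of `S` itself is omitted.

```python
import numpy as np


def lowrank_adam_step(S, G, M, V, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step carried out in the r-dimensional subspace spanned by S.

    S: (m, r) orthonormal basis, G: (m, n) full gradient,
    M, V: (r, n) Adam moments stored in the subspace, t: step count (>= 1).
    Returns the full-size (m, n) update direction and the new moments.
    """
    G_low = S.T @ G                                  # project gradient: (r, n)
    M = beta1 * M + (1 - beta1) * G_low              # first moment
    V = beta2 * V + (1 - beta2) * G_low**2           # second moment
    M_hat = M / (1 - beta1**t)                       # bias correction
    V_hat = V / (1 - beta2**t)
    N_low = M_hat / (np.sqrt(V_hat) + eps)           # Adam direction: (r, n)

    update = S @ N_low                               # map back to (m, n)

    # Recovery scaling (assumed form): the part of G discarded by the
    # projection is rescaled column by column with the ratio by which Adam
    # rescaled the projected gradient, then added back to the update.
    residual = G - S @ G_low
    scale = np.linalg.norm(N_low, axis=0) / (np.linalg.norm(G_low, axis=0) + eps)
    update = update + residual * scale
    return update, M, V


def rotate_first_moment(S_old, S_new, M):
    """Projection-aware adjustment (assumed form): re-express the first
    moment, stored in the old basis, in the coordinates of the new basis."""
    return (S_new.T @ S_old) @ M
```

In this sketch the optimizer state costs 2·r·n numbers per weight matrix instead of 2·m·n, which is where the memory saving of low-rank methods comes from. A training loop would call `rotate_first_moment` whenever the tracked subspace is updated and `lowrank_adam_step` on every iteration.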
