SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

Yehonathan Refael, Guy Smorodinsky, Tom Tirer, Ofir Lindenbaum

TL;DR

SUMO addresses the slowdown in convergence observed with memory-efficient, low-rank optimizers by introducing exact SVD-based moment orthogonalization within a dynamically updated low-rank subspace. The authors show that first-order moments become increasingly low-rank during LLM training and that Newton-Schulz5 orthogonalization incurs error that grows with the moment condition number, whereas SVD offers faster, more stable convergence with manageable cost in low-rank regimes. They provide a convergence guarantee to an $\boldsymbol{\varepsilon}$-critical point and demonstrate through GLUE fine-tuning and LLaMA pre-training experiments that SUMO achieves faster optimization, improved stability, and memory reductions up to approximately $20\%$ compared to state-of-the-art methods. Overall, SUMO offers a practical, spectrally-informed optimization strategy that enhances memory efficiency while accelerating convergence in large-scale language model training.

Abstract

Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardware without sacrificing performance. However, these methods primarily emphasize memory savings, often overlooking potential acceleration in convergence due to their reliance on standard isotropic steepest descent techniques, which can perform suboptimally in the highly anisotropic landscapes typical of deep networks, particularly LLMs. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition (SVD) for moment orthogonalization within a dynamically adapted low-dimensional subspace, enabling norm-inducing steepest descent optimization steps. By explicitly aligning optimization steps with the spectral characteristics of the loss landscape, SUMO effectively mitigates approximation errors associated with commonly used methods like Newton-Schulz orthogonalization approximation. We theoretically establish an upper bound on these approximation errors, proving their dependence on the condition numbers of moments, conditions we analytically demonstrate are encountered during LLM training. Furthermore, we both theoretically and empirically illustrate that exact orthogonalization via SVD substantially improves convergence rates while reducing overall complexity. Empirical evaluations confirm that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
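The contrast between exact SVD orthogonalization and the Newton-Schulz approximation can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the test matrices are synthetic, and the quintic coefficients are an assumption taken from the commonly used Muon-style Newton-Schulz5 iteration, since the text above does not state them. The sketch measures how far the five-step approximation lands from the exact orthogonal factor $UV^\top$ for a well-conditioned and an ill-conditioned matrix.

```python
import numpy as np

def orthogonalize_svd(M):
    """Exact orthogonalization: replace all singular values of M by 1."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def orthogonalize_newton_schulz5(M, steps=5, eps=1e-7):
    """Approximate orthogonalization with a quintic Newton-Schulz iteration.

    Assumption: coefficients follow the widely used Muon-style iteration;
    they are not given in the source text.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)      # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]   # iterate on the smaller Gram matrix
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def matrix_with_condition_number(cond, n=16, r=8, seed=0):
    """Synthetic n x r matrix with log-spaced singular values in [1/cond, 1]."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))
    V, _ = np.linalg.qr(rng.standard_normal((r, r)))
    s = np.logspace(0, -np.log10(cond), r)
    return (U * s) @ V.T

def ns5_error(cond):
    """Frobenius distance between the Newton-Schulz5 output and the exact factor."""
    M = matrix_with_condition_number(cond)
    return np.linalg.norm(orthogonalize_newton_schulz5(M) - orthogonalize_svd(M))

if __name__ == "__main__":
    for cond in (1.0, 1e2, 1e4):
        print(f"condition number {cond:g}: NS5 error {ns5_error(cond):.3f}")
```

On these synthetic inputs the approximation error should grow with the condition number, because five iterations are too few to lift very small singular values close to 1. This is only meant to make the qualitative claim concrete; it says nothing about the paper's actual error bound.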

Paper Structure

This paper contains 24 sections, 7 theorems, 65 equations, 2 figures, 11 tables, 1 algorithm.

Key Result

Lemma 3.1

Let ${\bf M}^{(t)} \in \mathbb{R}^{n \times m}$ denote the first moment of a reversible layer (reversible networks are formally defined in the appendix on reversibility) in a moment-based optimization algorithm, updated according to ${\bf M}^{(t)} = \beta_1 {\bf M}^{(t-1)} + {\bf G}^{(t)},$ where ${\bf G}^{(t)}$ [...] satisfies $\kappa_M(t) \leq O(C^{-t})$ for some constant $C > 1$.

Figures (2)

  • Figure 1: Evidence of anisotropy and ill-conditioning in the first-order moment matrix as a function of GaLore steps for the RoBERTa-base model [liu2019roberta] on the RTE task of the GLUE benchmark [wang2019superglue]: (a) condition number growth, (b) spectral decay of the moment.
  • Figure 2: SUMO with SVD converges faster ($\sim\! 1.6\times$), attaining comparable or higher accuracy than GaLore and SUMO with Newton-Schulz5 in significantly fewer optimization steps on QNLI.

Theorems & Definitions (20)

  • Lemma 3.1: Moment Becomes Low-Rank During Training
  • Lemma 3.2: Orthogonalization error $\mathbf{\mathcal{E}}_{i}$
  • Lemma 3.3: Exact convergence rate of Muon
  • Remark 3.4: Comparison: slower convergence vs exact orthogonalization
  • Remark 3.5: The impact of $\delta$ on the convergence rate
  • Remark 3.6: The size of $\delta$
  • Remark 3.7: Speed-up by SVD vs Newton-Schulz5 approximation
  • Theorem 3.8: Convergence of SUMO
  • Proof: Proof of Lemma 3.1
  • Lemma A.1: Descent Lemma with Newton-Schulz Approximation Error
  • ...and 10 more