AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Yehonathan Refael, Jonathan Svirsky, Boris Shustin, Wasim Huleihel, Ofir Lindenbaum

TL;DR

AdaRankGrad introduces adaptive gradient-rank projections to enable memory-efficient full-parameter fine-tuning of large models. The approach is grounded in a theoretical gradient-rank vanishing phenomenon and uses an SSRF-based randomized SVD for fast subspace identification, with the optimizer moments transformed across subspaces to preserve training dynamics. The method reduces memory use while remaining competitive on GLUE fine-tuning, Geneformer omics fine-tuning, and LLaMA pre-training on C4, matching or outperforming the LoRA and GaLore baselines. The work offers a practical route to memory-efficient fine-tuning of language and biological foundation models, backed by convergence guarantees.

Abstract

Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements due to the increasing size of the model weights and the optimizer states. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA), which involves introducing a parallel trainable low-rank matrix to the fixed pre-trained weights at each layer. However, these methods often fall short compared to the full-rank weight training approach, as they restrict the parameter search to a low-rank subspace. This limitation can disrupt training dynamics and require a full-rank warm start to mitigate the impact. In this paper, we introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated layer gradients gradually decreases, and asymptotically approaches rank one. Leveraging this, our approach involves adaptively reducing the rank of the gradients during Adam optimization steps, using an efficient online-updating low-rank projections rule. We further present a randomized SVD scheme for efficiently finding the projection matrix. Our technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, significantly reducing overall memory requirements during training compared to state-of-the-art methods while improving model performance in both pretraining and fine-tuning. Finally, we provide a convergence analysis of our method and demonstrate its merits for training and fine-tuning language and biological foundation models.
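The projection step described in the abstract can be illustrated with a minimal NumPy sketch. It uses a generic randomized range finder with power iterations, not necessarily the paper's exact scheme. The function names, the `energy_threshold` parameter, and the rule of keeping the smallest rank that captures a target fraction of the squared Frobenius norm are all assumptions made for illustration.

```python
import numpy as np

def randomized_range_finder(G, rank, n_iter=2, seed=0):
    """Approximate an orthonormal basis Q (n x rank) for the dominant
    column space of G (n x m) using a Gaussian sketch and power iterations.
    Generic randomized range finder; hypothetical helper, not the paper's code."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    omega = rng.standard_normal((m, rank))   # random test matrix
    Q, _ = np.linalg.qr(G @ omega)           # orthonormalize the sketch
    for _ in range(n_iter):                  # power iterations sharpen the basis
        Z, _ = np.linalg.qr(G.T @ Q)
        Q, _ = np.linalg.qr(G @ Z)
    return Q

def adaptive_projection(G, max_rank, energy_threshold=0.9, seed=0):
    """Return (P, r): P is r x n with orthonormal rows, and r is the smallest
    rank whose projection keeps `energy_threshold` of ||G||_F^2 (capped at
    max_rank). The threshold rule is an assumption for this sketch."""
    Q = randomized_range_finder(G, max_rank, seed=seed)
    B = Q.T @ G                               # small (max_rank x m) matrix
    U_b, s, _ = np.linalg.svd(B, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(G**2)   # captured energy per rank
    r = min(int(np.searchsorted(energy, energy_threshold)) + 1, max_rank)
    P = (Q @ U_b[:, :r]).T                    # r x n projection matrix
    return P, r

# Usage on a synthetic, approximately rank-2 "gradient".
rng = np.random.default_rng(1)
G = (10.0 * np.outer(rng.standard_normal(64), rng.standard_normal(32))
     + np.outer(rng.standard_normal(64), rng.standard_normal(32))
     + 1e-6 * rng.standard_normal((64, 32)))
P, r = adaptive_projection(G, max_rank=8, energy_threshold=0.999)
G_hat = P @ G                                 # low-rank gradient, shape (r, 32)
print("selected rank:", r, "projected shape:", G_hat.shape)
```

In this sketch the optimizer would keep its Adam moments at the size of `G_hat` rather than `G`, which is where the memory saving comes from. The weight update would be mapped back to the full space with `P.T`.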

Paper Structure

This paper contains 23 sections, 3 theorems, 29 equations, 6 figures, 6 tables, 3 algorithms.

Key Result

Lemma (Asymptotically rank-one)

Consider a reversible neural network whose weights are updated with vanilla SGD. Then, for some constant $C>1$, ...

Figures (6)

  • Figure 1: Illustration of training with AdaRankGrad (Algorithm \ref{alg::AdaRankGrad}). First, the gradients ${\bf G}_t$ are projected into a 3D space (in this example), represented as $\hat{{\bf G}}_t^{3\times m}={\bf P}^{3\times n}_t{\bf G}_t^{n\times m}$. As training converges, the gradient's dimension decreases to a 2D space and then to a 1D space. This dimensionality reduction tracks convergence while using memory efficiently.
  • Figure 2: Exponential decay of the eigenvalues of the MLP layer's gradient at the first iteration of fine-tuning the RoBERTa-Base model \cite{liu2019roberta} on the MRPC task from GLUE \cite{wang2019superglue}. The red line indicates that 50% of the gradient information (in terms of squared norm ratio) is captured by the first eigenvalue, while the green line shows that 90% is contained within the first two eigenvalues.
  • Figure 3: Effective rank (see Section \ref{sec::experiments}) measured after every $100$ update steps on the RTE dataset from GLUE \cite{wang2019superglue}.
  • Figure 4: We present the effective rank measured for non-attention layers and corresponding memory reduction for AdaRankGrad trained on MRPC (left panel) and RTE (right panel) datasets from the GLUE benchmark.
  • Figure 5: The left graph presents the effective rank measured for different values of $\eta_{th}$ while training AdaRankGrad on the MRPC dataset. The right graph presents the effective rank measured for different values of $\eta_{th}$ while training AdaRankGrad on the MRPC dataset.
  • ...and 1 more figure
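The "squared norm ratio" and "effective rank" mentioned in the captions above can be made concrete with a short NumPy sketch. It assumes that the effective rank is the smallest number of leading singular values whose squared sum reaches a fraction $\eta_{th}$ of the squared Frobenius norm. The paper defines the quantity in its experiments section, which is not reproduced here, so this definition and the function names are assumptions.

```python
import numpy as np

def energy_ratio(G):
    """Cumulative fraction of ||G||_F^2 captured by the leading singular values."""
    s = np.linalg.svd(G, compute_uv=False)    # singular values, descending
    return np.cumsum(s**2) / np.sum(s**2)

def effective_rank(G, eta_th):
    """Smallest k whose top-k singular values capture at least eta_th of the
    squared Frobenius norm (assumed definition, for illustration only)."""
    ratios = energy_ratio(G)
    k = int(np.searchsorted(ratios, eta_th)) + 1
    return min(k, len(ratios))                # guard against rounding at eta_th = 1

# Toy 4 x 3 matrix with singular values 3, 2, 1 (squared energies 9, 4, 1).
G = np.zeros((4, 3))
G[0, 0], G[1, 1], G[2, 2] = 3.0, 2.0, 1.0

for eta in (0.6, 0.9, 0.99):
    print(f"eta_th={eta}: effective rank = {effective_rank(G, eta)}")
```

Tracking this quantity every few hundred steps, as in the effective-rank figures, would show whether the gradient spectrum concentrates on fewer directions as training proceeds.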

Theorems & Definitions (6)

  • Definition 1: Approximate low-rank matrix
  • Definition 2
  • Lemma: Asymptotically rank-one
  • Theorem: Convergence of Algorithm \ref{alg::AdaRankGrad}
  • Lemma: Convergence of low-rank optimization block
  • Proof: Proof of Lemma \ref{lem::ineer_convergance}