Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang

TL;DR

Fira is a memory-efficient, plug-and-play framework that achieves full-rank training of LLMs while keeping the optimizer under a low-rank constraint. It relies on two components: norm-based scaling, which reuses the scaling effect of the low-rank optimizer as a substitute for that of the full-rank optimizer, and a norm-growth limiter, which regulates the relative increase of gradient norms to avoid loss spikes. In pre-training and fine-tuning experiments across multiple model sizes, Fira reports substantial memory savings, outperforms LoRA and GaLore, and performs comparably to or better than full-rank training.

Abstract

Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. This raises a question: whether it is possible to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes? In this paper, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. Based on this observation, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to enable full-rank training. In this way, we can preserve the low-rank constraint in the optimizer while achieving full-rank training for better performance. Moreover, we find that there are sudden gradient rises during the optimization process, potentially causing loss spikes. To address this, we further put forward a norm-growth limiter to smooth the gradient via regulating the relative increase of gradient norms. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.
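
The two mechanisms described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the SVD-based projection, the single matrix-level scale factor, the Adam hyperparameters, the threshold `gamma=1.01`, and all function names below are assumptions made for illustration. The idea sketched is that Adam states are kept only for the projected (low-rank) gradient, the ratio between the norms of Adam's output and its input is reused to scale the part of the gradient outside the subspace, and the norm of that scaled part may grow by at most a factor `gamma` per step.

```python
import numpy as np

def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam moment update; returns the bias-corrected direction and new states."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def limit_norm_growth(update, prev_norm, gamma=1.01):
    """Rescale `update` so its norm is at most gamma times the previous norm."""
    norm = np.linalg.norm(update)
    if prev_norm is not None and prev_norm > 0 and norm > gamma * prev_norm:
        update = update * (gamma * prev_norm / norm)
        norm = gamma * prev_norm
    return update, norm

def init_state(rank, n_cols):
    """Optimizer state lives only in the low-rank (rank x n_cols) space."""
    return {"step": 0, "m": np.zeros((rank, n_cols)),
            "v": np.zeros((rank, n_cols)), "prev_norm": None}

def fira_like_step(grad, proj, state, gamma=1.01):
    """Low-rank Adam direction plus a norm-scaled, growth-limited residual."""
    state["step"] += 1
    low = proj.T @ grad                       # projected gradient, shape (r, n)
    direction, state["m"], state["v"] = adam_direction(
        low, state["m"], state["v"], state["step"])
    # Assumed matrix-level scale: how much Adam changed the norm of `low`.
    scale = np.linalg.norm(direction) / (np.linalg.norm(low) + 1e-12)
    residual = grad - proj @ low              # gradient part outside the subspace
    scaled, state["prev_norm"] = limit_norm_growth(
        scale * residual, state["prev_norm"], gamma)
    return proj @ direction + scaled          # update direction, same shape as grad

# Toy usage: an 8x6 gradient with a rank-2 projection from its left singular vectors.
rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 6))
u, _, _ = np.linalg.svd(grad, full_matrices=False)
proj = u[:, :2]
state = init_state(rank=2, n_cols=6)
update = fira_like_step(grad, proj, state)
```

In this sketch the Adam moments have shape `(2, 6)` rather than `(8, 6)`, which is where the memory saving would come from, while the returned update still includes gradient components outside the rank-2 subspace.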

Paper Structure (37 sections, 46 equations, 14 figures, 20 tables, 2 algorithms)

This paper contains 37 sections, 46 equations, 14 figures, 20 tables, 2 algorithms.

Figures (14)

  • Figure 1: Macro-level analysis of three types of memory-efficient approaches.
  • Figure 2: Training loss of different methods for pre-training LLaMA 60M on the C4 dataset ($r/d_{model} = 16/256$ and $T = 200$).
  • Figure 3: Training loss and gradient norm of three variants of Fira for pre-training LLaMA 60M.
  • Figure 4: Pre-training LLaMA 7B with different methods on the C4 dataset.
  • Figure 5: Validation perplexity of Fira and GaLore for varying ranks when pre-training LLaMA 60M on the C4 dataset with $d_{model} = 256$.
  • ...and 9 more figures

Theorems & Definitions (2)

  • proof
  • proof