Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Xi Chen; Kaituo Feng; Changsheng Li; Xunhao Lai; Xiangyu Yue; Ye Yuan; Guoren Wang

TL;DR

Fira presents a memory-efficient framework that enables full-rank training for LLMs under a low-rank constraint by using norm-based scaling to approximate full-gradient corrections and a norm-growth limiter to stabilize optimization. The approach preserves optimizer-state information while maintaining a low-rank subspace, yielding performance that matches or surpasses full-rank training in pre-training and fine-tuning across multiple model sizes. Extensive experiments demonstrate substantial memory savings and robust improvements over LoRA and GaLore, validating Fira's practicality for large-scale LLM training. This work offers a scalable, plug-and-play pathway to high-performance, memory-efficient LLM training beyond existing low-rank methods.

Abstract

Low-rank training has emerged as a promising approach for reducing memory usage when training Large Language Models (LLMs). Previous methods either decompose weight matrices (e.g., LoRA) or decompose gradient matrices (e.g., GaLore) to reduce memory consumption. However, both constrain training to a low-rank subspace, which inevitably leads to sub-optimal performance. This raises a question: is it possible to consistently preserve the low-rank constraint for memory efficiency while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes? In this paper, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. Based on this observation, we propose a norm-based scaling method, which uses the scaling impact of low-rank optimizers as a substitute for that of the original full-rank optimizers to enable full-rank training. In this way, we can preserve the low-rank constraint in the optimizer while achieving full-rank training for better performance. Moreover, we find that there are sudden gradient rises during the optimization process, potentially causing loss spikes. To address this, we further put forward a norm-growth limiter that smooths the gradient by regulating the relative increase of gradient norms. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.
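The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is one possible reading of the abstract and not the authors' implementation. The function names, the single matrix-wide scaling factor, the choice to apply the limiter to the scaled residual, and the threshold `gamma = 1.01` are all assumptions made for illustration.

```python
import numpy as np

def adam_direction(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on `grad`; returns the bias-corrected update direction."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (np.sqrt(v_hat) + eps)

def limit_norm_growth(update, prev_norm, gamma=1.01):
    """Rescale `update` if its norm grew by more than a factor `gamma`
    relative to the previous step (gamma is an assumed threshold)."""
    norm = np.linalg.norm(update)
    if prev_norm is not None and prev_norm > 0 and norm > gamma * prev_norm:
        update = update * (gamma * prev_norm / norm)
        norm = gamma * prev_norm
    return update, norm

def fira_like_step(grad, proj, state, prev_norm=None, gamma=1.01):
    """Hypothetical full-rank update whose optimizer state stays low-rank.

    grad: (m, n) full gradient; proj: (m, r) matrix with orthonormal columns.
    """
    low = proj.T @ grad                       # (r, n) projected gradient
    low_update = adam_direction(low, state)   # Adam state is only r x n
    # Norm-based scaling: reuse the low-rank optimizer's scaling effect
    # for the part of the gradient that lies outside the subspace.
    scale = np.linalg.norm(low_update) / (np.linalg.norm(low) + 1e-12)
    residual = scale * (grad - proj @ low)
    # Norm-growth limiter: damp sudden jumps in the residual's norm.
    residual, norm = limit_norm_growth(residual, prev_norm, gamma)
    return proj @ low_update + residual, norm

# Toy usage with random data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
grad = rng.standard_normal((m, n))
proj, _ = np.linalg.qr(rng.standard_normal((m, r)))  # orthonormal columns
state = {"t": 0, "m": np.zeros((r, n)), "v": np.zeros((r, n))}
update, prev_norm = fira_like_step(grad, proj, state)
```

In this sketch the Adam moments have shape `(r, n)` instead of `(m, n)`, which is where the memory saving would come from, while the returned update is no longer confined to the rank-`r` subspace spanned by `proj`.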

Paper Structure

This paper contains 37 sections, 46 equations, 14 figures, 20 tables, 2 algorithms.

Introduction
Related Work
Preliminaries
Low-Rank Adaptation
Gradient Low-Rank Projection
Proposed Method
Norm-Based Scaling
Norm-Growth Limiter
Overall Algorithm
Experiments
Memory-Efficient Pre-training
Scaling up to LLaMA 7B Pre-training
Memory-Efficient Fine-Tuning
Ablation Study
Performance under Varying Ranks
...and 22 more sections

Figures (14)

Figure 1: Macro-level analysis of three types of memory-efficient approaches.
Figure 2: Training loss of different methods for pre-training LLaMA 60M on the C4 dataset ($r/d_{model} = 16/256$ and $T = 200$).
Figure 3: Training loss and gradient norm of three variants of Fira for pre-training LLaMA 60M.
Figure 4: Pre-training LLaMA 7B with different methods on the C4 dataset.
Figure 5: Validation perplexity of Fira and GaLore for varying ranks when pre-training LLaMA 60M on the C4 dataset with $d_{model} = 256$.
...and 9 more figures

