Flora: Low-Rank Adapters Are Secretly Gradient Compressors

Yongchang Hao, Yanshuai Cao, Lili Mou

TL;DR

Flora addresses the memory bottleneck in training large neural networks by reinterpreting LoRA as gradient compression via random projections. By resampling the projection matrices during training, Flora achieves high-rank updates while keeping optimization-state memory sublinear in model size. Experiments on summarization and translation tasks show that Flora matches or closely approaches full-matrix updates with substantial memory savings, and outperforms LoRA in many settings. The approach is complementary to existing memory-saving techniques and applies to a range of architectures.

Abstract

Despite large neural networks demonstrating remarkable abilities to complete different tasks, they require excessive memory usage to store the optimization states for training. To alleviate this, the low-rank adaptation (LoRA) is proposed to reduce the optimization states by training fewer parameters. However, LoRA restricts overall weight update matrices to be low-rank, limiting the model performance. In this work, we investigate the dynamics of LoRA and identify that it can be approximated by a random projection. Based on this observation, we propose Flora, which is able to achieve high-rank updates by resampling the projection matrices while enjoying the sublinear space complexity of optimization states. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach.
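The compression view described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the shapes, and the choice of i.i.d. $\mathcal{N}(0, 1/r)$ entries for the projection matrix are assumptions made for illustration. The sketch down-projects a gradient from $n \times m$ to $n \times r$, up-projects it with the same matrix, and checks that averaging over freshly sampled projections recovers the gradient far better than a single round trip does.

```python
import numpy as np

def sample_projection(r, m, rng):
    """Random projection A of shape (r, m) with i.i.d. N(0, 1/r) entries,
    chosen so that E[A^T A] is the identity (an assumption of this sketch)."""
    return rng.normal(0.0, 1.0 / np.sqrt(r), size=(r, m))

def compress(grad, proj):
    """Down-project an (n, m) gradient to (n, r)."""
    return grad @ proj.T

def decompress(compressed, proj):
    """Up-project an (n, r) compressed gradient back to (n, m)."""
    return compressed @ proj

rng = np.random.default_rng(0)
n, m, r = 16, 32, 4
grad = rng.normal(size=(n, m))  # stand-in gradient, not from a real model

# A single round trip stores only n*r numbers but is a noisy estimate.
proj = sample_projection(r, m, rng)
compressed = compress(grad, proj)
one_shot = decompress(compressed, proj)

# Averaging round trips over fresh projections approaches the true gradient.
num_samples = 2000
avg = np.zeros_like(grad)
for _ in range(num_samples):
    p = sample_projection(r, m, rng)
    avg += decompress(compress(grad, p), p)
avg /= num_samples

rel_err_one = np.linalg.norm(one_shot - grad) / np.linalg.norm(grad)
rel_err_avg = np.linalg.norm(avg - grad) / np.linalg.norm(grad)
print(f"single projection: {rel_err_one:.3f}, averaged: {rel_err_avg:.3f}")
```

The compressed array holds $n \times r$ values instead of $n \times m$, which is where the memory saving in the optimization state would come from under these assumptions.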

Paper Structure

This paper contains 39 sections, 9 theorems, 32 equations, 2 figures, 6 tables, 2 algorithms.

Key Result

Theorem 2.1

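Let LoRA update matrices $A$ and $B$ with SGD at every step $t$ by

$$A_{t+1} \leftarrow A_t - \eta B_t^\top (\nabla_W \mathcal{L}_t), \qquad B_{t+1} \leftarrow B_t - \eta (\nabla_W \mathcal{L}_t) A_t^\top,$$

where $\eta$ is the learning rate. We assume $\| \sum_{t=0}^{T} \nabla_W \mathcal{L}_t \|_F \le L$ for every $T$ during training, which implies that the model stays within a finite Euclidean ball. In this case, the dynamics of $A_t$ and $B_t$ are given by

$$A_t = A_0 + \eta A_0 f_A(t), \qquad B_t = \eta f_B(t) A_0^\top,$$

where the forms of $f_A(t) \in \mathbb{R}^{m \times m}$ and $f_B(t)$ ...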
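The SGD recursion on the two LoRA factors can be simulated on a toy problem. The sketch below is illustrative only and rests on several assumptions that are not stated in the text above: the adapted weight is $W + BA$ with $B \in \mathbb{R}^{n \times r}$ and $A \in \mathbb{R}^{r \times m}$, $B_0 = 0$, $A_0$ is Gaussian, and the loss is a hypothetical quadratic $\frac{1}{2}\|W + BA - Y\|_F^2$. The per-factor gradients follow from the chain rule.

```python
import numpy as np

def lora_sgd_step(A, B, grad_W, lr):
    """One SGD step on the LoRA factors given the gradient w.r.t. the
    full weight. By the chain rule for W_eff = W + B @ A:
        dL/dA = B^T @ grad_W   (shape r x m)
        dL/dB = grad_W @ A^T   (shape n x r)
    """
    new_A = A - lr * B.T @ grad_W
    new_B = B - lr * grad_W @ A.T
    return new_A, new_B

rng = np.random.default_rng(0)
n, m, r, lr = 6, 5, 2, 0.01
W = rng.normal(size=(n, m))       # frozen base weight
target = rng.normal(size=(n, m))  # hypothetical regression target Y
A0 = rng.normal(0.0, 1.0 / np.sqrt(r), size=(r, m))
B0 = np.zeros((n, r))             # assumed zero initialization of B

def grad_wrt_weight(A, B):
    # Toy loss 0.5 * ||W + B A - Y||_F^2, whose gradient is the residual.
    return W + B @ A - target

g0 = grad_wrt_weight(A0, B0)
A1, B1 = lora_sgd_step(A0, B0, g0, lr)
g1 = grad_wrt_weight(A1, B1)
A2, B2 = lora_sgd_step(A1, B1, g1, lr)

# With B0 = 0 the first step leaves A unchanged and sets B1 = -lr * g0 @ A0^T.
print(np.allclose(A1, A0), np.allclose(B1, -lr * g0 @ A0.T))
# After two steps, A2 equals A0 plus A0 right-multiplied by an m x m matrix.
print(np.allclose(A2, A0 + lr * A0 @ (lr * g0.T @ g1)))
```

In this toy run, $A_2 = A_0 + \eta A_0 (\eta\, G_0^\top G_1)$ with $G_t = \nabla_W \mathcal{L}_t$, so the change in $A$ is second order in $\eta$ and takes the form of $A_0$ multiplied on the right by an $m \times m$ matrix.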

Figures (2)

  • Figure 1: The results of LoRA and its simplifications. We apply the LoRA patch to the first layer of the network, which has shape $768\times768$, and set $r=8$. In the legend, LoRA is the original LoRA method, while LoRA(B) is the simplification where only the matrix $B$ is updated. RP (random projection) and RRP (resampled RP) follow the same random-projection update rule, but RRP uses different projection matrices at different steps. In addition, we show the results of SGD on the full model for comparison. All experiments use the same $\eta = 0.01$.
  • Figure 2: Memory usage profiled by category over four training iterations.
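The difference between RP and RRP in Figure 1 can be made concrete with a small rank check. The sketch below assumes the random-projection update has the form $W \leftarrow W - \eta\, G_t A^\top A$ with a Gaussian $A \in \mathbb{R}^{r \times m}$. That form, the stand-in random gradients, and all sizes are assumptions for illustration, not the experimental setup behind the figure. With a fixed $A$ the accumulated update has rank at most $r$, while resampling $A$ at every step lets the rank grow, up to $\min(n, m, r \cdot \text{steps})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, steps, lr = 12, 12, 2, 5, 0.01

def sample_projection():
    """Gaussian projection of shape (r, m) with N(0, 1/r) entries (assumed)."""
    return rng.normal(0.0, 1.0 / np.sqrt(r), size=(r, m))

# Stand-in gradients; a real run would take these from backpropagation.
grads = [rng.normal(size=(n, m)) for _ in range(steps)]

fixed_proj = sample_projection()
delta_rp = np.zeros((n, m))   # accumulated update with one fixed projection
delta_rrp = np.zeros((n, m))  # accumulated update with resampled projections
for g in grads:
    delta_rp -= lr * g @ fixed_proj.T @ fixed_proj
    fresh = sample_projection()
    delta_rrp -= lr * g @ fresh.T @ fresh

# An explicit absolute tolerance keeps rounding noise from inflating the rank.
rank_rp = np.linalg.matrix_rank(delta_rp, tol=1e-10)
rank_rrp = np.linalg.matrix_rank(delta_rrp, tol=1e-10)
print(f"rank with fixed projection: {rank_rp}, with resampling: {rank_rrp}")
```

The fixed-projection update factors as $(\sum_t G_t) A^\top A$, which is why its rank cannot exceed $r$ no matter how many steps are taken.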

Theorems & Definitions (16)

  • Theorem 2.1
  • proof
  • Lemma 2.3 (Indyk and Motwani, 1998)
  • Theorem 2.4
  • proof
  • Theorem 1.1
  • Lemma 1.1
  • Lemma 1.1
  • proof
  • proof: Proof of Theorem \ref{thm:lora-rp}
  • ...and 6 more