MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang

TL;DR

This paper questions the effectiveness of low-rank updates in LoRA for learning new knowledge in large language models. It introduces MoRA, a high-rank updating framework that uses a square trainable matrix supplemented by non-parameter compression/decompression operators to preserve the parameter budget and enable merging back into the original model. Across memorization, instruction tuning, reasoning, continual pretraining, and pretraining-from-scratch, MoRA demonstrates superior performance on memory-intensive tasks and competitive results on others, outperforming LoRA in several domains due to higher effective rank. The work also analyzes the impact of decompression/compression strategies and shows that high-rank updating correlates with richer ΔW spectra and lower perplexities, underscoring the practical benefits of MoRA for robust knowledge acquisition in LLMs.

Abstract

Low-rank adaptation is a popular parameter-efficient fine-tuning method for large language models. In this paper, we analyze the impact of low-rank updating, as implemented in LoRA. Our findings suggest that the low-rank updating mechanism may limit the ability of LLMs to effectively learn and memorize new knowledge. Inspired by this observation, we propose a new method called MoRA, which employs a square matrix to achieve high-rank updating while maintaining the same number of trainable parameters. To achieve this, we introduce corresponding non-parameter operators that reduce the input dimension and increase the output dimension for the square matrix. Furthermore, these operators ensure that the weight can be merged back into LLMs, so our method can be deployed like LoRA. We perform a comprehensive evaluation of our method across five tasks: instruction tuning, mathematical reasoning, continual pretraining, memory and pretraining. Our method outperforms LoRA on memory-intensive tasks and achieves comparable performance on other tasks.
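The parameter-budget argument in the abstract can be made concrete with a small sketch. LoRA with rank $r$ on a weight of shape `d_out x d_in` trains `(d_out + d_in) * r` parameters, so the largest square matrix that fits the same budget has side `isqrt((d_out + d_in) * r)`. The compression and decompression operators below (folding indices modulo the square size, then tiling the result) are an assumption chosen for simplicity; the text above only says that such non-parameter operators exist. All function names are hypothetical.

```python
import math
import random

def square_size(d_out, d_in, r):
    """Side of the largest square matrix within LoRA's budget of (d_out + d_in) * r parameters."""
    return math.isqrt((d_out + d_in) * r)

def compress(x, r_hat):
    """Assumed parameter-free compression: sum entries whose index is equal modulo r_hat."""
    out = [0.0] * r_hat
    for j, v in enumerate(x):
        out[j % r_hat] += v
    return out

def decompress(h, d_out):
    """Assumed parameter-free decompression: tile the r_hat outputs up to length d_out."""
    return [h[i % len(h)] for i in range(d_out)]

def mora_update(x, M, d_out):
    """Apply compress -> square matrix M -> decompress to a single input vector x."""
    c = compress(x, len(M))
    h = [sum(m * v for m, v in zip(row, c)) for row in M]
    return decompress(h, d_out)

def merged_delta(M, d_out, d_in):
    """Dense d_out x d_in update that is equivalent to the operator path above."""
    r_hat = len(M)
    return [[M[i % r_hat][j % r_hat] for j in range(d_in)] for i in range(d_out)]

# Toy sizes: LoRA rank 2 on a 64 x 64 weight gives a budget of 256 parameters.
d_out, d_in, r = 64, 64, 2
r_hat = square_size(d_out, d_in, r)

rng = random.Random(0)
M = [[rng.gauss(0, 1) for _ in range(r_hat)] for _ in range(r_hat)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

via_operators = mora_update(x, M, d_out)
delta = merged_delta(M, d_out, d_in)
via_merged = [sum(w * v for w, v in zip(row, x)) for row in delta]

# The two paths should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(via_operators, via_merged))
print(r_hat, max_gap)
```

Because both operators are fixed linear maps, the whole path collapses into one dense matrix that can be added to the frozen weight, which is the property the abstract relies on for LoRA-style deployment. For a 4096 x 4096 weight and $r=8$, the same budget calculation gives a 256 x 256 square matrix.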

Paper Structure

This paper contains 20 sections, 11 equations, 5 figures, 7 tables, and 2 algorithms.

Figures (5)

  • Figure 1: An overview of our method compared to LoRA under the same number of trainable parameters. $W$ is the frozen weight from the model. $A$ and $B$ are the trainable low-rank matrices in LoRA. $M$ is the trainable matrix in our method. Gray parts are non-parameter operators that reduce the input dimension and increase the output dimension. $r$ represents the rank in the two methods.
  • Figure 2: Performance of memorizing UUID pairs through fine-tuning with FFT and LoRA.
  • Figure 3: Performance of memorizing UUID pairs with LoRA and our method on rank 8 and 256.
  • Figure 4: Pretraining loss with LoRA and MoRA on 250M and 1B models trained from scratch. LoRA and MoRA use the same number of trainable parameters, with $r=128$. ReMoRA and ReLoRA refer to merging MoRA or LoRA back into the model during training to increase the rank of $\Delta W$.
  • Figure 5: The number of singular values $> 0.1$ in $\Delta W$ on the 250M pretraining model.
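The quantity plotted in Figure 5 (the number of singular values of $\Delta W$ above 0.1) is simple to compute. The sketch below applies it to random matrices, not trained weights, so it does not reproduce the figure; it only illustrates why a product of two rank-$r$ factors is capped at $r$ non-negligible singular values while a square matrix with the same parameter count has a higher cap. The tiled layout used for the square-matrix update is an assumption made for illustration, and the helper name is hypothetical.

```python
import numpy as np

def count_large_singular_values(delta_w, threshold=0.1):
    """Count singular values of delta_w that exceed the threshold (0.1 in Figure 5)."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return int((s > threshold).sum())

rng = np.random.default_rng(0)
d, r = 64, 2          # toy weight size and LoRA rank
r_hat = 16            # square side with r_hat**2 == 2 * d * r (same parameter count)

# LoRA-style update: product of a d x r and an r x d factor, so rank <= r.
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))
lora_delta = B @ A

# Square-matrix update expanded to d x d by tiling (assumed layout), so rank <= r_hat.
M = rng.normal(size=(r_hat, r_hat))
idx = np.arange(d) % r_hat
mora_delta = M[np.ix_(idx, idx)]

lora_count = count_large_singular_values(lora_delta)
mora_count = count_large_singular_values(mora_delta)
print(lora_count, mora_count)
```

With both updates spending 256 parameters, the LoRA-style count cannot exceed `r`, while the square-matrix count can reach `r_hat`.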