MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang
TL;DR
This paper questions the effectiveness of low-rank updates in LoRA for learning new knowledge in large language models. It introduces MoRA, a high-rank updating framework that uses a square trainable matrix supplemented by non-parameter compression/decompression operators to preserve the parameter budget and enable merging back into the original model. Across memorization, instruction tuning, reasoning, continual pretraining, and pretraining-from-scratch, MoRA demonstrates superior performance on memory-intensive tasks and competitive results on others, outperforming LoRA in several domains due to higher effective rank. The work also analyzes the impact of decompression/compression strategies and shows that high-rank updating correlates with richer ΔW spectra and lower perplexities, underscoring the practical benefits of MoRA for robust knowledge acquisition in LLMs.
Abstract
Low-rank adaptation is a popular parameter-efficient fine-tuning method for large language models. In this paper, we analyze the impact of low-rank updating, as implemented in LoRA. Our findings suggest that the low-rank updating mechanism may limit the ability of LLMs to effectively learn and memorize new knowledge. Inspired by this observation, we propose a new method called MoRA, which employs a square matrix to achieve high-rank updating while maintaining the same number of trainable parameters. To achieve this, we introduce corresponding non-parameter operators that reduce the input dimension and increase the output dimension for the square matrix. Furthermore, these operators ensure that the weight can be merged back into the LLM, so our method can be deployed like LoRA. We perform a comprehensive evaluation of our method across five tasks: instruction tuning, mathematical reasoning, continual pretraining, memory, and pretraining. Our method outperforms LoRA on memory-intensive tasks and achieves comparable performance on other tasks.
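The parameter-matching idea in the abstract can be illustrated with a small sketch. For a d x k weight, LoRA with rank r trains r(d + k) parameters, so a square matrix with the same budget can have side length up to the integer square root of that number. The compression and decompression operators below (summing input entries by index modulo the square size, and tiling the output) are assumptions chosen for illustration only, not necessarily the operators used by the authors; the function names are likewise hypothetical. Because both operators are linear and parameter-free, the update can be written as an explicit d x k matrix and merged into the original weight.

```python
import math

def lora_param_count(d, k, r):
    # LoRA trains A (r x k) and B (d x r): r * (d + k) parameters in total.
    return r * (d + k)

def mora_square_size(d, k, r):
    # Largest square side r_hat with r_hat**2 <= r * (d + k).
    return math.isqrt(r * (d + k))

def compress(x, r_hat):
    # Assumed operator: fold the input into r_hat entries by summing
    # all entries whose index is congruent modulo r_hat. No parameters.
    out = [0.0] * r_hat
    for i, v in enumerate(x):
        out[i % r_hat] += v
    return out

def decompress(h, d):
    # Assumed operator: tile the r_hat outputs until length d is reached.
    return [h[i % len(h)] for i in range(d)]

def mora_update(x, M, d):
    # Square-matrix update: decompress(M @ compress(x)).
    r_hat = len(M)
    z = compress(x, r_hat)
    h = [sum(M[i][j] * z[j] for j in range(r_hat)) for i in range(r_hat)]
    return decompress(h, d)

def merged_delta_w(M, d, k):
    # Equivalent explicit d x k update matrix for the operators above,
    # which is what allows merging into the original weight.
    r_hat = len(M)
    return [[M[i % r_hat][j % r_hat] for j in range(k)] for i in range(d)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]
```

With d = k = 4096 and r = 8, `lora_param_count` gives 65536 and `mora_square_size` gives 256, so under this budget the square matrix can reach rank up to 256 where the LoRA update is limited to rank 8. For any input, `mora_update(x, M, d)` and `matvec(merged_delta_w(M, d, k), x)` agree, which is the sense in which the update can be merged.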
