MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning

Pengjie Ren; Chengshun Shi; Shiguang Wu; Mengqi Zhang; Zhaochun Ren; Maarten de Rijke; Zhumin Chen; Jiahuan Pei

TL;DR

MELoRA is a parameter-efficient fine-tuning method that freezes the pretrained weights and trains a group of $n$ mini LoRAs in parallel, each of rank $r$. Concatenating them along the diagonal forms a block-diagonal update with an equivalent rank of $n \times r$ while using $n$ times fewer trainable parameters than a standard LoRA of the same rank. This higher effective rank at a smaller parameter budget is intended to improve generalization. Empirical results on GLUE (RoBERTa-base) and INSTRUCTEVAL (Llama-2-7B) show MELoRA outperforming LoRA and other baselines, especially on low-data and instruction-following tasks, with up to 36x fewer trainable parameters. Analyses indicate that the equivalent rank is a key driver of performance and that the best number of mini LoRAs $n$ and mini-LoRA rank $r$ depend on the dataset, highlighting MELoRA’s flexibility and efficiency for large-scale PEFT.
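
The rank and parameter-count claims above can be made concrete with a short worked calculation. This is a sketch under an added assumption, not a statement from the page: the adapted weight is square, $d \times d$, and $d$ is divisible by $n$. It uses the TL;DR's notation, where each mini LoRA has rank $r$ (the Figure 2 caption instead splits a total rank $r$ into $r/n$ per module).

$$
\Delta W = BA = \begin{pmatrix} B_1 A_1 & & \\ & \ddots & \\ & & B_n A_n \end{pmatrix}, \qquad A_i \in \mathbb{R}^{r \times \frac{d}{n}}, \quad B_i \in \mathbb{R}^{\frac{d}{n} \times r}
$$

$$
\operatorname{rank}(\Delta W) = \sum_{i=1}^{n} \operatorname{rank}(B_i A_i) \le n r, \qquad \underbrace{n \cdot 2 r \tfrac{d}{n} = 2 d r}_{\text{MELoRA parameters}} \quad \text{versus} \quad \underbrace{2 d \,(n r)}_{\text{LoRA of rank } nr}
$$

The rank of a block-diagonal matrix is the sum of the ranks of its blocks, so the bound $n r$ is reached whenever every block $B_i A_i$ has full rank $r$. The ratio of the two parameter counts is exactly $n$.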

Abstract

Parameter-efficient fine-tuning (PEFT) is a popular method for tailoring pre-trained large language models (LLMs), especially as the models' scale and the diversity of tasks increase. Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional, i.e., significant model changes can be represented with relatively few parameters. However, decreasing the rank can lead to larger generalization errors on specific tasks compared to full-parameter fine-tuning. We present MELoRA, a mini-ensemble of low-rank adapters that uses fewer trainable parameters while maintaining a higher rank, thereby offering improved performance potential. The core idea is to freeze the original pretrained weights and train a group of mini LoRAs, each with only a small number of parameters. This can capture a significant degree of diversity among the mini LoRAs, thus promoting better generalization ability. We conduct a theoretical analysis and empirical studies on various NLP tasks. Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction-following tasks, which demonstrates the effectiveness of MELoRA.
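
To illustrate the claim of "fewer trainable parameters while maintaining a higher rank", the following is a minimal NumPy sketch, not the authors' implementation. The function name `melora_delta`, the square $d \times d$ weight, and the random Gaussian blocks are all assumptions made for illustration (an actual LoRA-style adapter would typically start with a zero update; random blocks are used here only so the rank is visible).

```python
import numpy as np


def melora_delta(d, n, r, rng):
    """Hypothetical helper: build the equivalent block-diagonal update
    from n mini LoRAs of rank r on an assumed square d x d weight."""
    assert d % n == 0, "d must be divisible by n in this sketch"
    k = d // n  # width of each diagonal block
    B_blocks = [rng.standard_normal((k, r)) for _ in range(n)]
    A_blocks = [rng.standard_normal((r, k)) for _ in range(n)]

    delta = np.zeros((d, d))
    for i, (B_i, A_i) in enumerate(zip(B_blocks, A_blocks)):
        s = slice(i * k, (i + 1) * k)
        delta[s, s] = B_i @ A_i  # off-diagonal blocks stay zero

    # Only the mini-LoRA factors are trainable.
    n_params = sum(B.size + A.size for B, A in zip(B_blocks, A_blocks))
    return delta, n_params


rng = np.random.default_rng(0)
d, n, r = 64, 4, 2
delta, melora_params = melora_delta(d, n, r, rng)

equiv_rank = np.linalg.matrix_rank(delta)
lora_params_same_rank = 2 * d * (n * r)  # standard LoRA with rank n * r

print("equivalent rank:", equiv_rank)
print("MELoRA params:", melora_params)                        # 256
print("LoRA params at the same rank:", lora_params_same_rank)  # 1024
```

With these illustrative sizes the mini LoRAs hold 256 trainable values, against 1024 for a standard LoRA of rank $n \times r = 8$, which is the factor-of-$n$ saving; the rank check recovers $n \times r$ for generic random blocks.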

Paper Structure (24 sections, 3 equations, 8 figures, 8 tables)

Introduction
Related Work
Adaptive Rank
Customized Update Strategies
Methodology
Preliminaries on Low-Rank Adapter
Matrix Rank Theory
Mini-Ensemble Low-Rank Adapter
Experimental Setups
Baselines
Datasets
Implementation Details
Results
Performance on GLUE
Performance on INSTRUCTEVAL
...and 9 more sections

Figures (8)

Figure 1: Comparison between LoRA (left) and the proposed MELoRA (right). The core idea of MELoRA is to freeze the original pretrained weights and train a group of mini LoRAs in parallel, each with only a small number of parameters.
Figure 2: An illustration of how MELoRA adopts a group of mini LoRA modules to obtain the sparse equivalent matrices $B$ and $A$. $x \in \mathbb{R}^{d}$ denotes a representation with $d$ dimensions, $A_i \in \mathbb{R}^{\frac{r}{n} \times \frac{d}{n}}$, $B_i \in \mathbb{R}^{\frac{d}{n} \times \frac{r}{n}}$ ($r \ll d$), and 0 denotes zero matrices requiring no training.
Figure 3: The sum of singular values $> 0.1$ of $B\times A$ in LoRA and equivalent $B\times A$ in MELoRA.
Figure 4: Performance with different numbers of mini LoRAs $n$ and fixed rank $r$ on different datasets. We report the same metrics as in the main NLU results table. More results can be found in the appendix.
Figure 5: Performance of LoRA and MELoRA with different ranks $r$ and fixed $n$ on different datasets. More results can be found in the appendix.
...and 3 more figures
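
The Figure 2 caption describes mini LoRA modules that together form sparse equivalent $B$ and $A$. The sketch below, which is an illustration and not the authors' code, checks that equivalence numerically using the caption's shapes ($A_i \in \mathbb{R}^{\frac{r}{n} \times \frac{d}{n}}$, $B_i \in \mathbb{R}^{\frac{d}{n} \times \frac{r}{n}}$). The concrete sizes, the random values, and the assumption that each module acts on its own contiguous chunk of $x$ are choices made here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 12, 6, 3           # illustrative sizes; d and r divisible by n
dk, rk = d // n, r // n      # per-module input width and rank

# One (A_i, B_i) pair per mini LoRA, with the shapes from the caption.
A = [rng.standard_normal((rk, dk)) for _ in range(n)]
B = [rng.standard_normal((dk, rk)) for _ in range(n)]
x = rng.standard_normal(d)

# Mini-LoRA view: each module transforms only its own chunk of x
# (assumed contiguous split), and the outputs are concatenated.
chunks = np.split(x, n)
y_mini = np.concatenate(
    [B_i @ (A_i @ x_i) for B_i, A_i, x_i in zip(B, A, chunks)]
)

# Equivalent view: sparse block-diagonal A and B applied to the full x.
A_eq = np.zeros((r, d))
B_eq = np.zeros((d, r))
for i in range(n):
    A_eq[i * rk:(i + 1) * rk, i * dk:(i + 1) * dk] = A[i]
    B_eq[i * dk:(i + 1) * dk, i * rk:(i + 1) * rk] = B[i]
y_eq = B_eq @ (A_eq @ x)

# Both views should give the same adapter output.
print(np.allclose(y_mini, y_eq))
```

The entries of `A_eq` and `B_eq` outside the diagonal blocks are the zero matrices from the caption: they are never trained, which is why the trainable parameter count depends only on the mini modules.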
