Higher Layers Need More LoRA Experts

Chongyang Gao; Kezhen Chen; Jinmeng Rao; Baochen Sun; Ruibo Liu; Daiyi Peng; Yawen Zhang; Xiaoyuan Guo; Jie Yang; VS Subrahmanian

TL;DR

The paper addresses the efficiency-performance trade-off in fine-tuning large language models by combining Mixture-of-Experts (MoE) with LoRA adapters and allocating a different number of experts to each layer (MoLA). It introduces four allocation configurations, of which MoLA-$\triangledown$ (inverted triangle) performs best while using substantially fewer trainable parameters than a configuration with the same number of experts in every layer. Across six NLP and commonsense QA benchmarks, MoLA variants match or outperform the PEFT baselines and show favorable continuous-learning behavior, suggesting robust generalization and adaptability. The work provides a plug-and-play PEFT approach that reduces training cost and offers insight into layer-wise expert redundancy, identifying the higher layers as the ones that benefit most from additional experts.

Abstract

Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA) offer training efficiency on Large Language Models, but their impact on model performance remains limited. Recent efforts integrate LoRA and Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this also hold for parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models, where each model layer has the flexibility to employ a varying number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines. We find that, for a fixed total number of experts, allocating more LoRA experts to higher layers further enhances model effectiveness. With far fewer parameters, this allocation strategy outperforms the setting with the same number of experts in every layer. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code is available at https://github.com/GCYZSL/MoLA.
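The layer-wise allocation idea can be illustrated with a small sketch. It is not taken from the MoLA repository. It assumes a 32-layer Transformer split into four equal groups of consecutive layers, and it reads each four-digit configuration quoted in the figure captions (2468, 8642, 8228, 5555, 8888) as the number of experts per layer in each group, from the lowest group to the highest. The names `expand_allocation`, `total_experts`, and `CONFIGS` are hypothetical.

```python
from typing import Dict, List


def expand_allocation(group_experts: List[int], num_layers: int = 32) -> List[int]:
    """Expand per-group expert counts (e.g. [2, 4, 6, 8]) to one count per layer.

    Assumption: layers are split into equal-sized consecutive groups,
    ordered from the lowest layers to the highest.
    """
    if num_layers % len(group_experts) != 0:
        raise ValueError("num_layers must be divisible by the number of groups")
    layers_per_group = num_layers // len(group_experts)
    return [n for n in group_experts for _ in range(layers_per_group)]


def total_experts(group_experts: List[int], num_layers: int = 32) -> int:
    """Total number of LoRA experts summed over all layers."""
    return sum(expand_allocation(group_experts, num_layers))


# Hypothetical labels for the configurations named in the figure captions.
CONFIGS: Dict[str, List[int]] = {
    "triangle_8642": [8, 6, 4, 2],
    "inverted_triangle_2468": [2, 4, 6, 8],
    "hourglass_8228": [8, 2, 2, 8],
    "rectangle_5555": [5, 5, 5, 5],
    "rectangle_8888": [8, 8, 8, 8],
}

totals = {name: total_experts(cfg) for name, cfg in CONFIGS.items()}
for name, count in totals.items():
    print(f"{name}: {count} experts in total")
```

Under these assumptions, 2468, 8642, 8228, and 5555 all use 160 experts in total, while 8888 uses 256. If every expert has the same rank, the number of trainable parameters scales with these totals, which is one way to read the claim that the inverted-triangle allocation needs far fewer parameters than the same-count-everywhere setting.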

Paper Structure (28 sections, 6 equations, 5 figures, 4 tables)


Introduction
Preliminaries
Mixture of Experts
LoRA
MoE-LoRA with Layer-wise Allocation
The MoLA Architecture
Configurations of Layer-wise Expert Allocation
MoLA Triangle (MoLA-$\triangle$)
MoLA Inverted-Triangle (MoLA-$\triangledown$)
MoLA Hourglass (MoLA-$\bowtie$)
MoLA Rectangle (MoLA-$\square$)
Experiments
Experiment Settings
Task and Data
Recent Competitive Baselines
...and 13 more sections

Figures (5)

Figure 1: Overview of the MoLA architecture. MoLA applies LoRA-MoE to a pre-trained Transformer model with layer-wise expert allocation. Each layer employs a different number of experts. During training, the pre-trained weights are frozen and only the LoRA experts are tuned as adapters on those weights.
Figure 2: Four types of layer-wise expert allocations of MoLA.
Figure 3: Average Frobenius norm between the weight matrices of two different experts, for each self-attention module in each layer. The top figure is for MoLA-$\square$ (8888), and the bottom figure is for MoLA-$\square$ (5555). Both models are trained via instruction tuning.
Figure 4: Average Frobenius norm between the weight matrices of two different experts in the same layer, for each self-attention module. The top figure is for MoLA-$\triangledown$ with configuration 2468; the middle figure is for MoLA-$\triangle$ with configuration 8642; and the bottom figure is for MoLA-$\bowtie$ with configuration 8228, all after instruction tuning.
Figure 5: (a) The average fusion weight for each expert. (b) The average number of times each expert is selected.
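The redundancy measure named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative reading, not the authors' code. It assumes that the "Frobenius norm between two experts" means the Frobenius norm of the difference of their weight matrices, averaged over all unordered pairs of experts in one module. The function names are hypothetical, and matrices are plain nested lists to keep the example dependency-free.

```python
import math
from itertools import combinations
from typing import List

Matrix = List[List[float]]


def frobenius_distance(a: Matrix, b: Matrix) -> float:
    """Frobenius norm of (a - b); both matrices are assumed to share a shape."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def mean_pairwise_distance(experts: List[Matrix]) -> float:
    """Average Frobenius distance over all unordered pairs of experts."""
    pairs = list(combinations(experts, 2))
    if not pairs:
        raise ValueError("need at least two experts")
    return sum(frobenius_distance(a, b) for a, b in pairs) / len(pairs)


# Toy example: experts 0 and 2 are identical, expert 1 differs from both.
experts = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 4.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
score = mean_pairwise_distance(experts)
print(f"mean pairwise Frobenius distance: {score:.4f}")
```

Under this reading, a lower average suggests that the experts in a module have learned similar weights and are therefore more redundant, while a higher average suggests more diverse experts.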

Theorems & Definitions (1)

Definition 5.1
