Higher Layers Need More LoRA Experts
Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, VS Subrahmanian
TL;DR
The paper addresses the efficiency-performance trade-off in fine-tuning large language models by combining Mixture-of-Experts (MoE) with LoRA adapters and allocating experts layer by layer (MoLA). It introduces four allocation configurations, of which MoLA-$\triangledown$ (inverted triangle, with more experts in higher layers) performs best while using substantially fewer trainable parameters than a configuration with the same number of experts in every layer. Across six NLP and commonsense QA benchmarks, MoLA variants match or outperform the PEFT baselines and show favorable continuous-learning behavior, which suggests robust generalization and adaptability. The work provides a plug-and-play PEFT approach that reduces training costs and offers insight into layer-wise expert redundancy, pointing to higher layers as the key leverage points for performance gains.
Abstract
Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA) offer training efficiency on Large Language Models, but their impact on model performance remains limited. Recent efforts integrate LoRA and Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this observation also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models, in which each model layer can employ a different number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines. We find that, for a fixed total number of experts, allocating more LoRA experts to higher layers further enhances model effectiveness. With far fewer parameters, this allocation strategy outperforms the setting with the same number of experts in every layer. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code is available at https://github.com/GCYZSL/MoLA.
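To make the idea of layer-wise expert allocation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 32-layer model split into four equal groups, a hidden size of 4096, a LoRA rank of 8, and the per-group expert counts shown; all of these numbers and the function names `allocate_experts` and `lora_expert_params` are illustrative assumptions. Router parameters are ignored, and only one adapted weight matrix per layer is counted.

```python
def allocate_experts(num_layers, experts_per_group):
    """Split layers into equal contiguous groups (lowest layers first)
    and give every layer the expert count of its group."""
    groups = len(experts_per_group)
    if num_layers % groups != 0:
        raise ValueError("num_layers must be divisible by the number of groups")
    size = num_layers // groups
    return [experts_per_group[i // size] for i in range(num_layers)]


def lora_expert_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA expert: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


NUM_LAYERS = 32  # assumed model depth (hypothetical)

# Inverted-triangle allocation: fewer experts in lower layers, more in higher layers.
inverted_triangle = allocate_experts(NUM_LAYERS, [2, 4, 6, 8])  # assumed counts
# Uniform allocation: the same number of experts in every layer.
uniform = allocate_experts(NUM_LAYERS, [8, 8, 8, 8])            # assumed counts

per_expert = lora_expert_params(d_in=4096, d_out=4096, rank=8)  # assumed sizes

print("experts (inverted triangle):", sum(inverted_triangle))
print("experts (uniform):", sum(uniform))
print("LoRA params (inverted triangle):", sum(inverted_triangle) * per_expert)
print("LoRA params (uniform):", sum(uniform) * per_expert)
```

Under these assumptions, the inverted-triangle allocation uses fewer experts overall than the uniform one, which is the sense in which a layer-wise allocation can reduce trainable parameters while concentrating capacity in higher layers.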
