DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism
Dengchun Li, Naizheng Wang, Zihao Zhang, Haoyang Yin, Lei Duan, Meng Xiao, Mingjie Tang
TL;DR
DynMoLE addresses two weaknesses of routing in Mixture of LoRA Experts (MoLE): uncertain, inconsistent expert selection, and expert-selection demands that differ across transformer layers. It introduces a hybrid routing strategy that adjusts expert selection dynamically according to the Tsallis entropy of the router's probability distribution, together with an auxiliary Tsallis-entropy loss that reduces router uncertainty and balances expert loads, improving convergence and performance. On commonsense reasoning benchmarks, DynMoLE outperforms LoRA by 9.6% and MoLA by 2.3%, and ablations confirm the contribution of the entropy loss weighting, the entropic index, and the entropy threshold. The approach offers a practical, parameter-efficient way to improve MoLE-based fine-tuning of large language models such as LLaMA-2-7B.
Abstract
Instruction-based fine-tuning of large language models (LLMs) has achieved remarkable success in various natural language processing (NLP) tasks. Parameter-efficient fine-tuning (PEFT) methods such as Mixture of LoRA Experts (MoLE) combine the efficiency of Low-Rank Adaptation (LoRA) with the versatility of Mixture of Experts (MoE) models, demonstrating significant potential for handling multiple downstream tasks. However, existing routing mechanisms for MoLE often trade computational efficiency against predictive accuracy, and they fail to fully address the diverse expert selection demands across different transformer layers. In this work, we propose DynMoLE, a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. This approach mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and improved model performance. Additionally, we introduce an auxiliary loss based on Tsallis entropy that guides the router toward lower-uncertainty solutions, further improving training stability. Our extensive experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements, outperforming LoRA by 9.6% and surpassing the state-of-the-art MoLE method, MoLA, by 2.3%. We also conduct a comprehensive ablation study to evaluate the contributions of DynMoLE's key components.
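
To make the idea of entropy-based routing concrete, the following is a minimal sketch, not the authors' implementation. It uses the standard definition of Tsallis entropy, S_q(p) = (1 - sum_i p_i^q) / (q - 1), which reduces to Shannon entropy as q approaches 1. The abstract says only that expert selection is adjusted according to this entropy; the specific rule shown here (soft routing over all experts when entropy exceeds a threshold, top-k otherwise), the function names, and the default values of `q`, `threshold`, and `top_k` are assumptions chosen for illustration.

```python
import numpy as np


def tsallis_entropy(p, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1)."""
    p = np.asarray(p, dtype=float)
    if np.isclose(q, 1.0):
        # The limit q -> 1 recovers Shannon entropy (natural log).
        nz = p[p > 0]
        return float(-np.sum(nz * np.log(nz)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))


def hybrid_route(logits, q=2.0, threshold=0.5, top_k=2):
    """Hypothetical hybrid rule (an assumption, not taken from the paper):
    keep all experts when the router is uncertain, else keep the top-k."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax over the expert logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if tsallis_entropy(probs, q) > threshold:
        # High uncertainty: soft routing, every expert contributes.
        return probs

    # Low uncertainty: sparse routing over the top-k experts,
    # renormalized so the weights sum to one.
    weights = np.zeros_like(probs)
    top = np.argsort(probs)[-top_k:]
    weights[top] = probs[top]
    return weights / weights.sum()


if __name__ == "__main__":
    print(hybrid_route([0.0, 0.0, 0.0, 0.0]))   # uncertain router
    print(hybrid_route([10.0, 0.0, 0.0, 0.0]))  # confident router
```

With q = 2, a uniform distribution over four experts has entropy 0.75 and a one-hot distribution has entropy 0, so the threshold separates an undecided router from a confident one. In a real MoLE layer the returned weights would scale the outputs of the individual LoRA experts.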
