DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism

Dengchun Li, Naizheng Wang, Zihao Zhang, Haoyang Yin, Lei Duan, Meng Xiao, Mingjie Tang

TL;DR

DynMoLE tackles inconsistent expert selection and layer-wise routing demands in Mixture of LoRA Experts by introducing a dynamic, Tsallis-entropy-based routing mechanism. It combines a hybrid routing strategy with an auxiliary Tsallis-entropy loss to reduce router uncertainty and balance expert loads, improving convergence and performance. Empirical results on commonsense reasoning benchmarks show DynMoLE outperforming LoRA by 9.6% and MoLA by 2.3%, with ablations confirming the effectiveness of entropy weighting, entropic index, and threshold choices. The approach offers a practical, parameter-efficient method to enhance MoLE-based fine-tuning for large language models such as LLaMA-2-7B.

Abstract

Instruction-based fine-tuning of large language models (LLMs) has achieved remarkable success in various natural language processing (NLP) tasks. Parameter-efficient fine-tuning (PEFT) methods, such as Mixture of LoRA Experts (MoLE), combine the efficiency of Low-Rank Adaptation (LoRA) with the versatility of Mixture of Experts (MoE) models, demonstrating significant potential for handling multiple downstream tasks. However, the existing routing mechanisms for MoLE often involve a trade-off between computational efficiency and predictive accuracy, and they fail to fully address the diverse expert selection demands across different transformer layers. In this work, we propose DynMoLE, a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. This approach mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and improved model performance. Additionally, we introduce an auxiliary loss based on Tsallis entropy to further guide the model toward convergence with reduced uncertainty, thereby improving training stability and performance. Our extensive experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements, outperforming LoRA by 9.6% and surpassing the state-of-the-art MoLE method, MoLA, by 2.3%. We also conduct a comprehensive ablation study to evaluate the contributions of DynMoLE's key components.
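The entropy-gated routing described in the abstract can be illustrated with a minimal sketch. It uses the standard Tsallis entropy, S_q(p) = (1 - Σ p_i^q) / (q - 1), and assumes one plausible gating rule: when the router's entropy exceeds a threshold (the router is uncertain), select experts by Top-P; otherwise select by Top-K. The gating rule, the function names, and the default values for `q`, `threshold`, `k`, and `p` are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tsallis_entropy(probs: List[float], q: float = 1.5) -> float:
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).

    Falls back to Shannon entropy in the limit q -> 1.
    """
    if abs(q - 1.0) < 1e-8:
        return -sum(p * math.log(p) for p in probs if p > 0.0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def top_k(probs: List[float], k: int) -> List[int]:
    """Indices of the k most probable experts."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def top_p(probs: List[float], p: float) -> List[int]:
    """Smallest prefix of experts (by probability) whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


def hybrid_route(
    logits: List[float],
    q: float = 1.5,          # entropic index (assumed value)
    threshold: float = 0.5,  # entropy threshold (assumed value)
    k: int = 2,
    p: float = 0.75,
) -> Tuple[Dict[int, float], float]:
    """Hypothetical hybrid routing: Top-P when uncertain, Top-K otherwise.

    Returns renormalized weights for the selected experts and the entropy.
    """
    probs = softmax(logits)
    entropy = tsallis_entropy(probs, q)
    experts = top_p(probs, p) if entropy > threshold else top_k(probs, k)
    mass = sum(probs[i] for i in experts)
    weights = {i: probs[i] / mass for i in experts}
    return weights, entropy


# A confident router keeps the fixed Top-2 selection; a uniform router
# crosses the threshold and widens its selection through Top-P.
confident_weights, confident_entropy = hybrid_route([8.0, 0.0, 0.0, 0.0])
uniform_weights, uniform_entropy = hybrid_route([0.0, 0.0, 0.0, 0.0])
```

In this sketch the number of active experts varies per token: a peaked distribution activates a fixed small set, while a flat distribution activates as many experts as needed to cover the probability mass `p`.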

Paper Structure

This paper contains 33 sections, 33 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Visualized motivation of DynMoLE. We propose a hybrid routing mechanism for DynMoLE to address the challenges illustrated here.
  • Figure 2: Tsallis entropy provides a more stable optimization process than Shannon entropy by reducing the impact of low-probability events.
  • Figure 3: Comparison of three routing strategies: (a) classic Top-K routing, shown here with Top-2 as an example; (b) classic Top-P routing, where the blue bars represent the cumulative sum of the highest probabilities; and (c) DynMoLE hybrid routing, where the green bars represent the entropy values of different probability distributions.
  • Figure 4:
  • Figure 5:
  • ...and 12 more figures
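
One way to see the claim in the Figure 2 caption is to compare how a single low-probability event contributes to each entropy. The derivation below uses the standard definitions of Shannon and Tsallis entropy (with the usual constant set to 1) and assumes an entropic index q > 1. It is a supporting sketch, not necessarily the argument the paper itself makes.

```latex
% Shannon and Tsallis entropy of a router distribution p = (p_1, ..., p_n)
H(p) = -\sum_{i=1}^{n} p_i \ln p_i ,
\qquad
S_q(p) = \frac{1}{q-1}\Bigl(1 - \sum_{i=1}^{n} p_i^{\,q}\Bigr)
       = \sum_{i=1}^{n} \frac{p_i - p_i^{\,q}}{q-1}
\quad \text{(using } \textstyle\sum_i p_i = 1\text{)} .

% Sensitivity of each per-event term to its own probability
\frac{\partial}{\partial p_i}\bigl(-p_i \ln p_i\bigr) = -\ln p_i - 1
  \;\xrightarrow[\;p_i \to 0\;]{}\; +\infty ,
\qquad
\frac{\partial}{\partial p_i}\,\frac{p_i - p_i^{\,q}}{q-1}
  = \frac{1 - q\,p_i^{\,q-1}}{q-1}
  \;\xrightarrow[\;p_i \to 0\;]{}\; \frac{1}{q-1}
\quad (q > 1) .
```

The Shannon term has an unbounded gradient as an expert's probability approaches zero, whereas the Tsallis term stays bounded by 1/(q-1), so rarely selected experts perturb an entropy-based loss less. In the limit q → 1, S_q recovers the Shannon entropy.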