AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts

Zefang Liu, Jiahua Luo

TL;DR

AdaMoLE tackles the inefficiency of static expert activation in mixture-of-experts fine-tuning for large language models by introducing a dynamic thresholding mechanism that adaptively selects LoRA experts based on input context. The approach replaces a single LoRA with multiple experts and integrates a threshold network with a gating function to balance accuracy and efficiency. Empirical results across commonsense reasoning and NLP benchmarks show AdaMoLE achieving higher accuracy than LoRA and MoLE baselines, with analyses of threshold sensitivity and expert activation guiding optimal settings. The work demonstrates the potential of adaptive expert selection to enhance LLM fine-tuning without increasing the total number of experts, suggesting promising directions for future adaptive MoE research.

Abstract

We introduce AdaMoLE, a novel method for fine-tuning large language models (LLMs) through an Adaptive Mixture of Low-Rank Adaptation (LoRA) Experts. Moving beyond conventional methods that employ a static top-k strategy for activating experts, AdaMoLE dynamically adjusts the activation threshold using a dedicated threshold network, adaptively responding to the varying complexities of different tasks. By replacing a single LoRA in a layer with multiple LoRA experts and integrating a gating function with the threshold mechanism, AdaMoLE effectively selects and activates the most appropriate experts based on the input context. Our extensive evaluations across a variety of commonsense reasoning and natural language processing tasks show that AdaMoLE exceeds baseline performance. This enhancement highlights the advantages of AdaMoLE's adaptive selection of LoRA experts, improving model effectiveness without a corresponding increase in the expert count. The experimental validation not only confirms AdaMoLE as a robust approach for enhancing LLMs but also suggests valuable directions for future research in adaptive expert selection mechanisms, potentially broadening the scope for optimizing model performance across diverse language processing tasks.
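The abstract describes the mechanism only in words. As a rough illustration, the sketch below shows one way a gating function and a threshold network could combine to select LoRA experts for a single input vector. Several details are assumptions rather than statements from this summary: the gate is a softmax over expert scores, the threshold is $\tau = \tau_{\max} \cdot \sigma(\cdot)$ so that it stays in $[0, \tau_{\max}]$, each expert is weighted by $\max(p_i - \tau, 0)$, the weights are not renormalized, and LoRA scaling is omitted. All function and variable names are hypothetical, and the weights are random placeholders. It is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def adaptive_expert_weights(x, gate_w, thresh_w, thresh_b, tau_max):
    """Return (per-expert weights, tau) for one input vector x.

    Assumed form: p = softmax(gate_w x), tau = tau_max * sigmoid(thresh_w . x + b),
    weight_i = max(p_i - tau, 0). An expert is active when its weight is positive.
    """
    gate = softmax(matvec(gate_w, x))
    logit = sum(w * v for w, v in zip(thresh_w, x)) + thresh_b
    tau = tau_max / (1.0 + math.exp(-logit))
    weights = [max(p - tau, 0.0) for p in gate]
    return weights, tau


def adamole_forward(x, frozen_w, experts, gate_w, thresh_w, thresh_b, tau_max):
    """Frozen linear layer plus a threshold-gated sum of LoRA experts (A, B)."""
    weights, tau = adaptive_expert_weights(x, gate_w, thresh_w, thresh_b, tau_max)
    out = matvec(frozen_w, x)  # pre-trained path, never updated
    for weight, (lora_a, lora_b) in zip(weights, experts):
        if weight > 0.0:  # experts at or below the threshold are skipped
            delta = matvec(lora_b, matvec(lora_a, x))  # low-rank update B(Ax)
            out = [o + weight * d for o, d in zip(out, delta)]
    return out, weights, tau


def random_matrix(rng, rows, cols):
    """Placeholder weights drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_out, rank, n_experts = 4, 3, 2, 4
x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]
frozen_w = random_matrix(rng, d_out, d_in)
experts = [
    (random_matrix(rng, rank, d_in), random_matrix(rng, d_out, rank))
    for _ in range(n_experts)
]
gate_w = random_matrix(rng, n_experts, d_in)
thresh_w = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]

out, weights, tau = adamole_forward(
    x, frozen_w, experts, gate_w, thresh_w, 0.0, tau_max=1.0 / n_experts
)
active = sum(1 for w in weights if w > 0.0)
print(f"tau={tau:.3f}, active experts={active} of {n_experts}")
```

Under these assumptions, choosing $\tau_{\max} = 1/N$ keeps at least one expert active for any finite threshold logit, because the largest softmax probability is at least $1/N$ while $\tau$ stays strictly below it. Unlike a fixed top-k rule, the number of active experts can then differ from one input to the next.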

Paper Structure

This paper contains 18 sections, 7 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Illustration of Adaptive Mixture of Low-Rank Adaptation Experts (AdaMoLE). AdaMoLE employs a gating function alongside a threshold function to determine the activation of experts. In the training phase, pre-trained weights are frozen while the LoRA experts and the two functions (gating and threshold) are updated.
  • Figure 2: Numbers of activated LoRA experts in AdaMoLE for four weight matrices in the self-attention module of each layer, where $\tau$ is the threshold for expert activation and $N$ is the number of experts in one MoE module.
  • Figure 3: Averaged numbers of activated LoRA experts in AdaMoLE for each layer with different upper bounds $\tau_{\max}$, where the expert activation threshold $\tau \in [0, \tau_{\max}]$ and $N$ is the number of experts in one MoE module.
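The quantity plotted in Figures 2 and 3, the number of activated experts, can be illustrated with a toy calculation. The sketch below is a simplification under stated assumptions: an expert counts as active when its gate probability is strictly above $\tau$, the gate distributions are made-up values for three tokens with $N = 4$, and $\tau$ is held fixed across tokens, whereas AdaMoLE produces it from a threshold network. The helper names are hypothetical.

```python
def count_active(gate_probs, tau):
    """Number of experts whose gate probability is strictly above tau."""
    return sum(1 for p in gate_probs if p > tau)


def mean_active(batch_gate_probs, tau):
    """Average activation count over a batch of per-token gate distributions."""
    counts = [count_active(probs, tau) for probs in batch_gate_probs]
    return sum(counts) / len(counts)


# Hypothetical gate distributions for three tokens over N = 4 experts.
batch = [
    [0.40, 0.30, 0.20, 0.10],  # moderately peaked
    [0.70, 0.10, 0.10, 0.10],  # one dominant expert
    [0.25, 0.25, 0.25, 0.25],  # uniform
]

for tau in (0.0, 0.15, 0.25, 0.5):
    print(f"tau={tau:.2f}  mean active experts={mean_active(batch, tau):.2f}")
```

In this toy setting, raising the threshold lowers the average count, and a threshold of $1/N$ or more can leave a token with a uniform gate distribution without any active expert.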