MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models

Jie Cao, Tianwei Lin, Bo Yuan, Rolan Yan, Hongyang He, Wenqiao Zhang, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang

TL;DR

This work targets two problems that limit homogeneous MoE-LoRA designs for parameter-efficient fine-tuning (PEFT) of large language models: representation collapse and expert load imbalance. It introduces Mixture-of-Adapters (MoA), a heterogeneous ensemble of PEFT adapters with token-level routing, and two practical variants: Soft MoA (soft fusion via a sigmoid router) and Sparse MoA (a learnable per-token threshold for selecting active experts, $\Gamma = \Gamma_{\max}\,\mathrm{Sigmoid}(\boldsymbol{W}_{\Gamma}^{T}\boldsymbol{x} + \boldsymbol{b}_{\Gamma})$). The MoA framework assembles diverse adapters (including five LoRA modules, FFN Parallel Adapters, and a zero-initialized Prompt Tuning module) to promote specialization and efficient knowledge transfer, achieving higher accuracy and better resource efficiency than state-of-the-art homogeneous MoE-LoRA baselines on math, commonsense, and code-generation tasks. Across multiple foundation models, Soft MoA and Sparse MoA train faster, use less memory, and have lower inference latency while using far fewer trainable parameters, underscoring the practical impact of architectural heterogeneity in PEFT for LLMs.
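To make the Sparse MoA threshold concrete, here is a minimal sketch of the per-token gate in plain Python. The formula for the threshold comes from the summary above; everything else is an assumption made for illustration: the function name `sparse_gate`, the parameter names, the use of one sigmoid score per expert, and in particular the rule that an expert is active when its router weight is at least the threshold. The paper's actual selection rule may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sparse_gate(x, router_w, router_b, thr_w, thr_b, gamma_max):
    """Return (weights, gamma, mask) for a single token vector x.

    router_w / router_b: one weight vector and bias per expert (hypothetical names).
    thr_w / thr_b: parameters W_Gamma and b_Gamma of the threshold function.
    """
    # One sigmoid router weight per expert.
    weights = [sigmoid(dot(w, x) + b) for w, b in zip(router_w, router_b)]
    # Token-level threshold: Gamma = Gamma_max * Sigmoid(W_Gamma^T x + b_Gamma).
    gamma = gamma_max * sigmoid(dot(thr_w, x) + thr_b)
    # ASSUMPTION: an expert is kept when its weight reaches the threshold.
    mask = [w >= gamma for w in weights]
    return weights, gamma, mask


# Three experts, a 2-dimensional token, and a threshold capped at 0.5.
weights, gamma, mask = sparse_gate(
    x=[0.0, 0.0],
    router_w=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    router_b=[0.0, 2.0, -2.0],
    thr_w=[0.0, 0.0],
    thr_b=0.0,
    gamma_max=0.5,
)
print(gamma, mask)
```

Because the threshold is bounded by `gamma_max`, the gate can never demand a router weight above that cap, which is one plausible reading of why the formula scales the sigmoid.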

Abstract

Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ homogeneous MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a heterogeneous Mixture-of-Adapters (MoA) approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: (i) Soft MoA achieves fine-grained integration by performing a weighted fusion of all expert outputs; (ii) Sparse MoA activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at https://github.com/DCDmllm/MoA.
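The weighted fusion of structurally different experts in Soft MoA can be illustrated with a small sketch in plain Python. It is not the authors' implementation: the helper names (`lora_expert`, `parallel_adapter`, `soft_moa`), the ReLU bottleneck, the unnormalized sigmoid weights, and the choice to add the fused update to a frozen base output are all assumptions made for illustration.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lora_expert(a, b):
    """Low-rank update x -> B(Ax); a has shape r x d, b has shape d x r."""
    return lambda x: matvec(b, matvec(a, x))


def parallel_adapter(down, up):
    """Bottleneck adapter x -> up(ReLU(down x)) (assumed form)."""
    return lambda x: matvec(up, [max(0.0, h) for h in matvec(down, x)])


def soft_moa(x, base_out, experts, router_w, router_b):
    """Add every expert's output to base_out, scaled by its router weight."""
    # One sigmoid weight per expert; the weights are not forced to sum to 1.
    scores = matvec(router_w, x)
    weights = [sigmoid(s + b) for s, b in zip(scores, router_b)]
    out = list(base_out)
    for w, expert in zip(weights, experts):
        delta = expert(x)
        out = [o + w * d for o, d in zip(out, delta)]
    return out, weights


# Two heterogeneous experts acting on a 2-dimensional token.
experts = [
    lora_expert(a=[[1.0, 0.0]], b=[[1.0], [0.0]]),
    parallel_adapter(down=[[0.0, 1.0]], up=[[0.0], [1.0]]),
]
out, weights = soft_moa(
    x=[1.0, 2.0],
    base_out=[0.0, 0.0],
    experts=experts,
    router_w=[[0.0, 0.0], [0.0, 0.0]],
    router_b=[0.0, 0.0],
)
print(out, weights)
```

Every expert contributes to every token here, which is what distinguishes the soft variant from the sparse one: the router only rescales contributions, it never drops them.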

Paper Structure

This paper contains 41 sections, 11 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: MoA architecture with heterogeneous PEFT adapters. It is worth noting that in Sparse MoA, the Prompt Tuning module is deactivated due to its non-token-level activation mechanism.
  • Figure 2: Comparison of different models under identical batch sizes (1 × 48 GB GPU): (a) Training time per epoch, (b) GPU memory consumption during training, and (c) Average inference time per sample.
  • Figure 3: Comparison of router weight distributions between Soft MoA and AdaMoLE on the BoolQ test set under different random seeds. The MoA method exhibits strong consistency, whereas AdaMoLE does not.
  • Figure 4: Visualization of average router weights per layer for Soft MoA (left) and Sparse MoA (right) on ARC-Challenge, averaged over tokens within 50 samples. Sparse MoA also includes the average per-layer threshold and the average count of activated experts. The average count of activated experts across layers in Sparse MoA is 3.55.
  • Figure 5: Visualization of average router weights per layer for Soft MoA (left) and Sparse MoA (right) on GSM8K, averaged over tokens within 50 samples. The Sparse MoA plot (right) also includes the average per-layer threshold and the average count of activated experts. The average count of activated experts across layers in Sparse MoA is 4.13.
  • ...and 3 more figures