MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning

Yufei Ma, Zihan Liang, Huangyu Dai, Ben Chen, Dehong Gao, Zhuoran Ran, Wang Zihan, Linbo Jin, Wen Jiang, Guannan Zhang, Xiaoyan Cai, Libin Yang

TL;DR

MoDULA tackles the challenge of efficiently fine-tuning large language models for multiple tasks by introducing a Mixture of Domain-Specific and Universal LoRA (MoDULA) with MoE gating. It proposes two variants, MoDULA-Flan and MoDULA-Res, plus a three-stage training regime that first learns a universal expert, then domain-specific experts, and finally trains only the router, with MoDULA-Res adding a residual path to preserve general knowledge. Across open LLMs (e.g., LLaMA-2, Qwen, Yi), MoDULA outperforms MoLoRA and shows substantial training cost reductions (over 80%), with MoDULA-Res delivering the strongest improvements and better stability, particularly in domain-specific tasks like finance and e-commerce. The approach demonstrates strong pluggability for adding new domains without retraining existing experts, making it a scalable solution for multi-task fine-tuning in resource-constrained settings with improved generalization and task-specific performance.
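
The three-stage regime described above can be pictured as a schedule of which parameter groups receive gradients at each stage. The following is a minimal sketch, not the authors' code: the group names (`base`, `universal`, `domain`, `router`) and the function `freeze_plan` are hypothetical, and it assumes that groups trained in an earlier stage are frozen in later ones and that the base model stays frozen throughout, as is usual for LoRA.

```python
# Hypothetical parameter-group names; the source only names the three stages.
GROUPS = ("base", "universal", "domain", "router")

# Which group is trained in each stage, following the order in the summary:
# universal expert first, then domain-specific experts, then the router only.
STAGES = {
    1: {"universal"},
    2: {"domain"},
    3: {"router"},
}


def freeze_plan(stage):
    """Return {group: trainable?} for a stage.

    Assumption: anything not listed for the stage is frozen, including
    the base model weights and experts trained in earlier stages.
    """
    active = STAGES[stage]
    return {group: (group in active) for group in GROUPS}


if __name__ == "__main__":
    for stage in sorted(STAGES):
        print(stage, freeze_plan(stage))
```

Under this reading, adding a new domain would only require rerunning stage 2 for the new expert and stage 3 for the router, which matches the pluggability claim in spirit.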

Abstract

The growing demand for larger-scale models in the development of Large Language Models (LLMs) poses challenges for efficient training within limited computational resources. Traditional fine-tuning methods often exhibit instability in multi-task learning and rely heavily on extensive training resources. Here, we propose MoDULA (Mixture of Domain-Specific and Universal LoRA), a novel Parameter-Efficient Fine-Tuning (PEFT) Mixture-of-Experts (MoE) paradigm for improved fine-tuning and parameter efficiency in multi-task learning. The paradigm effectively improves the multi-task capability of the model by training universal experts, domain-specific experts, and routers separately. MoDULA-Res is a new method within the MoDULA paradigm, which maintains the model's general capability by connecting universal and task-specific experts through residual connections. The experimental results demonstrate that the overall performance of the MoDULA-Flan and MoDULA-Res methods surpasses that of existing fine-tuning methods on various LLMs. Notably, MoDULA-Res achieves more significant performance improvements in multiple tasks while reducing training costs by over 80% without losing general capability. Moreover, MoDULA displays flexible pluggability, allowing for the efficient addition of new tasks without retraining existing experts from scratch. This progressive training paradigm circumvents data balancing issues, enhancing training efficiency and model stability. Overall, MoDULA provides a scalable, cost-effective solution for fine-tuning LLMs with enhanced parameter efficiency and generalization capability.
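
To make the residual idea concrete, here is a small pure-Python sketch of one possible forward pass for a single linear layer. It is an illustration under stated assumptions, not the paper's formulation: the exact MoDULA-Res equations are not given in this section, so the sketch assumes the universal LoRA update is always added (the residual path), the domain-specific LoRA updates are mixed by softmax router weights, and the router reads the layer input. All function names and argument layouts are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(x, A, B):
    """Low-rank update B @ (A @ x); A is r x d_in, B is d_out x r."""
    return matvec(B, matvec(A, x))


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def modula_res_forward(x, W, universal, domains, router_W):
    """Assumed layer output: W x + universal(x) + sum_i g_i * domain_i(x).

    universal: (A, B) pair for the universal expert (always applied).
    domains:   list of (A, B) pairs, one per domain-specific expert.
    router_W:  matrix mapping x to one score per domain expert (assumption).
    """
    out = matvec(W, x)                      # frozen base projection
    uni = lora_delta(x, *universal)         # residual path: not gated
    out = [o + u for o, u in zip(out, uni)]
    gates = softmax(matvec(router_W, x))    # mixture weights over domains
    for g, (A, B) in zip(gates, domains):
        delta = lora_delta(x, A, B)
        out = [o + g * d for o, d in zip(out, delta)]
    return out, gates


if __name__ == "__main__":
    x = [1.0, 2.0]
    W = [[1.0, 0.0], [0.0, 1.0]]
    universal = ([[1.0, 0.0]], [[1.0], [0.0]])
    domains = [([[0.0, 1.0]], [[0.5], [0.5]]),
               ([[1.0, 1.0]], [[0.0], [1.0]])]
    router_W = [[1.0, 0.0], [0.0, 1.0]]
    y, gates = modula_res_forward(x, W, universal, domains, router_W)
    print(y, gates)
```

Because the universal update bypasses the router in this sketch, the general-purpose contribution is present no matter how the router distributes weight across domains, which is one way to read the claim that the residual connection preserves general capability.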

Paper Structure

This paper contains 14 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustrations of MoLoRA (a), MoDULA-Flan (b), and MoDULA-Res (c), with the router omitted.
  • Figure 2: Illustrations of the three-stage training paradigm for MoDULA-Res.
  • Figure 3: Router distributions of MoDULA-Res based on Yi-6B (left) and Qwen-14B (right) on domain-specific tasks.
  • Figure 4: The GPT-4 judge prompt for the Title-Optimization task.
  • Figure 5: The GPT-4 judge prompt for the Keyword-Recommendation task.