DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models

Yuxuan Zhang, Ruizhe Li

TL;DR

DLP-LoRA addresses the inefficiency of dynamic LoRA fusion for multi-task adaptation in LLMs by introducing a 5M-parameter mini-MLP plugin that selects LoRAs at the sentence level via top-$p$ sampling. This replaces heavier token-level MoE routers and allows multiple LoRAs to be fused in parallel, keeping inference time below twice that of single-LoRA inference. Across 26 tasks (multiple-choice and QA) on multiple backbones, the approach reports an average accuracy of 92.34% on the multiple-choice datasets and improved BLEU/ROUGE scores on the QA datasets, and its lightweight routing and parallelization keep it efficient as the number of LoRAs grows. These results suggest that dynamic, context-aware LoRA fusion at the sentence level is a practical strategy for flexible, resource-efficient multi-task adaptation of LLMs in real-world settings.

Abstract

Recent advancements in Large Language Models (LLMs) have achieved robust performance across diverse tasks, but fine-tuning these models for specific domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a small subset of parameters. However, existing methods for fusing multiple LoRAs lack dynamic fusion based on contextual inputs and often increase inference time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight Plugin that employs a mini-MLP module with only 5M parameters to dynamically fuse multiple LoRAs at the sentence level using top-$p$ sampling strategies. This approach reduces inference time to less than twice that of single LoRA inference by leveraging parallel computation. Evaluations across 26 tasks, including multiple-choice questions and question answering, demonstrate that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice datasets and significant improvements in BLEU and ROUGE scores on QA datasets, outperforming different LLM backbones under composite task settings. DLP-LoRA effectively balances performance and efficiency, making it a practical solution for dynamic multi-task adaptation in LLMs. Our code is available at https://github.com/MeCuping/DLP-LoRA.
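
The selection-and-fusion step described in the abstract can be illustrated with a minimal sketch. It assumes the common nucleus-style reading of top-$p$ (keep the smallest set of LoRAs whose cumulative router probability reaches $p$, then renormalize); the paper's exact rule may differ. The function names `top_p_select` and `fuse_adapter_outputs` and the router probabilities are hypothetical and do not come from the DLP-LoRA repository.

```python
# Illustrative sketch only: the names and the exact selection rule are
# assumptions, not taken from the DLP-LoRA code base.

def top_p_select(probs, p=0.9):
    """Pick the smallest set of LoRAs whose cumulative probability reaches p.

    probs: router probabilities, one per LoRA (assumed to sum to 1).
    Returns {lora_index: fusion_weight}, with weights renormalized to sum to 1.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in chosen}


def fuse_adapter_outputs(weights, adapter_outputs):
    """Weighted sum of the selected LoRAs' output vectors."""
    dim = len(next(iter(adapter_outputs.values())))
    fused = [0.0] * dim
    for i, w in weights.items():
        for k, value in enumerate(adapter_outputs[i]):
            fused[k] += w * value
    return fused


# Hypothetical router output for one sentence over four task LoRAs.
router_probs = [0.60, 0.25, 0.10, 0.05]
weights = top_p_select(router_probs, p=0.8)  # keeps LoRAs 0 and 1

# Toy per-LoRA outputs; in a real model these would be the low-rank updates.
adapter_outputs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, 0.5]}
fused = fuse_adapter_outputs(weights, adapter_outputs)
```

Because the router in this sketch runs once per sentence rather than once per token, the selected weights can be reused for every token in that sentence, which is the source of the efficiency argument made above.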

Paper Structure (31 sections, 5 equations, 3 figures, 17 tables)

Figures (3)

  • Figure 1: DLP-LoRA framework: the mini-MLP plugin activates different LoRAs depending on the input task and sentence. When the plugin uses Top-$p$ sampling, multiple LoRAs are sampled and fused, with probability $p$ as the threshold. DLP-LoRA fusion is triggered only when the first token of each new sentence is generated.
  • Figure 2: The performance of DLP-LoRA compared to 7 LoRA baselines using Qwen-2 1.5B (left) and LLaMA-3 8B (right) backbones across 26 tasks. See the appendix for more results using the Qwen-2 7B and LLaMA-2 7B LLM backbones.
  • Figure 3: Radar chart of Qwen-2 7B and LLaMA-2 7B across 18 MCQ and 8 QA tasks.