MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models
Jingwei Xu, Junyu Lai, Yunpeng Huang
TL;DR
MeteoRA addresses the challenge of deploying many LoRA adapters within a single base LLM. It embeds existing adapters through a full-mode Mixture-of-Experts (MoE) architecture whose trainable gating networks autonomously select the relevant adapters per token. It also introduces forward-acceleration techniques (bmm-torch and a Triton kernel) to mitigate MoE inefficiency. On LLaMA2-13B and LLaMA3-8B with 28 adapters, MeteoRA achieves accuracy and BLEU/ROUGE scores comparable to traditional PEFT across the 28 tasks and outperforms it on composite-n evaluations, owing to timely adapter switching. Overall, MeteoRA offers a scalable, efficient way to use off-the-shelf LoRA adapters in autonomous, multi-task LLM deployments, particularly for cross-domain, sequential problem-solving.
Abstract
The pretrain+fine-tune paradigm is foundational for deploying large language models (LLMs) across various downstream applications. Within this framework, Low-Rank Adaptation (LoRA) stands out as a parameter-efficient fine-tuning (PEFT) method and has produced numerous reusable task-specific LoRA adapters. However, using these adapters requires explicitly selecting the intended task, which makes autonomous task sensing and switching difficult during inference when multiple existing LoRA adapters are embedded in a single LLM. In this work, we introduce MeteoRA (Multiple-tasks embedded LoRA), a scalable and efficient framework that embeds multiple existing task-specific LoRA adapters into the base LLM via a full-mode Mixture-of-Experts (MoE) architecture. The framework also includes novel MoE forward acceleration strategies to address the efficiency challenges of traditional MoE implementations. Our evaluation, using the LLaMA2-13B and LLaMA3-8B base models equipped with 28 existing LoRA adapters through MeteoRA, demonstrates performance equivalent to that of the traditional PEFT method. Moreover, the LLM equipped with MeteoRA achieves superior performance on composite tasks, solving ten sequential problems in a single inference pass, which demonstrates the framework's enhanced capability for timely adapter switching.
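To make the idea of per-token adapter selection concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a linear gating network, top-k routing with softmax-normalized weights, and toy dimensions; the function name `gated_lora_forward` and the parameters `top_k` and `scale` are hypothetical and chosen only for illustration. It loops over the selected adapters for clarity and does not show the bmm-torch or Triton acceleration paths.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_lora_forward(x, W, adapters, gate, top_k=2, scale=1.0):
    """Base output plus a gate-weighted sum of LoRA updates for ONE token.

    x        : input vector of length d_in
    W        : frozen base weight, d_out x d_in
    adapters : list of (A, B) pairs, A is r x d_in, B is d_out x r
    gate     : gating weight, n_adapters x d_in (assumed linear gate)
    top_k    : number of adapters kept per token (an assumption here)
    """
    logits = matvec(gate, x)  # one routing score per adapter
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])  # renormalize over the selected adapters
    out = matvec(W, x)  # base model path
    for w, i in zip(weights, top):
        A, B = adapters[i]
        delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
        out = [o + scale * w * d for o, d in zip(out, delta)]
    return out, top, weights

# Toy setup: 2-dimensional hidden state, three rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
    ([[1.0, 1.0]], [[1.0], [1.0]]),
]
gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Different tokens are routed to different adapters.
for token in ([2.0, 0.0], [0.0, 3.0]):
    out, chosen, weights = gated_lora_forward(token, W, adapters, gate, top_k=1)
    print(token, "-> adapter", chosen, "output", out)
```

Because the gate is evaluated for every token, consecutive tokens can be routed to different adapters, which is the property the composite-task results rely on.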
