MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models

Jingwei Xu, Junyu Lai, Yunpeng Huang

TL;DR

MeteoRA addresses the challenge of deploying many LoRA adapters within a single base LLM. It reuses existing adapters through a full-mode Mixture-of-Experts (MoE) architecture whose trainable gating networks autonomously select the relevant adapters for each token. To mitigate the inefficiency of traditional MoE forward passes, it introduces two acceleration techniques (bmm-torch and a Triton kernel). On LLaMA2-13B and LLaMA3-8B with 28 adapters, MeteoRA achieves accuracy and BLEU/ROUGE scores comparable to traditional PEFT across the 28 tasks and outperforms it on composite-n evaluations, where timely adapter switching matters most. This makes it a practical option for cross-domain, sequential problem-solving and a scalable, efficient way to leverage off-the-shelf LoRA adapters in autonomous, multi-task LLM deployments.

Abstract

The pretrain+fine-tune paradigm is foundational for deploying large language models (LLMs) across various downstream applications. Within this framework, Low-Rank Adaptation (LoRA) stands out for its parameter-efficient fine-tuning (PEFT), producing numerous reusable task-specific LoRA adapters. However, this approach requires explicit task intention selection, posing challenges for autonomous task sensing and switching during inference with multiple existing LoRA adapters embedded in a single LLM. In this work, we introduce MeteoRA (Multiple-tasks embedded LoRA), a scalable and efficient framework that embeds multiple existing task-specific LoRA adapters into the base LLM via a full-mode Mixture-of-Experts (MoE) architecture. This framework also includes novel MoE forward acceleration strategies to address the efficiency challenges of traditional MoE implementations. Our evaluation, using the LLaMA2-13B and LLaMA3-8B base models equipped with 28 existing LoRA adapters through MeteoRA, demonstrates performance equivalent to the traditional PEFT method. Moreover, the LLM equipped with MeteoRA achieves superior performance in handling composite tasks, effectively solving ten sequential problems in a single inference pass, thereby demonstrating the framework's enhanced capability for timely adapter switching.
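
The per-token adapter selection described above can be illustrated with a minimal sketch of one linear layer. This is not the authors' implementation: the function name `meteora_linear`, the linear gate `G`, the top-k selection, and the softmax over the selected logits are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def meteora_linear(x, W, adapters, G, top_k=1):
    """Pass one token through a frozen base layer plus gated LoRA adapters.

    x:        token hidden state, length d_in
    W:        frozen base weight, d_out x d_in
    adapters: list of (A, B) pairs; A is r x d_in, B is d_out x r
    G:        gate weight, n_adapters x d_in (hypothetical linear gate;
              the only part that would be trained in this sketch)
    """
    logits = matvec(G, x)
    # Assumed routing rule: keep the top_k adapters with the largest logits.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    # Assumed normalization: softmax over the selected logits only.
    weights = softmax([logits[i] for i in chosen])

    y = matvec(W, x)  # base model output
    for w, i in zip(weights, chosen):
        A, B = adapters[i]
        delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
        y = [yi + w * di for yi, di in zip(y, delta)]
    return y, chosen, weights


# Toy usage: two rank-1 adapters on a 2-dimensional layer.
W = [[1.0, 0.0], [0.0, 1.0]]
G = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [2.0]]),
]
y, chosen, weights = meteora_linear([0.2, 0.9], W, adapters, G, top_k=1)
print(chosen, weights, y)  # adapter 1 has the larger gate logit here
```

Because the gate is evaluated for every token, consecutive tokens can route to different adapters, which is the behavior the composite-task evaluation relies on. A practical implementation would batch this computation across tokens and adapters rather than loop.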

Paper Structure

This paper contains 22 sections, 7 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our proposed framework provides a full-mode MoE architecture that directly reuses various off-the-shelf LoRA adapters, enhancing the LLM's ability to activate the appropriate adapters for the input promptly and autonomously. MeteoRA modules can be integrated into all basic linear layers of both the Attention and MLP modules. With the MoE forward acceleration strategies, an LLM equipped with MeteoRA can address tasks across a wide range of domains effectively.
  • Figure 2: The architecture of the MeteoRA module with MoE-style LoRA embedding. MeteoRA directly reuses existing LoRA adapters without fine-tuning them and only requires training the gating network.
  • Figure 3: Evaluation results on the 28 selected tasks. MeteoRA performs similarly to the PEFT baseline on most tasks, leading to high overlap between the two polygons in the radar graphs. For clarity, we only draw results from MeteoRA with the top-1 strategy in the radar graphs. Detailed results for each individual task are available in the appendix.
  • Figure 4: An example of a composite-3 task. We highlight the statistically dominant LoRA selected by MeteoRA at the token level (decoded to words). The result shows that an LLM with MeteoRA can achieve timely LoRA switching during both input understanding and output generation. The background color gets darker when the gating network assigns a higher weight.
  • Figure 5: The overall root-of-runtime of four forward pass designs on 28 different Big-Bench subtasks.
  • ...and 3 more figures