Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving

Yuchen Zhang, Hanyue Du, Chun Cao, Jingwei Xu

TL;DR

Loquetier presents a unified, virtualized framework for jointly fine-tuning and serving LoRA-based LLMs within a single runtime. It introduces a Virtualized Module to isolate adapter modifications and a Segmented Multi-LoRA Multiplication (SMLM) kernel to batch multi-adapter computation across forward and backward passes. The approach handles multiple LoRA adapters concurrently, with dynamic loading, migration, and efficient resource usage, and achieves substantial throughput and SLO improvements over baselines in inference, fine-tuning, and unified tasks. The work has practical implications for deploying scalable, multi-task PEFT workflows in production settings, and the code is publicly available to support reproducibility and adoption.

Abstract

Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning (PEFT) technique for adapting large language models (LLMs) to downstream tasks. While prior work has explored strategies for integrating LLM training and serving, there still remains a gap in unifying fine-tuning and inference for LoRA-based models. We present Loquetier, a virtualized multi-LoRA framework that seamlessly integrates LoRA fine-tuning and serving within a single runtime. Loquetier introduces two key components: (1) a Virtualized Module that isolates PEFT-based modifications and supports multiple adapters on a shared base model, and (2) an optimized computation flow with a kernel design that merges fine-tuning and inference paths in forward propagation, enabling efficient batching and minimizing kernel invocation overhead. Extensive experiments across three task settings show that Loquetier consistently outperforms existing baselines in both performance and flexibility, achieving up to $3.0\times$ the throughput of the state-of-the-art co-serving system on inference-only tasks and $46.4\times$ higher SLO attainment than PEFT on unified fine-tuning and inference tasks. The implementation of Loquetier is publicly available at https://github.com/NJUDeepEngine/Loquetier.
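
To make the idea of batching several adapters over one shared base model concrete, the following is a minimal sketch in plain Python. It is not Loquetier's SMLM kernel, whose internals are not described here; it only illustrates the standard LoRA update y = xW + scale * (xA_i)B_i applied per contiguous segment of a batch, under the assumption that rows using the same adapter are adjacent. All names (matmul, segments_by_adapter, multi_lora_forward, scale) are hypothetical.

```python
# Illustrative sketch only; NOT Loquetier's SMLM kernel.
# Assumption: rows of the batch that use the same adapter are contiguous.
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product for small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def segments_by_adapter(adapter_ids: List[str]) -> List[Tuple[str, int, int]]:
    """Split a batch into contiguous (adapter, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(adapter_ids) + 1):
        if i == len(adapter_ids) or adapter_ids[i] != adapter_ids[start]:
            segments.append((adapter_ids[start], start, i))
            start = i
    return segments


def multi_lora_forward(
    x: Matrix,
    w: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],
    adapter_ids: List[str],
    scale: float = 1.0,
) -> Matrix:
    """Compute y = x W + scale * (x A_i) B_i, with adapter i chosen per segment."""
    y = matmul(x, w)  # shared base projection, computed once for the whole batch
    for name, start, end in segments_by_adapter(adapter_ids):
        a, b = adapters[name]  # low-rank factors A (d_in x r) and B (r x d_out)
        delta = matmul(matmul(x[start:end], a), b)
        for r, delta_row in zip(range(start, end), delta):
            y[r] = [base + scale * d for base, d in zip(y[r], delta_row)]
    return y
```

In this sketch the base projection runs once over the whole batch and only the low-rank correction differs per segment. A fused GPU kernel would presumably handle all segments in a single launch instead of looping in Python.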

Paper Structure

This paper contains 21 sections, 6 figures, 8 tables, 2 algorithms.

Figures (6)

  • Figure 1: The framework diagram of Loquetier.
  • Figure 2: Performance comparison of Loquetier, FlexLLM, S-LoRA and PEFT on inference tasks. The upper part shows single-LoRA inference and the lower part shows multi-LoRA inference. Partial means that only 3 modules are enabled (up, gate, and down), matching what FlexLLM supports. Full means that all 7 modules are enabled: q, k, v, o, up, gate, and down. For detailed information on S-LoRA, please refer to the Appendix. $\times$ indicates that results were not obtained: FlexLLM does not support enabling LoRA modules for linear layers other than up, gate, and down; FlexLLM cycles through loading LoRA models during multi-LoRA inference.
  • Figure 3: Performance comparison of Loquetier, FlexLLM and PEFT on fine-tuning tasks. Partial and Full have the same meanings as in Figure 2. $\times$ indicates that results were not obtained: FlexLLM does not support backward propagation for modules other than up, gate, and down. PEFT can only fine-tune one LoRA adapter at a time, so its time cost is cumulative.
  • Figure 4: Performance comparison of Loquetier and PEFT on unified tasks. The 4 subplots correspond, respectively, to single-finetune & single-infer, single-finetune & multi-infer, multi-finetune & single-infer, and multi-finetune & multi-infer. Partial and Full have the same meanings as in Figure 2. $\times$ indicates that results were not obtained: FlexLLM and PEFT can only fine-tune 1 LoRA at a time due to GPU memory limitations, causing them to fail the multi-LoRA fine-tuning scenarios; FlexLLM only supports 3 target modules, as noted for the previous figures.
  • Figure 5: Performance of Loquetier under dynamic load in unified task.
  • ...and 1 more figure