InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

Hongyu Chen, Letian Ruan, Zilin Xu, Yuchen Li, Xinyu Chen, Jingwen Leng, Bingsheng He, Minyi Guo, Shixuan Sun

Abstract

LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as Mixture-of-Experts (MoE) significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA achieves an average 3.05× increase in serviceable request rate under strict latency SLOs and improves the percentage of LoRA adapters satisfying the SLO requirement by 54.0%.
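
For context on the computation being disaggregated: LoRA leaves the base weight W frozen and adds a low-rank update, so a linear layer computes y = Wx + (α/r)·B(Ax) with rank r far below the hidden dimension. The sketch below illustrates this standard LoRA formulation in PyTorch; the class name LoRALinear and its hyperparameters are illustrative, not InfiniLoRA's actual API. It also hints at why MoE inflates LoRA memory: when every expert's projections carry their own A and B matrices, the per-adapter footprint grows roughly with the expert count.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-augmented linear layer (illustrative only).

    The frozen base weight W is shared by all tenants; each adapter adds
    a low-rank correction, y = W x + (alpha / r) * B (A x), with r << d.
    """

    def __init__(self, d_in: int, d_out: int, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)   # base model stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # (r, d_in)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # (d_out, r)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base GEMM plus low-rank adapter path. A disaggregated design such
        # as InfiniLoRA can run these two paths on separate devices and sum
        # the results, instead of coupling them on the same GPU.
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

# Usage: one 4096-wide layer; a rank-16 adapter adds 2*4096*16 parameters
# versus 4096*4096 for the base weight (~0.8% of the base layer).
layer = LoRALinear(d_in=4096, d_out=4096, r=16)
y = layer(torch.randn(2, 4096))  # -> shape (2, 4096)
```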

Paper Structure

This paper contains 37 sections, 6 equations, 19 figures, 4 tables, and 1 algorithm.

Figures (19)

  • Figure 1: (Top) LoRA cache capacity across model architectures. (Bottom) Scale-out vs. scale-up performance.
  • Figure 2: Prefill–decode disaggregated architecture. LLM instances are deployed with 2 GPUs using expert parallelism.
  • Figure 3: LoRA computation on Dense and MoE models.
  • Figure 4: Coupled-design multi-LoRA serving architecture.
  • Figure 5: Impact of LoRA cache ratio on TTFT performance and SLO attainment. (Left) P95 TTFT under varying cache ratios, with SLO of 0.25 seconds. (Right) Percentage of LoRA adapters for which the fraction of requests meeting the TTFT SLO exceeds specific thresholds (50%, 80%, and 90%).
  • ...and 14 more figures