Punica: Multi-Tenant LoRA Serving

Lequn Chen; Zihao Ye; Yongji Wu; Danyang Zhuo; Luis Ceze; Arvind Krishnamurthy

TL;DR

Punica serves many LoRA models on a shared GPU cluster. Its core is Segmented Gather Matrix-Vector Multiplication (SGMV), a CUDA kernel that batches computation across different LoRA models sharing one backbone. A scheduler consolidates workloads on a fixed-size GPU cluster, while on-demand LoRA loading and a paged KvCache layout enable fast cold starts and memory-efficient decoding. The evaluation reports up to 12x higher throughput while adding only about 2 ms of latency per token, across 7B/13B/70B models; cluster deployment shows strong scaling and consolidation behavior. The work contributes a new kernel, new scheduling strategies, and an open-source, end-to-end implementation that integrates with existing frameworks.

Abstract

Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. Punica is open source at https://github.com/punica-ai/punica.
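To make the batching idea concrete, here is a minimal pure-Python sketch of the semantics a segmented kernel such as SGMV could implement: requests in a batch are grouped into contiguous segments, each segment uses its own LoRA weight pair, and the low-rank update is added on top of the shared backbone output. The function names (`matmul_rows`, `segmented_lora_add`), the `seg_starts` layout, and the shrink-then-expand decomposition shown here are illustrative assumptions, not the Punica API or its CUDA implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul_rows(x: Matrix, w: Matrix) -> Matrix:
    """Multiply each row of x (n x h_in) by w (h_in x h_out)."""
    h_out = len(w[0])
    return [
        [sum(row[k] * w[k][j] for k in range(len(w))) for j in range(h_out)]
        for row in x
    ]


def segmented_lora_add(
    y: Matrix,
    x: Matrix,
    seg_starts: List[int],
    lora_a: List[Matrix],
    lora_b: List[Matrix],
) -> Matrix:
    """Hypothetical reference semantics (not the real kernel):
    for segment i covering rows seg_starts[i]:seg_starts[i+1],
    y[rows] += x[rows] @ A_i @ B_i, updating y in place.
    """
    for i in range(len(seg_starts) - 1):
        lo, hi = seg_starts[i], seg_starts[i + 1]
        if lo == hi:
            continue  # no requests for this LoRA model in the batch
        shrunk = matmul_rows(x[lo:hi], lora_a[i])  # "shrink": h_in -> rank r
        delta = matmul_rows(shrunk, lora_b[i])     # "expand": rank r -> h_out
        for row_idx, d in zip(range(lo, hi), delta):
            y[row_idx] = [a + b for a, b in zip(y[row_idx], d)]
    return y


# Three requests: the first two use LoRA model 0, the third uses model 1.
x = [[1, 2], [3, 4], [5, 6]]
y = [[0, 0], [0, 0], [0, 0]]          # stand-in for the backbone output
lora_a = [[[1], [0]], [[0], [1]]]     # per-model A, shape 2 x 1 (rank 1)
lora_b = [[[1, 1]], [[2, 0]]]         # per-model B, shape 1 x 2
result = segmented_lora_add(y, x, [0, 2, 3], lora_a, lora_b)
print(result)  # [[1, 1], [3, 3], [12, 0]]
```

A real kernel would process all segments in one GPU launch rather than looping in Python; the point of the sketch is only that the backbone weights are shared while the per-segment LoRA weights differ.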

Paper Structure (31 sections, 4 equations, 13 figures)

Introduction
Background
Transformer and Text Generation
Low-Rank Adaptation (LoRA)
How to serve multi-tenant LoRA models efficiently on a shared GPU cluster?
Punica Overview
Segmented Gather Matrix-Vector Multiplication
CUDA Kernel Schedule
Punica in Detail
Scheduling new requests
On-demand model loading
Request migration
Memory layout for KvCache
Implementation
Python Library
...and 16 more sections

Figures (13)

Figure 1: Batching effects in Prefill stage and in Decode stage
Figure 2: The system architecture of Punica.
Figure 3: Semantics of SGMV.
Figure 4: Scheduling of SGMV expand/shrink kernels
Figure 5: Request migration procedure for Request $R_3$.
...and 8 more figures
