CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference

Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang

TL;DR

CaraServe addresses multi-tenant LoRA serving for generative LLM inference by keeping the base model on GPUs, loading activated LoRA adapters on demand from host memory, and overlapping adapter loading with CPU-based prefill. It introduces a sync-free, CPU-assisted execution path, shared-memory IPC, and profiling-guided CPU parallelization, together with a rank-aware scheduling model that predicts batch latency for heterogeneous adapters and routes requests to minimize SLO violations. Empirical results on Llama2-7B/13B/70B show up to 1.4$\times$ better average request serving latency and up to 99% SLO attainment, with further gains in multi-GPU settings. Overall, CaraServe provides a scalable, GPU-efficient solution for multi-tenant LoRA serving that mitigates adapter cold starts, can be integrated with existing LLM pipelines, and can be extended to various models.

Abstract

Pre-trained large language models (LLMs) often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe maintains the base model on GPUs and dynamically loads activated LoRA adapters from main memory. Because loading an adapter onto the GPU incurs a cold start that substantially delays token generation, CaraServe employs a CPU-assisted approach. It starts the activated adapters early on CPUs for prefilling while they are being loaded onto GPUs; once loading completes, it switches to the GPUs for generative LoRA inference. CaraServe develops a highly optimized synchronization mechanism to efficiently coordinate LoRA computation on the CPU and GPU. Moreover, CaraServe employs a rank-aware scheduling algorithm to optimally schedule heterogeneous LoRA requests for maximum service-level objective (SLO) attainment. We have implemented CaraServe and evaluated it against state-of-the-art LoRA serving systems. Our results demonstrate that CaraServe can improve the average request serving latency by up to 1.4$\times$ and achieve an SLO attainment of up to 99%.

Paper Structure (19 sections, 4 equations, 20 figures, 2 tables, 1 algorithm)

This paper contains 19 sections, 4 equations, 20 figures, 2 tables, 1 algorithm.

Figures (20)

  • Figure 1: Illustration of CPU-assisted LoRA serving.
  • Figure 2: Continuous batching in which the decoding phase (Dec) is preempted to perform prompt processing upon a request arrival, which involves loading the requested LoRA adapter (Load) and prefilling (Pre).
  • Figure 3: Left: The distribution of cold-start overhead during the entire token generation of each request. Right: The cold-start latency of loading a single LoRA adapter of different ranks onto the GPU. The adapter applies to $\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$ of Llama2-7B on an A10 GPU instance.
  • Figure 4: The varying decoding latency of batching heterogeneous LoRA adapters. Left: The performance of Punica's BGMV [chen2023punica] is determined by the batch size and the maximum rank. Right: The performance of S-LoRA's MBGMV [sheng2023slora] depends on the batch size and the average rank in the batch.
  • Figure 5: An example of rank-aware LoRA scheduling with a decoding latency SLO of 36 ms. With Punica's BGMV, scheduling the new request to Instance 2 meets the SLO; with S-LoRA's MBGMV, scheduling it to Instance 1 preserves the SLO.
  • ...and 15 more figures
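
The captions of Figures 4 and 5 suggest how a rank-aware placement decision could work: predict the decoding latency of each candidate batch from its adapter ranks, then place the new request where the prediction stays within the SLO. The following is a minimal sketch of that idea, not the paper's algorithm. The linear latency models, their coefficients, the batch contents, and the names `Instance`, `bgmv_latency_ms`, `mbgmv_latency_ms`, and `pick_instance` are all hypothetical; only the 36 ms SLO and the "batch size and maximum rank" versus "batch size and average rank" dependence come from the captions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

SLO_MS = 36.0  # decoding latency SLO quoted in the Figure 5 caption


@dataclass
class Instance:
    name: str
    ranks: List[int] = field(default_factory=list)  # adapter rank of each running request


def bgmv_latency_ms(ranks: List[int]) -> float:
    # Hypothetical model: cost scales with batch size times the MAXIMUM rank.
    return 30.0 + 0.02 * len(ranks) * max(ranks)


def mbgmv_latency_ms(ranks: List[int]) -> float:
    # Hypothetical model: cost scales with batch size times the AVERAGE rank,
    # which equals the sum of the ranks in the batch.
    return 30.0 + 0.04 * sum(ranks)


def pick_instance(
    instances: List[Instance],
    new_rank: int,
    model: Callable[[List[int]], float],
    slo_ms: float = SLO_MS,
) -> Optional[Instance]:
    """Greedy choice: lowest predicted latency among instances that meet the SLO."""
    best, best_latency = None, None
    for inst in instances:
        predicted = model(inst.ranks + [new_rank])  # latency if the request joins this batch
        if predicted <= slo_ms and (best_latency is None or predicted < best_latency):
            best, best_latency = inst, predicted
    return best  # None means no instance can admit the request within the SLO


# Made-up batches: instance-1 runs many low-rank adapters, instance-2 a few high-rank ones.
instances = [
    Instance("instance-1", [8] * 6),
    Instance("instance-2", [64, 64]),
]

print(pick_instance(instances, 64, bgmv_latency_ms).name)   # instance-2
print(pick_instance(instances, 64, mbgmv_latency_ms).name)  # instance-1
```

With these made-up numbers, a rank-64 request joining the low-rank batch pushes the maximum-rank model over 36 ms, so that model prefers instance-2. Under the average-rank model the same batch stays within the SLO while the high-rank batch does not, so the choice flips to instance-1. This mirrors the qualitative behavior described in the Figure 5 caption.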