ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, Jianxun Li, Hao Wang

TL;DR

ServerlessLoRA tackles the inefficiencies of serverless LoRA inference by enabling backbone sharing across isolated function instances, extending pre-loading to cover LoRA artifacts, and introducing contention-aware adaptive batching with dynamic GPU offloading. It formulates pre-loading as a Precedence-Constrained Knapsack Problem (PCKP) and uses CUDA IPC to share backbone tensors, enabling unmerged LoRA inference while preserving strict function isolation; the design achieves fast startup and high throughput under bursty workloads, while reducing GPU usage and cost. Empirical results on industrial traces with Llama2-7B/13B LoRA adapters show TTFT reductions up to 86% and monetary-cost reductions up to 89%, with substantial throughput gains and low SLO violation rates, demonstrating practical impact for multi-tenant LoRA serving. The work provides a scalable, secure framework for serverless LoRA inference, enabling efficient deployment of many LoRA variants on shared backbones with minimal overhead.
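To make the knapsack framing concrete, here is a minimal sketch of a precedence-constrained knapsack solved by exhaustive search. It is not the paper's formulation or solver, which this summary does not spell out. The artifact names, values (latency saved), sizes (memory used), and the dependency graph are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical artifacts: name -> (value = latency saved, size = memory used).
ITEMS = {
    "libraries": (3, 2),
    "backbone": (5, 14),
    "adapter_a": (4, 1),
    "adapter_b": (2, 1),
    "kernels_a": (3, 2),
}

# Hypothetical precedence: an artifact is only useful if its prerequisites are loaded.
REQUIRES = {
    "backbone": {"libraries"},
    "adapter_a": {"backbone"},
    "adapter_b": {"backbone"},
    "kernels_a": {"adapter_a"},
}


def solve_pckp(items, requires, capacity):
    """Exhaustively solve a small precedence-constrained knapsack.

    Returns (best_value, sorted list of chosen item names).
    Exponential in the number of items, so only suitable for tiny instances.
    """
    names = sorted(items)
    best_value, best_subset = 0, []
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            chosen = set(subset)
            # Precedence constraint: each chosen item needs all its prerequisites.
            if any(not requires.get(n, set()) <= chosen for n in chosen):
                continue
            # Capacity constraint: total size must fit in the budget.
            if sum(items[n][1] for n in chosen) > capacity:
                continue
            value = sum(items[n][0] for n in chosen)
            if value > best_value:
                best_value, best_subset = value, sorted(chosen)
    return best_value, best_subset


if __name__ == "__main__":
    print(solve_pckp(ITEMS, REQUIRES, capacity=18))
    print(solve_pckp(ITEMS, REQUIRES, capacity=19))
```

With these made-up numbers, a budget of 18 fits the libraries, the backbone, and both adapters. Raising the budget to 19 makes it better to drop `adapter_b` in favor of `kernels_a`, which is only admissible because `adapter_a` is already selected.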

Abstract

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless systems can effectively serve general LLMs but fail with Low-Rank Adaptation (LoRA) inference due to three key limitations: 1) massive parameter redundancy among functions, where 99% of weights are unnecessarily duplicated; 2) costly artifact loading latency beyond LLM loading; and 3) magnified resource contention when serving multiple LoRA LLMs. These inefficiencies lead to massive GPU wastage, increased Time-To-First-Token (TTFT), and high monetary costs. We propose ServerlessLoRA, a novel serverless inference system designed for faster and cheaper LoRA LLM serving. ServerlessLoRA enables secure backbone LLM sharing across isolated LoRA functions to reduce redundancy. We design a pre-loading method that pre-loads comprehensive LoRA artifacts to minimize cold-start latency. Furthermore, ServerlessLoRA employs contention-aware batching and offloading to mitigate GPU resource conflicts during bursty workloads. Experiments on industrial workloads demonstrate that ServerlessLoRA reduces TTFT by up to 86% and cuts monetary costs by up to 89% compared to state-of-the-art LLM inference solutions.
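The redundancy claim and the idea of unmerged inference can be illustrated with a small back-of-the-envelope sketch. It assumes a single square projection of width 4096 (the hidden size of Llama2-7B) and a LoRA rank of 16, which is an illustrative choice and not a setting taken from the paper. The tiny integer matrices below are likewise made up. The second half checks that applying the adapter separately, x·W + (x·A)·B, matches applying the merged weight, x·(W + A·B), which is what lets several adapters reuse a single copy of W.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Fraction of a layer's parameters that belong to a rank-`rank` LoRA adapter."""
    base = d_in * d_out              # frozen backbone weight W
    adapter = rank * (d_in + d_out)  # low-rank factors A (d_in x r) and B (r x d_out)
    return adapter / (base + adapter)


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matadd(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def merged_forward(x, w, a, b):
    """Merged inference: fold the adapter into the weight, x @ (W + A @ B)."""
    return matmul(x, matadd(w, matmul(a, b)))


def unmerged_forward(x, w, a, b):
    """Unmerged inference: keep W untouched, x @ W + (x @ A) @ B."""
    return matadd(matmul(x, w), matmul(matmul(x, a), b))


# Toy integer data (hypothetical) so the equality check is exact.
X = [[1, 2, 3]]
W = [[1, 0, 2], [0, 1, 1], [3, 1, 0]]
A = [[1], [0], [2]]   # rank-1 factor, shape 3 x 1
B = [[2, -1, 1]]      # rank-1 factor, shape 1 x 3

if __name__ == "__main__":
    print(f"adapter share: {lora_param_fraction(4096, 4096, 16):.4%}")
    print(merged_forward(X, W, A, B) == unmerged_forward(X, W, A, B))
```

Under these assumptions the adapter accounts for well under 1% of the layer's parameters, so a function that keeps its own private copy of the backbone is mostly holding weights that every other LoRA function on the same backbone also holds.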

Paper Structure

This paper contains 28 sections, 7 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Time breakdown of LoRA functions' invocations.
  • Figure 2: Cost-effectiveness of serverless and serverful solutions (we set vLLM as baseline).
  • Figure 3: System overview.
  • Figure 4: Backbone LLM sharing among function instances.
  • Figure 5: Trace examples of "Predictable" (CoV $\leq 1$), "Normal" ($1 <$ CoV $\leq 4$), and "Bursty" request arrival patterns.
  • ...and 7 more figures