λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

Minchen Yu, Rui Yang, Chaobo Jia, Zhaoyuan Su, Sheng Yao, Tingfeng Lan, Yuchen Yang, Yue Cheng, Wei Wang, Ao Wang, Ruichuan Chen

TL;DR

The paper tackles slow scaling of serverless LLM inference, which is bottlenecked by cold starts, by introducing λScale, a system that combines cross-node GPUDirect RDMA multicast with an execute-while-load paradigm. Its core component, λPipe, partitions a model into blocks and builds adaptive execution pipelines across receiving nodes, so distributed inference can begin while the model is still loading. λScale pairs this with locality-aware model startup and memory management for models held in GPU or host memory. On real-world traces, the reported results are 2.4x–5x tail-latency improvements, up to 31.3% GPU-cost reduction, and sub-second scaling of large models across multiple nodes, which matters for bursty serverless inference workloads.

Abstract

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
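To make "execute-while-load" concrete, here is a minimal sketch of the planning step it implies: partway through a transfer, no single node holds the whole model, but a group of nodes may jointly hold every block and can therefore serve as one pipeline. This is an illustration only, not λPipe's actual algorithm. The function name `plan_pipeline`, the node names, and the least-loaded tie-breaking rule are all assumptions introduced here.

```python
from typing import Dict, List, Optional, Set, Tuple


def plan_pipeline(
    held: Dict[str, Set[int]], num_blocks: int
) -> Optional[List[Tuple[int, str]]]:
    """Assign each model block to a node that already holds it.

    Returns (block, node) stages in block order, or None if some block
    has not arrived anywhere yet. Hypothetical helper, not from the paper.
    """
    stages: List[Tuple[int, str]] = []
    load: Dict[str, int] = {node: 0 for node in held}
    for block in range(num_blocks):
        candidates = [node for node, blocks in held.items() if block in blocks]
        if not candidates:
            return None  # the pipeline cannot be formed yet
        # Assumed policy: pick the least-loaded holder, ties broken by name.
        node = min(candidates, key=lambda n: (load[n], n))
        load[node] += 1
        stages.append((block, node))
    return stages


# Mid-transfer snapshot: neither node holds the full three-block model.
snapshot = {"node_b": {0, 1}, "node_c": {2}}
print(plan_pipeline(snapshot, num_blocks=3))
# Block 1 has not arrived anywhere, so no pipeline can be planned.
print(plan_pipeline({"node_b": {0}, "node_c": {2}}, num_blocks=3))
```

In the first snapshot the two nodes together cover all three blocks, so requests could flow through `node_b` and then `node_c` before either has finished loading the full model.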

Paper Structure

This paper contains 23 sections, 18 figures, 1 table, 2 algorithms.

Figures (18)

  • Figure 1: Normalized request rates of two representative serverless inference services. Trace 1 (top): a 12-hour serverless inference trace collected from Alibaba Cloud. Trace 2 (bottom): a 12-hour trace from a real-world LLM workload, BurstGPT [burstGPT_arxiv24].
  • Figure 2: Distribution of models' keep-alive time in memory.
  • Figure 3: Proportion of the 3 types of model loading.
  • Figure 4: λScale architecture overview. In this example, Node A initiates a binomial pipeline multicast [rdmc, binomial-pipe] for a model partitioned into three model blocks. For a detailed illustration of the hypercube binomial pipeline, see RDMC [rdmc]. Each participating worker node transmits one model block per sequential step (indicated by the numbered labels along the data-flow arrows). A receiver node forwards the blocks it has received to its neighbors (e.g., Node B forwards block a at steps 2 and 3). The color-coded model blocks correspond to the data-flow paths.
  • Figure 5: Example of $2 \rightarrow 8$ scaling. Each sub-group transfers model chunks in circularly shifted order in parallel, to construct three pipeline-parallel inference execution flows (Nodes 3 and 6, Nodes 4 and 7, and Nodes 5 and 8). This strategy allows multiple execution pipelines to start as soon as enough model blocks are distributed, while enabling non-blocking binomial pipeline multicast [rdmc, binomial-pipe].
  • ...and 13 more figures
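
The circularly shifted transfer order described in the Figure 5 caption can be illustrated with a toy model. The sketch below assumes each receiver gets exactly one block per step and that paired receivers start at different offsets. It then computes the first step at which a pair jointly holds every block. The helper names `shifted_order` and `first_complete_step`, the four-block model, and the one-block-per-step timing are assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def shifted_order(num_blocks: int, shift: int) -> List[int]:
    """Order in which a receiver gets blocks, starting at offset `shift`."""
    return [(shift + i) % num_blocks for i in range(num_blocks)]


def first_complete_step(orders: List[List[int]]) -> int:
    """First step (1-based) at which the receivers jointly hold all blocks.

    Assumes every receiver gets exactly one block per step.
    """
    num_blocks = len(orders[0])
    for step in range(1, num_blocks + 1):
        covered = set()
        for order in orders:
            covered.update(order[:step])
        if len(covered) == num_blocks:
            return step
    raise ValueError("orders never cover all blocks")


# Two receivers, four blocks, offsets 0 and 2: orders [0,1,2,3] and [2,3,0,1].
shifted = [shifted_order(4, 0), shifted_order(4, 2)]
# Same two receivers without shifting: both receive [0,1,2,3].
unshifted = [shifted_order(4, 0), shifted_order(4, 0)]
print(first_complete_step(shifted), first_complete_step(unshifted))
```

Under these assumptions the shifted pair covers the whole model after two steps, whereas the unshifted pair needs all four. This is the intuition behind starting pipelines before any single node has the full model.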