λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
Minchen Yu, Rui Yang, Chaobo Jia, Zhaoyuan Su, Sheng Yao, Tingfeng Lan, Yuchen Yang, Yue Cheng, Wei Wang, Ao Wang, Ruichuan Chen
TL;DR
The paper addresses the slow scaling of serverless LLM inference caused by cold-start bottlenecks. It introduces λScale, which combines cross-node GPUDirect RDMA multicast with an execute-while-load paradigm. Its core innovation, λPipe, partitions a model into blocks and builds adaptive execution pipelines across the receiving nodes, so distributed inference can begin while the model is still loading. λScale couples this with locality-aware model startup and memory management for models held in GPU or host memory. On real-world traces, experiments show 2.4x–5x tail-latency improvements and up to 31.3% GPU-cost reductions. The system also achieves sub-second scaling for large models across multiple nodes, which makes it practical for bursty serverless inference workloads.
Abstract
Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
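The "execute-while-load" idea can be illustrated with a toy model. The sketch below is not λPipe's actual algorithm, which the abstract does not specify. It assumes a hypothetical schedule in which each receiving node gets the model blocks in a different rotated order, and it checks how soon the nodes jointly hold every block so that a cross-node pipeline could start. All function names and the rotation scheme are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple


def rotated_block_order(num_blocks: int, num_nodes: int) -> Dict[int, List[int]]:
    """Hypothetical schedule: node i starts receiving at a different block
    offset and wraps around, so early transfers cover distinct blocks."""
    orders = {}
    for node in range(num_nodes):
        start = (node * num_blocks) // num_nodes
        orders[node] = [(start + s) % num_blocks for s in range(num_blocks)]
    return orders


def earliest_pipeline_step(orders: Dict[int, List[int]]) -> Optional[int]:
    """Smallest number of blocks received per node at which the nodes
    collectively hold every block of the model."""
    num_blocks = len(next(iter(orders.values())))
    for step in range(1, num_blocks + 1):
        held = set()
        for order in orders.values():
            held.update(order[:step])
        if len(held) == num_blocks:
            return step
    return None


def build_pipeline(orders: Dict[int, List[int]], step: int) -> List[Tuple[int, int]]:
    """Map each block, in execution order, to the lowest-numbered node
    that already holds it after `step` transfers."""
    num_blocks = len(next(iter(orders.values())))
    pipeline = []
    for block in range(num_blocks):
        for node in sorted(orders):
            if block in orders[node][:step]:
                pipeline.append((block, node))
                break
    return pipeline


if __name__ == "__main__":
    rotated = rotated_block_order(num_blocks=8, num_nodes=4)
    step = earliest_pipeline_step(rotated)
    print("pipeline can start after", step, "blocks per node")
    print(build_pipeline(rotated, step))
```

In this toy setting with 8 blocks and 4 nodes, a pipeline spanning all four nodes can be formed after each node has received only 2 blocks, whereas a node loading the model on its own would need all 8 before serving a request.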
