ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

Yao Fu; Leyang Xue; Yeqi Huang; Andrei-Octavian Brabete; Dmitrii Ustiugov; Yuvraj Patel; Luo Mai

TL;DR

ServerlessLLM tackles the high startup latency of serverless LLM inference by exploiting a GPU server's multi-tier local storage to cache and load checkpoints near the compute. It introduces three key innovations: a loading-optimized checkpoint format with a fast multi-tier loading subsystem, a token-based live migration mechanism that preserves locality with minimal data transfer, and a startup-time-aware model scheduler that estimates loading and migration times to minimize startup latency. Reported results include $3.6$–$8.2\times$ faster checkpoint loading, $10$–$200\times$ lower end-to-end latency on real workloads, and up to $212\times$ improvement over baselines for large models. These results support the practicality of locality-aware, near-GPU checkpointing and migration for scalable serverless LLM services. The system is released as open source and lays a foundation for pay-as-you-go, highly responsive LLM inference at scale; future directions include checkpoint placement and fairness-aware scheduling.

Abstract

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) fast multi-tier checkpoint loading, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) efficient live migration of LLM inference, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) startup-time-optimized model scheduling, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10–200× across various LLM inference workloads.
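
The startup-time-optimized scheduling idea in contribution (iii) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Server` class, the `TIER_BANDWIDTH_GBPS` table and its values, and the simple "checkpoint size divided by tier bandwidth, plus migration time if the GPU is busy" cost model are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical per-tier bandwidths in GB/s. Illustrative values only,
# not measurements from the paper.
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    """Hypothetical view of one GPU server as seen by the scheduler."""
    name: str
    checkpoint_tier: str           # fastest tier that holds the checkpoint
    gpu_busy: bool = False         # True if a running inference must move first
    migration_time_s: float = 0.0  # assumed estimate for moving that inference


def estimate_startup_time(server: Server, checkpoint_gb: float) -> float:
    """Estimated seconds until inference can start on `server`."""
    load_s = checkpoint_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    wait_s = server.migration_time_s if server.gpu_busy else 0.0
    return load_s + wait_s


def pick_server(servers: list[Server], checkpoint_gb: float) -> Server:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_time(s, checkpoint_gb))


servers = [
    Server("a", "remote"),
    Server("b", "ssd"),
    Server("c", "dram", gpu_busy=True, migration_time_s=2.0),
]
small = pick_server(servers, checkpoint_gb=10.0)  # SSD load beats DRAM + migration
large = pick_server(servers, checkpoint_gb=40.0)  # DRAM + migration beats SSD load
```

Under these assumed numbers, the small checkpoint goes to the idle server with an SSD copy, while the large one is worth a migration to reach the DRAM copy. The point is only that locality and migration cost are weighed together rather than always preferring an idle server.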

Paper Structure (29 sections, 12 figures)

Introduction
Background and Motivation
Why Serverless Inference for LLMs
Serverless Cluster and LLM Inference
Challenges with Serverless LLM Inference
Existing Solutions and Associated Issues
Exploiting In-Server Multi-Tier Storage
Design Intuitions
Design Concerns and Overview
Fast Multi-Tier Checkpoint Loading
Loading-Optimized Checkpoints
Multi-Tier Loading Subsystem
Efficient Live Migration of LLM Inference
Need for Live Migration
Making Live Migration Efficient
...and 14 more sections

Figures (12)

Figure 1: Overview of GPU serverless clusters, LLM inference and new designs introduced by ServerlessLLM.
Figure 2: Components in fast multi-tier checkpoint loading.
Figure 3: Analysis of different locality-driven policies.
Figure 4: Live migration process for LLM inference.
Figure 5: Overview of the model loading scheduler design.
...and 7 more figures
