Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

Himel Ghosh

TL;DR

This review examines ServerlessLLM, a system designed to address the cold start problem in serverless inference for large language models. ServerlessLLM significantly improves performance and scalability for LLM workloads in serverless environments.

Abstract

This review discusses cold start latency in serverless inference and the existing solutions to it. It focuses on ServerlessLLM, a system designed to address the cold start problem in serverless inference for large language models. Traditional serverless approaches suffer from high latency because LLM checkpoints are large and initializing GPU resources is costly. ServerlessLLM introduces a multitier checkpoint loading system that leverages underutilized GPU memory and storage to reduce startup times by 6-8x compared to existing methods. It also proposes live inference migration and a startup-time-optimized model scheduler, which together allocate resources efficiently and minimize delays. As a result, the system significantly improves performance and scalability for LLM workloads in serverless environments. Besides ServerlessLLM, this paper reviews several other methods from the recent research literature, including Rainbowcake. Further discussion covers how FaaS providers tackle cold starts and possible directions for future work.
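The idea of a startup-time-optimized scheduler over a multitier checkpoint store can be illustrated with a minimal sketch. It is not ServerlessLLM's implementation: the tier names, the bandwidth figures, the `queue_delay_s` field, and the cost formula (queue delay plus checkpoint size divided by tier bandwidth) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical tier bandwidths in GB/s. These are illustrative values,
# not measurements reported for ServerlessLLM.
TIER_BANDWIDTH_GBPS = {"memory": 20.0, "local_storage": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    checkpoint_tier: str  # fastest tier on this server that holds the checkpoint
    queue_delay_s: float  # assumed wait before loading can begin


def estimated_startup_s(server: Server, checkpoint_gb: float) -> float:
    """Assumed cost model: queue delay plus transfer time from the tier."""
    return server.queue_delay_s + checkpoint_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]


def pick_server(servers: list[Server], checkpoint_gb: float) -> Server:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimated_startup_s(s, checkpoint_gb))


servers = [
    Server("a", "remote", 0.0),         # idle, but checkpoint must come from remote storage
    Server("b", "local_storage", 1.0),  # short queue, checkpoint on local storage
    Server("c", "memory", 6.0),         # checkpoint in memory, but a long queue
]
best = pick_server(servers, checkpoint_gb=20.0)
print(best.name, estimated_startup_s(best, 20.0))
```

Under these assumed numbers the scheduler prefers neither the idle server nor the one with the fastest tier; it picks whichever minimizes the combined estimate, which is the trade-off a startup-time-optimized scheduler has to make.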

Paper Structure

This paper contains 19 sections, 7 figures.

Figures (7)

  • Figure 1: Serverless Computing Architecture [Ghorbian2024]
  • Figure 2: Cold Start [coldstart]
  • Figure 3: Flowchart illustrating the methods ServerlessLLM uses to mitigate the cold start problem.
  • Figure 4: The multitier storage design to mitigate cold start issues
  • Figure 5: Diagram that illustrates the Live Migration process.
  • ...and 2 more figures