LLM Inference Serving: Survey of Recent Advances and Opportunities

Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari

TL;DR

This survey covers system-level innovations in LLM serving published since 2023, emphasizing memory management, computation scheduling, cloud deployment, and emerging research areas rather than changes to the decoding algorithm. It consolidates work from ML and systems venues into practical approaches for scalable, low-latency inference in production. Topics covered include KV-cache optimization, long-context strategies, continuous batching, disaggregated inference, model-parallelism techniques, and RAG and MoE inference, along with efficiency and deployment considerations. The survey is intended to help practitioners select and deploy efficient LLM serving stacks in real-world settings, from on-premises to cloud and edge environments.

Abstract

This survey offers a comprehensive overview of recent advancements in Large Language Model (LLM) serving systems, focusing on research since 2023. We specifically examine system-level enhancements that improve performance and efficiency without altering the core LLM decoding mechanisms. By selecting and reviewing high-quality papers from prestigious ML and systems venues, we highlight key innovations and practical considerations for deploying and scaling LLMs in real-world production environments. This survey serves as a valuable resource for LLM practitioners seeking to stay abreast of the latest developments in this rapidly evolving field.

Paper Structure

This paper contains 20 sections, 3 equations, 2 figures, 1 algorithm.

Figures (2)

  • Figure 1: Transformer-based LLM architecture including both the multi-head attention mechanism and feed-forward network.
  • Figure 2: Prefill and decoding phases in LLM inference.