A System for Microserving of LLMs

Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen

TL;DR

LLM microserving introduces a multi-level framework that decouples orchestration from execution to enable dynamic reconfiguration of LLM inference across multiple GPUs and nodes. By combining a programmable router with three fine-grained REST APIs and a unified KV cache interface, the approach supports both traditional and novel disaggregation patterns, including balanced prefill-decode and context cache migration, with minimal code changes. An end-to-end implementation on MLC-LLM using NVSHMEM demonstrates state-of-the-art performance and practical speedups (up to a 47% reduction in job completion time on certain workloads). The framework enables rapid experimentation with scheduling strategies and paves the way for more adaptive, production-grade LLM serving systems.

Abstract

Recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales towards multiple GPUs and even multiple compute nodes, various coordination patterns, such as prefill-decode disaggregation and context migration, arise in serving systems. Most inference services today expose a coarse-grained request-level API with a pre-configured coordination strategy, limiting the ability to customize and dynamically reconfigure the coordination. In this paper, we propose LLM microserving, a multi-level architecture for structuring and programming LLM inference services. We introduce simple yet effective microserving APIs to support fine-grained, sub-request-level actions. A programmable router transforms user requests into sub-request calls, enabling the dynamic reconfiguration of serving patterns. To support diverse execution patterns, we develop a unified KV cache interface that handles various KV compute, transfer, and reuse scenarios. Our evaluation shows that LLM microserving can be reconfigured to support multiple disaggregation orchestration strategies in a few lines of Python code while maintaining state-of-the-art performance for LLM inference tasks. Additionally, it allows us to explore new strategy variants that reduce job completion time by up to 47% compared to existing strategies.
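To make the idea of a router turning one user request into sub-request calls more concrete, the following is a minimal sketch of prefill-decode disaggregation over toy, in-process engines. Everything in it is an assumption made for illustration: the class `MockEngine`, the method names `prep_recv`, `remote_send`, and `start_generate`, their signatures, and the `prefill_ratio` knob are hypothetical stand-ins and are not taken from the text above. The "KV cache" is simply a list of token ids, and "decoding" emits the previous token plus one, so the sketch shows only the orchestration flow, not real inference or real KV transfer.

```python
from dataclasses import dataclass, field


@dataclass
class MockEngine:
    """Toy stand-in for a serving engine (hypothetical, not a real API)."""
    name: str
    kv: dict = field(default_factory=dict)  # request id -> cached token ids
    computed: int = 0                       # prompt KV entries computed here

    def prep_recv(self, req_id, tokens, end):
        """Prepare to receive KV for tokens[:end]; return the reusable prefix length."""
        cached = self.kv.setdefault(req_id, [])
        matched = 0
        while matched < min(len(cached), end) and cached[matched] == tokens[matched]:
            matched += 1
        del cached[matched:]  # drop entries that do not match the new prompt
        return matched

    def remote_send(self, req_id, tokens, begin, end, receiver):
        """Compute KV for tokens[begin:end] and write it into the receiver's cache."""
        chunk = tokens[begin:end]
        self.computed += len(chunk)
        receiver.kv[req_id].extend(chunk)

    def start_generate(self, req_id, tokens, begin, max_new):
        """Prefill tokens[begin:] locally, then run a toy decode loop."""
        cached = self.kv.setdefault(req_id, [])
        assert len(cached) == begin, "KV for tokens[:begin] must already be present"
        rest = tokens[begin:]
        self.computed += len(rest)
        cached.extend(rest)
        out = []
        for _ in range(max_new):
            nxt = cached[-1] + 1  # toy "model": next token is last token + 1
            cached.append(nxt)
            out.append(nxt)
        return out


def route_prefill_decode(prefill, decode, req_id, tokens, max_new, prefill_ratio=1.0):
    """Router logic: split prompt processing between a prefill and a decode engine.

    prefill_ratio is a hypothetical knob: 1.0 sends (almost) the whole prompt to
    the prefill engine, smaller values leave more prompt work on the decode engine.
    """
    # The decode engine always keeps at least the last prompt token.
    end = min(len(tokens) - 1, int(len(tokens) * prefill_ratio))
    matched = decode.prep_recv(req_id, tokens, end)
    if matched < end:
        prefill.remote_send(req_id, tokens, matched, end, decode)
    return decode.start_generate(req_id, tokens, end, max_new)


if __name__ == "__main__":
    p, d = MockEngine("prefill"), MockEngine("decode")
    print(route_prefill_decode(p, d, "r1", [10, 11, 12, 13], max_new=3))
    print(p.computed, d.computed)
```

In this sketch, changing only the `prefill_ratio` argument shifts prompt work from the prefill engine to the decode engine, which is the kind of strategy change the abstract describes as a few lines of router code. A real deployment would replace the in-process method calls with requests to the engines' REST endpoints.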

Paper Structure

This paper contains 23 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: LLM microserving System Overview. Our architecture enables dynamic reconfiguration of different orchestration strategies with a programmable router through three fine-grained REST APIs. The LLM microserving engines implement the APIs with a unified KV cache interface.
  • Figure 2: Data parallel via microserving
  • Figure 3: Prefill-decode disaggregation via microserving
  • Figure 4: Context cache-aware prefill-decode disaggregation via microserving
  • Figure 5: Context cache migration via microserving
  • ...and 8 more figures