MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

TL;DR

MemServe is a unified LLM-serving system that combines inter-request context caching with intra-request disaggregated inference through an elastic memory pool, MemPool, and a locality-aware global scheduler. MemPool provides memory, indexing, and distributed-transfer APIs for managing active and historical KV caches across a cluster, so that these optimization techniques can be composed on one platform. A global prompt-tree scheduling policy routes requests to maximize cache reuse and reduce data movement, supported by a cost model that predicts execution time to decide between transferring a cached KV block and recomputing it. End-to-end evaluations on a DGX H800 show substantial improvements in job completion time (JCT) and time-to-first-token (TTFT) when disaggregated inference is combined with caching, especially for workloads with long prompts and high cache-reuse potential. Overall, MemServe demonstrates a scalable, memory-centric approach to LLM serving that unifies previously separate optimization strategies.
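The transfer-versus-recompute decision mentioned above can be illustrated with a minimal sketch. This is not MemServe's cost model: the class name `CostModel`, its fields, and the linear cost assumptions (prefill time proportional to token count, transfer time proportional to bytes moved) are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Hypothetical linear cost model; every coefficient is an assumed input."""

    prefill_sec_per_token: float      # assumed cost to recompute one prompt token
    link_bytes_per_sec: float         # assumed bandwidth between two instances
    kv_bytes_per_token: int           # assumed KV cache footprint of one token
    transfer_setup_sec: float = 0.0   # assumed fixed overhead per transfer

    def recompute_time(self, cached_tokens: int) -> float:
        # Simplification: real prefill cost grows faster than linearly.
        return cached_tokens * self.prefill_sec_per_token

    def transfer_time(self, cached_tokens: int) -> float:
        payload = cached_tokens * self.kv_bytes_per_token
        return self.transfer_setup_sec + payload / self.link_bytes_per_sec

    def should_transfer(self, cached_tokens: int) -> bool:
        # Transfer only when there is something cached and moving it is
        # predicted to be faster than recomputing it locally.
        if cached_tokens <= 0:
            return False
        return self.transfer_time(cached_tokens) < self.recompute_time(cached_tokens)


slow_link = CostModel(1e-4, 1e9, 100_000, transfer_setup_sec=0.01)
fast_link = CostModel(1e-4, 1e10, 100_000, transfer_setup_sec=0.01)
print(slow_link.should_transfer(1000), fast_link.should_transfer(1000))
```

With these made-up coefficients, the slower link makes recomputation the cheaper option for 1000 cached tokens, while the faster link favors a transfer.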

Abstract

Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.
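To make the idea of prompt tree-based, locality-aware routing concrete, here is a minimal sketch. It is an illustration under assumptions, not the paper's scheduler: the names `PromptTree`, `record`, `match_lengths`, and `route` are hypothetical, the tree is a plain per-token trie, and ties fall back to the first instance listed.

```python
from typing import Dict, Hashable, List, Sequence, Set


class _Node:
    def __init__(self) -> None:
        self.children: Dict[Hashable, "_Node"] = {}
        self.instances: Set[str] = set()  # instances assumed to cache this prefix


class PromptTree:
    """Hypothetical global trie mapping token prefixes to caching instances."""

    def __init__(self) -> None:
        self.root = _Node()

    def record(self, tokens: Sequence[Hashable], instance: str) -> None:
        # Mark every prefix of `tokens` as cached on `instance`.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())
            node.instances.add(instance)

    def match_lengths(self, tokens: Sequence[Hashable]) -> Dict[str, int]:
        # For each instance, the length of the longest cached prefix of `tokens`.
        lengths: Dict[str, int] = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for inst in node.instances:
                lengths[inst] = depth
        return lengths


def route(tree: PromptTree, tokens: Sequence[Hashable], instances: List[str]) -> str:
    # Pick the instance with the longest matching prefix; ties go to the
    # earliest entry in `instances`.
    lengths = tree.match_lengths(tokens)
    return max(instances, key=lambda inst: lengths.get(inst, 0))


tree = PromptTree()
tree.record([1, 2, 3, 4], "a")
tree.record([1, 2, 9], "b")
print(route(tree, [1, 2, 3, 7], ["b", "a"]))
```

A production policy would also weigh load and memory pressure on each instance; this sketch considers prefix locality only.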

Paper Structure (24 sections, 2 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: MemServe Architecture. It supports three types of inference instances: prefill-only, decode-only, and PD-colocated. Each inference engine runs over one or multiple AI servers, depending on the parallelism configuration.
  • Figure 2: MemPool Transfer API. The left shows the workflow of transfer and transfer_with_insert. The right shows asymmetric parallelism and memory medium.
  • Figure 3: Use Cases Enabled By MemPool. Circle 1 is context caching. Circle 2 is disaggregated inference. Circle 3 is sequence parallelism. Solid gray lines mean MemPool index API calls. Solid red lines mean MemPool distributed APIs. MemPool enables all use cases in one platform.
  • Figure 4: Enhancing Disaggregated Inference with Context Caching using MemPool APIs. The engine box means an adapted inference engine such as vLLM. Circled numbers mean steps taken to build the solution. A-KV is active KV cache. H-KV is historical KV cache.
  • Figure 5: Optimize Network&Memory for Disagg. Inference.
  • ...and 10 more figures