MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
TL;DR
MemServe is a unified LLM-serving system that combines inter-request context caching with intra-request disaggregated inference through an elastic MemPool and a locality-aware global scheduler. The MemPool provides memory, indexing, and distributed-transfer APIs to manage active and historical KV caches across a cluster, so that the two optimization techniques can be combined. A global prompt-tree scheduling policy routes requests to maximize cache reuse and reduce data movement, supported by a cost model that predicts execution time to decide between transferring a KV cache and recomputing it. End-to-end evaluations on a DGX H800 show substantial improvements in job completion time (JCT) and time-to-first-token (TTFT) when disaggregated inference is combined with caching, especially for workloads with long prompts and high cache-reuse potential. Overall, MemServe demonstrates a scalable, memory-centric approach to LLM serving that unifies previously separate optimization strategies.
Abstract
Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a locality-aware policy based on a global prompt tree. Tests show that MemServe significantly improves job completion time and time-to-first-token.
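To illustrate the idea of locality-aware routing over a global prompt tree, the following is a minimal sketch and not MemServe's implementation. It assumes prompts are indexed in fixed-size token blocks, that each tree node records which serving instances hold the KV cache for that prefix, and that ties are broken by current load. The names `GlobalPromptTree`, `record`, `match`, `route`, and the `block_size` parameter are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PromptTreeNode:
    # Child nodes keyed by the next token block (a tuple of token ids).
    children: dict = field(default_factory=dict)
    # Instances assumed to hold the KV cache for the prefix ending at this node.
    instances: set = field(default_factory=set)


class GlobalPromptTree:
    """Hypothetical cluster-wide index of cached prompt prefixes."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.root = PromptTreeNode()

    def _blocks(self, tokens):
        # Only full blocks are indexed; a trailing partial block is ignored.
        bs = self.block_size
        return [tuple(tokens[i * bs:(i + 1) * bs]) for i in range(len(tokens) // bs)]

    def record(self, tokens, instance):
        # Mark every prefix of this prompt as cached on the given instance.
        node = self.root
        for block in self._blocks(tokens):
            node = node.children.setdefault(block, PromptTreeNode())
            node.instances.add(instance)

    def match(self, tokens):
        # Return {instance: number of leading prompt tokens it has cached}.
        matched = {}
        node = self.root
        depth = 0
        for block in self._blocks(tokens):
            node = node.children.get(block)
            if node is None:
                break
            depth += 1
            for inst in node.instances:
                matched[inst] = depth * self.block_size
        return matched


def route(tree, tokens, load):
    # Prefer the instance with the longest cached prefix; break ties by
    # lower load, then by instance name so the choice is deterministic.
    matched = tree.match(tokens)
    return min(load, key=lambda inst: (-matched.get(inst, 0), load[inst], inst))


tree = GlobalPromptTree(block_size=2)
tree.record([1, 2, 3, 4, 5, 6], "prefill-0")
tree.record([1, 2, 9, 9], "prefill-1")
load = {"prefill-0": 3, "prefill-1": 0}

shared_prefix_choice = route(tree, [1, 2, 3, 4, 7, 8], load)
no_prefix_choice = route(tree, [8, 8, 8, 8], load)
```

In this sketch the first request goes to `prefill-0` because that instance holds the longer matching prefix even though it is more heavily loaded, while the second request has no cached prefix anywhere and falls back to the least-loaded instance. A real scheduler would also need to handle cache eviction and weigh the cost of transferring a cache against recomputing it, both of which this example omits.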
