Stateful Large Language Model Serving with Pensieve

Lingfan Yu; Jinkun Lin; Jinyang Li

TL;DR

Pensieve tackles the inefficiency of stateless LLM serving in multi-turn conversations by introducing a stateful two-tier GPU-CPU KV cache that preserves and reuses processed context across requests. It couples this caching strategy with a novel multi-token attention kernel that can operate over non-contiguous KV memory, and a unified batch scheduler that processes prefill and generation phases together. The approach includes chunk-based eviction with ahead-of-time swapping and pipelined recovery, plus mechanisms to handle dropped tokens via recomputation. Empirical results show substantial throughput gains (up to 3.0x) over strong stateless baselines on large models, with notable latency reductions and improved scalability on multi-GPU deployments.

Abstract

Large Language Models (LLMs) are wildly popular today and it is important to serve them efficiently. Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, a growing log of the conversation history must be processed alongside any request by the serving system at each turn, resulting in repeated processing. In this paper, we design $Pensieve$, a system optimized for multi-turn conversation LLM serving. $Pensieve$ maintains the conversation state across requests by caching previously processed history to avoid duplicate processing. $Pensieve$'s multi-tier caching strategy can utilize both GPU and CPU memory to efficiently store and retrieve cached data. $Pensieve$ also generalizes the recent PagedAttention kernel to support attention between multiple input tokens with a GPU cache spread over non-contiguous memory. Our evaluation shows that $Pensieve$ can achieve $1.14$-$3.0\times$ the throughput of vLLM and TensorRT-LLM and significantly reduce latency.
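To make the cost of repeated processing concrete, the following minimal sketch counts how many tokens must go through prefill over a conversation, with and without cached history. It is an idealized illustration, not Pensieve's implementation: the function name `prefill_tokens` and the turn lengths are hypothetical, and the stateful case assumes every previously processed token stays cached (no eviction, no dropped tokens).

```python
def prefill_tokens(turns, stateful):
    """Count tokens processed by prefill across a multi-turn conversation.

    turns: list of (prompt_len, response_len) pairs, one per turn.
    stateful: if True, assume all earlier tokens remain cached (idealized).
    """
    total = 0
    history = 0  # tokens accumulated from earlier prompts and responses
    for prompt_len, response_len in turns:
        if stateful:
            # Only the new prompt needs prefill; history is served from cache.
            total += prompt_len
        else:
            # Stateless serving reprocesses the whole history with each request.
            total += history + prompt_len
        history += prompt_len + response_len
    return total


# Hypothetical conversation: 4 turns, 32-token prompts, 200-token responses.
turns = [(32, 200)] * 4
stateless_total = prefill_tokens(turns, stateful=False)
stateful_total = prefill_tokens(turns, stateful=True)
print(stateless_total, stateful_total)  # 1520 128
```

Under these assumptions the stateless count grows quadratically with the number of turns, while the stateful count grows linearly. A real system falls somewhere in between once cache capacity limits force eviction and recomputation.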

Paper Structure (50 sections, 3 equations, 15 figures, 3 tables)

This paper contains 50 sections, 3 equations, 15 figures, 3 tables.

Introduction
Background
LLM and the Attention Mechanism
How LLM is Served
The prefill vs. generation phase
Iteration-level batching
Memory management
Motivation and Challenges
Motivation
Challenges
Limited GPU memory for caching
Token-level cache management and recovery
Handling non-contiguous KV cache
System Design
System Overview
...and 35 more sections

Figures (15)

Figure 1: How inference is done for Transformer-based LLMs
Figure 2: Existing serving systems process a cumulative history repeatedly with each request in a multi-turn conversation.
Figure 3: Execution time for a batch of 32 requests performing prompt (32 tokens) prefill and generations for 200 steps.
Figure 4: Execution time of attention operation for a chunk of 32 tokens with different context sizes. Results are normalized by the execution time of non-attention operations in a transformer layer.
Figure 5: Layout of a typical request's KV-token context. The shaded areas, which occur at both ends of the context, mark those tokens that must be processed by the prefill phase.
...and 10 more figures
