Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

Bin Gao; Zhuomin He; Puru Sharma; Qingxuan Kang; Djordje Jevdjic; Junbo Deng; Xingkun Yang; Zhou Yu; Pengfei Zuo

TL;DR

CachedAttention tackles the high cost of multi-turn LLM serving by reusing key-value (KV) caches across conversation turns instead of recomputing them. It introduces AttentionStore, a hierarchical KV caching system with layer-wise pre-loading, asynchronous saving, and scheduler-aware fetching and eviction, plus decoupled positional encoding so that saved KV caches stay valid after context window overflow. On real datasets and across multiple models, time to first token (TTFT) drops by up to $87\%$, prompt prefilling throughput improves by up to $7.8\times$, and end-to-end inference cost falls by up to $70\%$. By extending KV caching beyond GPU memory into cheaper memory and storage tiers, the approach makes serving long-running dialogues more scalable and cost-efficient. Overlapped KV access, storage-tier placement, and truncation-safe KV caching together target production deployments that need high throughput and low latency.

Abstract

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation overheads. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, CachedAttention employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, CachedAttention enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that CachedAttention significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8$\times$ for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
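The scheduler-aware fetching and eviction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `lookahead` parameter, and the "evict the session whose next job is farthest away" rule are assumptions chosen to show how a job queue could serve as a placement hint.

```python
def pick_eviction_victim(cached_sessions, scheduler_queue):
    """Pick the cached session whose next queued job is farthest away.

    cached_sessions: list of session ids whose KV caches sit in the fast tier.
    scheduler_queue: session ids of waiting jobs, in execution order.
    Both names are hypothetical; the policy itself is an assumption.
    """
    next_use = {}
    for position, session_id in enumerate(scheduler_queue):
        # Record only the earliest queue position of each session.
        next_use.setdefault(session_id, position)
    # Sessions with no queued job count as infinitely far away.
    return max(cached_sessions, key=lambda s: next_use.get(s, float("inf")))


def sessions_to_prefetch(in_memory, scheduler_queue, lookahead):
    """List sessions in the next `lookahead` jobs that are not yet in memory."""
    wanted = []
    for session_id in scheduler_queue[:lookahead]:
        if session_id not in in_memory and session_id not in wanted:
            wanted.append(session_id)
    return wanted


if __name__ == "__main__":
    queue = ["b", "a", "b", "d"]
    print(pick_eviction_victim(["a", "b", "c"], queue))
    print(sessions_to_prefetch({"a", "b"}, queue, lookahead=4))
```

Under these assumptions, a session with no pending job is the first to be moved to a slower tier, and sessions near the head of the queue are fetched into memory before their jobs run.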

Paper Structure (29 sections, 2 equations, 25 figures, 2 tables)

Introduction
Background and Motivation
Generative LLM Inference Basics
Autoregressive Generation
Multi-turn Conversation Inference
Opportunities and Challenges
The CachedAttention Design
Overview
Overlapped KV Cache Access
Layer-wise Pre-loading from Memory to HBMs
Asynchronous Saving from HBMs to Memory
Hierarchical KV Cache Placement
Scheduler-aware Fetching from Disks to Memory
Scheduler-aware Eviction from Memory to Disks
Decoupled KV Cache Truncation
...and 14 more sections

Figures (25)

Figure 1: Prefilling and decoding phases. Latency measured for LLaMA-70B with a batch size of 8 on 4 A100 GPUs.
Figure 2: (a) Distribution of the number of conversation turns in ShareGPT [sharegpt_sharegptraw_2024]. (b) Distribution of session lengths in ShareGPT. For readability, the statistics exclude conversations with more than 40 turns and sessions longer than 32K.
Figure 3: Comparison of recomputation and CachedAttention.
Figure 4: Recomputation inefficiencies. (a) Average numbers of historical tokens and new tokens in different turns of ShareGPT [sharegpt_sharegptraw_2024]. (b) GPU time for prefilling all tokens versus only the new input tokens in ShareGPT with Mistral-7B [jiang2023mistral] on 1 A100 GPU.
Figure 5: The system architecture of CachedAttention.
...and 20 more figures
