vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Ramya Prabhu; Ajay Nayak; Jayashree Mohan; Ramachandran Ramjee; Ashish Panwar

TL;DR

vAttention addresses the fragmentation and contiguity problems of dynamic KV-cache memory management in LLM serving. It decouples virtual and physical memory using CUDA VMM APIs: virtual KV-cache buffers are reserved up front, which preserves virtual contiguity, and physical memory is mapped on demand. Latency-hiding and fragmentation-mitigating optimizations, including smaller page sizes and tensor-slicing alternatives, complement this design. Across multiple models and back-ends, vAttention improves end-to-end throughput and decode performance, and it remains portable to FA3. Compared with PagedAttention, it reduces developer complexity and simplifies the integration of modern attention kernels.

Abstract

PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation -- a phenomenon that crippled the batch size (and consequently throughput) in prior systems. However, in trying to allocate physical memory at runtime, PagedAttention ends up changing the virtual memory layout of the KV cache from contiguous to non-contiguous. Such a design leads to non-trivial programming and performance overheads. We present vAttention -- an approach that mitigates fragmentation in physical memory while retaining the contiguity of KV cache in virtual memory. We achieve this by decoupling the allocation of virtual and physical memory using CUDA virtual memory management APIs. We also introduce various LLM-specific optimizations to address the limitations of CUDA virtual memory support. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention: it supports various attention kernels out-of-the-box and improves LLM serving throughput by up to 1.23x compared to the use of PagedAttention-based kernels of FlashAttention and FlashInfer.
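
The decoupling described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation and calls no CUDA API; it only simulates the bookkeeping in plain Python, under the assumption that each request slot owns a contiguous virtual range reserved up front and that physical pages are attached to it on demand. All names (`ToyKVCache`, `VirtualTensor`, `PAGE_SIZE`, `admit`, `ensure`, `finish`) are hypothetical, and the page size is expressed in tokens purely for readability. The slot-reuse policy in `admit` is likewise an assumption, chosen to mirror the deferred reclamation shown in Figure 5.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4  # tokens per physical page (hypothetical; real pages are sized in bytes)


@dataclass
class VirtualTensor:
    """One request slot: a contiguous virtual range, backed page by page."""
    max_tokens: int                            # size of the up-front virtual reservation
    pages: list = field(default_factory=list)  # ids of physical pages mapped so far
    in_use: bool = False

    def capacity(self) -> int:
        # Number of tokens that currently have physical backing.
        return len(self.pages) * PAGE_SIZE


class ToyKVCache:
    def __init__(self, slots: int, max_tokens: int, physical_pages: int):
        # Virtual ranges are reserved once; no physical page is attached yet.
        self.slots = [VirtualTensor(max_tokens) for _ in range(slots)]
        self.free_pages = list(range(physical_pages))

    def admit(self) -> int:
        """Pick a free slot, preferring one that still has pages mapped."""
        free = [i for i, s in enumerate(self.slots) if not s.in_use]
        if not free:
            raise RuntimeError("no free request slot")
        best = max(free, key=lambda i: len(self.slots[i].pages))
        self.slots[best].in_use = True
        return best

    def ensure(self, slot: int, tokens: int) -> None:
        """Map physical pages until the slot can hold `tokens` tokens."""
        s = self.slots[slot]
        if tokens > s.max_tokens:
            raise ValueError("request exceeds its virtual reservation")
        while s.capacity() < tokens:
            if not self.free_pages:
                raise MemoryError("out of physical pages")
            s.pages.append(self.free_pages.pop())

    def finish(self, slot: int) -> None:
        """Deferred reclamation: free the slot but keep its pages mapped."""
        self.slots[slot].in_use = False


# Walk through the steps of Figure 5 with two request slots.
cache = ToyKVCache(slots=2, max_tokens=16, physical_pages=8)
r1 = cache.admit()
r2 = cache.admit()
cache.ensure(r1, 1)   # (b) R1 is backed by one page
cache.ensure(r1, 5)   # (c) R1 grows to two pages
cache.ensure(r2, 1)   # (c) R2 is backed by one page
cache.finish(r1)      # (d) R1 completes; its pages stay mapped
r3 = cache.admit()    # (e) R3 takes over R1's already-backed slot
```

In this model the index of a token within its slot never changes as pages are added, which is the property that lets an unmodified attention kernel treat the KV cache as one contiguous tensor. In the real system the corresponding steps would go through the CUDA driver's virtual memory management calls (for example `cuMemAddressReserve`, `cuMemCreate`, and `cuMemMap`) rather than Python lists.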

Paper Structure (61 sections, 2 equations, 16 figures, 10 tables, 1 algorithm)

Introduction
Background
Large Language Models
Fragmentation and PagedAttention
Issues with the PagedAttention Approach
Requires Re-writing the Attention Kernel
Adds Redundancy in the Serving Framework
Performance Overhead
Runtime overhead on the GPU
Runtime overhead on the CPU
Insights into LLM Serving Systems
vAttention: Design and Implementation
Design Overview
Pre-reserving virtual memory
Number of virtual memory buffers
...and 46 more sections

Figures (16)

Figure 1: PagedAttention involves two layers of memory management: one in user space and one in OS kernel space.
Figure 2: Overhead of PagedAttention in prefill kernels (model: Llama-3-8B, one A100 GPU). Numbers on top show overhead over the corresponding non-paged implementation of FlashAttention-2 (FA2) and FlashInfer (FI).
Figure 3: Latency of vLLM's paged decode kernel is sensitive to block size (model: Llama-3-8B, one A100 GPU).
Figure 4: Decode throughput (top) and the rate of physical memory allocation (bottom) saturate at large batch sizes.
Figure 5: Dynamic memory management in vAttention for a single K cache (or V cache) tensor. (a) shows a virtual tensor for a batch of two requests with no physical memory allocation yet. (b) R1 is allocated one physical page. (c) R1 is allocated two pages and R2 is allocated one page. (d) R1 has completed but vAttention does not reclaim its memory (deferred reclamation). (e) when R3 arrives, vAttention assigns R1's tensor to it which is already backed by physical memory.
...and 11 more figures
