vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

TL;DR

vAttention tackles the fragmentation and contiguity problems of dynamic KV-cache memory management in LLM serving. It decouples virtual from physical memory using CUDA virtual memory management (VMM) APIs and pre-reserves virtual KV buffers, which preserves virtual contiguity while physical memory is backed on demand. The approach adds latency-hiding and fragmentation-mitigating optimizations, including smaller page sizes and tensor-slicing alternatives, and is shown to improve end-to-end throughput and decode performance across multiple models and back-ends while remaining portable to FA3. Overall, vAttention offers a principled, portable alternative to PagedAttention that reduces developer complexity, simplifies integration of modern attention kernels, and boosts serving throughput in real-world workloads.

Abstract

PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation -- a phenomenon that crippled the batch size (and consequently throughput) in prior systems. However, in trying to allocate physical memory at runtime, PagedAttention ends up changing the virtual memory layout of the KV cache from contiguous to non-contiguous. Such a design leads to non-trivial programming and performance overheads. We present vAttention -- an approach that mitigates fragmentation in physical memory while retaining the contiguity of KV cache in virtual memory. We achieve this by decoupling the allocation of virtual and physical memory using CUDA virtual memory management APIs. We also introduce various LLM-specific optimizations to address the limitations of CUDA virtual memory support. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention: it supports various attention kernels out-of-the-box and improves LLM serving throughput by up to 1.23x compared to the use of PagedAttention-based kernels of FlashAttention and FlashInfer.
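
The decoupling described in the abstract can be illustrated with a small accounting model. The sketch below is a plain-Python toy, not the authors' implementation and not CUDA code: the class `VirtualKVBuffer`, its fields, and the numbers (2048 bytes per token, a 2 MiB physical page, an 8192-token maximum context) are assumptions chosen for illustration. In a real system the "reserve" and "map" steps would correspond to CUDA driver calls such as `cuMemAddressReserve`, `cuMemCreate`, and `cuMemMap`.

```python
from dataclasses import dataclass


@dataclass
class VirtualKVBuffer:
    """Toy model of one request's slice of a K (or V) cache tensor.

    Virtual address space is reserved up front for max_tokens, so the
    buffer stays contiguous. Physical pages are counted as mapped only
    once the context actually grows into them.
    All names and sizes here are hypothetical.
    """
    max_tokens: int
    bytes_per_token: int
    page_bytes: int
    mapped_pages: int = 0

    @property
    def reserved_bytes(self) -> int:
        # Virtual reservation: costs address space, not physical memory.
        return self.max_tokens * self.bytes_per_token

    @property
    def mapped_bytes(self) -> int:
        # Physical memory actually backing the buffer right now.
        return self.mapped_pages * self.page_bytes

    def pages_needed(self, context_len: int) -> int:
        total = context_len * self.bytes_per_token
        return (total + self.page_bytes - 1) // self.page_bytes  # ceiling

    def grow(self, context_len: int) -> int:
        """Map just enough pages to cover context_len; return pages added."""
        if context_len > self.max_tokens:
            raise ValueError("context exceeds the reserved virtual range")
        added = max(0, self.pages_needed(context_len) - self.mapped_pages)
        self.mapped_pages += added
        return added


# Assumed sizes: 2048 B/token, 2 MiB pages, 8192-token maximum context.
buf = VirtualKVBuffer(max_tokens=8192, bytes_per_token=2048,
                      page_bytes=2 * 1024 * 1024)
first = buf.grow(1)        # first token forces one page to be mapped
same = buf.grow(1024)      # 1024 tokens still fit in that page
second = buf.grow(1025)    # crossing the page boundary maps one more
```

In this toy, the virtual reservation stays fixed at 16 MiB while the physical footprint grows one page at a time. Because the buffer is contiguous in virtual memory, a kernel written for a non-paged KV cache could index it directly.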

Paper Structure

This paper contains 61 sections, 2 equations, 16 figures, 10 tables, and 1 algorithm.

Figures (16)

  • Figure 1: PagedAttention involves two layers of memory management: one in user space and one in OS kernel space.
  • Figure 2: Overhead of PagedAttention in prefill kernels (model: Llama-3-8B, one A100 GPU). Numbers on top show overhead over the corresponding non-paged implementation of FlashAttention-2 (FA2) and FlashInfer (FI).
  • Figure 3: Latency of vLLM's paged decode kernel is sensitive to block size (model: Llama-3-8B, one A100 GPU).
  • Figure 4: Decode throughput (top) and the rate of physical memory allocation (bottom) saturate at large batch sizes.
  • Figure 5: Dynamic memory management in vAttention for a single K cache (or V cache) tensor. (a) shows a virtual tensor for a batch of two requests with no physical memory allocation yet. (b) R1 is allocated one physical page. (c) R1 is allocated two pages and R2 is allocated one page. (d) R1 has completed but vAttention does not reclaim its memory (deferred reclamation). (e) when R3 arrives, vAttention assigns R1's tensor to it which is already backed by physical memory.
  • ...and 11 more figures
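
The sequence in the Figure 5 caption, including deferred reclamation, can be traced with a short simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm: the class `VirtualKVTensor`, its method names, the pool of 8 physical pages, and the policy of preferring the free slot with the most mapped pages are all hypothetical choices made so that the example reproduces states (a) through (e).

```python
class VirtualKVTensor:
    """Toy model of one virtual K (or V) cache tensor shared by a batch.

    Each batch slot is a contiguous virtual sub-tensor. Physical pages
    are mapped to a slot on demand and, with deferred reclamation, stay
    mapped after the request that used them completes.
    """

    def __init__(self, num_slots: int, total_pages: int):
        self.free_pages = total_pages        # unmapped physical pages
        self.mapped = [0] * num_slots        # pages mapped per slot
        self.owner = [None] * num_slots      # request occupying each slot

    def admit(self, req_id: str) -> int:
        """Assign a free slot, preferring one that is already backed."""
        free = [i for i, o in enumerate(self.owner) if o is None]
        if not free:
            raise RuntimeError("no free slot in the batch")
        slot = max(free, key=lambda i: self.mapped[i])
        self.owner[slot] = req_id
        return slot

    def ensure_pages(self, slot: int, pages: int) -> int:
        """Back the slot with at least `pages` pages; return newly mapped."""
        extra = max(0, pages - self.mapped[slot])
        if extra > self.free_pages:
            raise MemoryError("out of physical pages")
        self.free_pages -= extra
        self.mapped[slot] += extra
        return extra

    def finish(self, slot: int) -> None:
        """Release the slot but keep its pages mapped (deferred reclamation)."""
        self.owner[slot] = None


kv = VirtualKVTensor(num_slots=2, total_pages=8)   # (a) nothing mapped yet
r1 = kv.admit("R1")
r2 = kv.admit("R2")
kv.ensure_pages(r1, 1)                             # (b) R1 has one page
kv.ensure_pages(r1, 2)                             # (c) R1 has two pages,
kv.ensure_pages(r2, 1)                             #     R2 has one
kv.finish(r1)                                      # (d) R1 done, pages kept
r3 = kv.admit("R3")                                # (e) R3 reuses R1's slot
new_for_r3 = kv.ensure_pages(r3, 2)                # no new mapping needed
```

In this toy, R3 starts with two pages already mapped, so its first allocations cost nothing. A real allocator would still need a policy for reclaiming pages from idle slots when the physical pool runs low, which is left out here.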