vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
TL;DR
vAttention addresses the fragmentation and contiguity problems of dynamic KV-cache memory management in LLM serving. It decouples virtual and physical memory allocation using the CUDA VMM APIs: virtual KV-cache buffers are reserved up front, which preserves virtual contiguity, and physical memory is mapped on demand. Additional optimizations hide allocation latency and limit fragmentation, including smaller page sizes and tensor-slicing alternatives. The approach improves end-to-end throughput and decode performance across multiple models and back-ends, and it is portable to FlashAttention-3 (FA3). Compared to PagedAttention, vAttention reduces developer complexity and simplifies the integration of modern attention kernels.
Abstract
PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation -- a phenomenon that crippled the batch size (and consequently throughput) in prior systems. However, in trying to allocate physical memory at runtime, PagedAttention ends up changing the virtual memory layout of the KV cache from contiguous to non-contiguous. Such a design leads to non-trivial programming and performance overheads. We present vAttention -- an approach that mitigates fragmentation in physical memory while retaining the contiguity of KV cache in virtual memory. We achieve this by decoupling the allocation of virtual and physical memory using CUDA virtual memory management APIs. We also introduce various LLM-specific optimizations to address the limitations of CUDA virtual memory support. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention: it supports various attention kernels out-of-the-box and improves LLM serving throughput by up to 1.23x compared to the use of PagedAttention-based kernels of FlashAttention and FlashInfer.
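The decoupling described in the abstract can be illustrated with a small host-side simulation. This is a minimal sketch, not the authors' implementation: it uses no GPU and no CUDA calls, and the names `VirtualKVBuffer`, `PagePool`, `grow`, and `release`, the 2 MiB page size, and the 128 KiB of KV cache per token are all assumptions chosen for illustration. In a real system the reservation and mapping steps would presumably go through CUDA driver calls such as `cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`, and `cuMemSetAccess`; the abstract itself does not name specific functions.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 2 * 1024 * 1024  # bytes per physical page (assumed granularity)


@dataclass
class VirtualKVBuffer:
    """Toy model of one request's KV cache: a contiguous virtual range
    reserved up front, with physical pages attached only as needed."""
    reserved_bytes: int    # virtual size, sized for the maximum context length
    bytes_per_token: int   # KV-cache footprint of one token (hypothetical value)
    mapped_pages: list = field(default_factory=list)  # page ids in virtual order

    def pages_needed(self, num_tokens: int) -> int:
        used = num_tokens * self.bytes_per_token
        return -(-used // PAGE_SIZE)  # ceiling division


class PagePool:
    """Toy model of the GPU's physical memory, handed out page by page."""

    def __init__(self, total_pages: int):
        self.free = list(range(total_pages))

    def grow(self, buf: VirtualKVBuffer, num_tokens: int) -> None:
        """Back the first num_tokens of buf with physical pages."""
        if num_tokens * buf.bytes_per_token > buf.reserved_bytes:
            raise ValueError("request exceeds its reserved virtual range")
        need = buf.pages_needed(num_tokens) - len(buf.mapped_pages)
        if need > len(self.free):
            raise MemoryError("out of physical pages")
        for _ in range(max(need, 0)):
            # New pages are appended at the end of the virtual range, so the
            # buffer stays contiguous from the kernel's point of view.
            buf.mapped_pages.append(self.free.pop())

    def release(self, buf: VirtualKVBuffer) -> None:
        """Return all physical pages of a finished request to the pool."""
        self.free.extend(buf.mapped_pages)
        buf.mapped_pages.clear()


# Hypothetical sizes: 128 KiB of KV cache per token, 32768-token max context.
BYTES_PER_TOKEN = 128 * 1024
MAX_TOKENS = 32768

pool = PagePool(total_pages=512)  # 1 GiB of physical memory
buf = VirtualKVBuffer(reserved_bytes=MAX_TOKENS * BYTES_PER_TOKEN,
                      bytes_per_token=BYTES_PER_TOKEN)

pool.grow(buf, num_tokens=1000)   # prefill: map only what the prompt needs
pool.grow(buf, num_tokens=1024)   # decode: map one more page as tokens arrive
static_pages = buf.reserved_bytes // PAGE_SIZE  # pages a full up-front allocation would pin
print(len(buf.mapped_pages), "pages mapped vs", static_pages, "reserved virtually")
pool.release(buf)
```

With these assumed sizes, one page holds 16 tokens, so the request occupies a small fraction of the pages that its virtual reservation spans. Allocating the full reservation physically would exceed this toy pool for even a single request, which is the kind of waste that on-demand physical allocation is meant to avoid.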
