vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng

TL;DR

vTensor introduces a GPU virtual memory abstraction that decouples memory defragmentation from computation in LLM serving. FlexInfer, the serving framework built on it, combines a CPU–GPU scheduler with the vTensor Manager to provide fragmentation-free KV-cache management while preserving kernel-level performance across GQA/MQA and prefix-cache regimes. Across multiple Yi-series models and end-to-end scenarios, FlexInfer achieves an average end-to-end speedup of about 1.86× (up to 2.42× in multi-turn chat) and kernel speedups of up to 3.92×, while freeing roughly 71% (57 GB) of memory on an A100 relative to vLLM, which leaves room for more memory-intensive workloads. The work demonstrates a practical path toward scalable, cost-efficient LLM serving by separating memory management from computation kernels through GPU virtual memory and careful CPU–GPU scheduling.

Abstract

Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference heavily memory-bound. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using PagedAttention, they still suffer from inefficient memory and computational operations because page management is tightly coupled with the computation kernels. This study introduces vTensor, a tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses these limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating various computation kernels across different LLM architectures. Experimental results indicate that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. Additionally, vTensor provides average speedups of 2.12x and 3.15x in kernel evaluation, reaching up to 3.92x and 3.27x compared to SGLang Triton prefix-prefilling kernels and the vLLM PagedAttention kernel, respectively. Furthermore, it frees approximately 71.25% (57 GB) of memory on the NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.
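
As a rough illustration of the decoupling idea described in the abstract, the toy Python model below reserves a contiguous virtual range for a KV cache and attaches physical chunks only as tokens arrive. It is a sketch under stated assumptions, not the authors' implementation and not the CUDA VMM API: `PhysicalPool`, `VirtualTensor`, and `CHUNK_TOKENS` are hypothetical names, and chunk IDs stand in for physical GPU memory handles.

```python
from dataclasses import dataclass, field

CHUNK_TOKENS = 16  # hypothetical: number of tokens one physical chunk can hold


class PhysicalPool:
    """Toy pool of fixed-size physical chunks (stand-in for GPU physical memory)."""

    def __init__(self, num_chunks: int) -> None:
        self.free_chunks = list(range(num_chunks))

    def alloc(self) -> int:
        if not self.free_chunks:
            raise MemoryError("physical pool exhausted")
        return self.free_chunks.pop()

    def release(self, chunk_id: int) -> None:
        self.free_chunks.append(chunk_id)


@dataclass
class VirtualTensor:
    """Contiguous virtual range whose physical backing grows on demand."""

    pool: PhysicalPool
    max_tokens: int                              # size of the reserved virtual range
    num_tokens: int = 0                          # tokens currently stored
    mapped: list = field(default_factory=list)   # virtual chunk index -> physical chunk id

    def append(self, n: int) -> None:
        """Extend by n tokens, mapping new physical chunks only when needed."""
        new_total = self.num_tokens + n
        if new_total > self.max_tokens:
            raise ValueError("exceeds reserved virtual range")
        needed = -(-new_total // CHUNK_TOKENS)   # ceiling division
        while len(self.mapped) < needed:
            self.mapped.append(self.pool.alloc())
        self.num_tokens = new_total

    def release_all(self) -> None:
        """Unmap every chunk and return it to the pool."""
        for chunk_id in self.mapped:
            self.pool.release(chunk_id)
        self.mapped.clear()
        self.num_tokens = 0


pool = PhysicalPool(num_chunks=64)
kv = VirtualTensor(pool, max_tokens=512)
kv.append(20)   # prefill of 20 tokens maps 2 chunks
kv.append(1)    # one decode step still fits in the same 2 chunks
print(len(kv.mapped), len(pool.free_chunks))  # prints "2 62"
```

In this model a compute kernel would see only the contiguous virtual range, while the mapping table is maintained separately; a request reserved for 512 tokens occupies 2 physical chunks here rather than the 32 that preallocating its maximum length would take.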

Paper Structure (53 sections, 12 figures, 1 table)

Figures (12)

  • Figure 1: Three KV cache memory management strategies: (a) a native KV cache using native GPU allocation suffers from heavy fragmentation; (b) vLLM uses paged memory management to eliminate most fragmentation, but its kernels are tightly coupled to the paging scheme; and (c) vTensor decouples computation from memory allocation, allowing more flexible management.
  • Figure 2: GPU memory usage breakdown for FlashAttention (Native), vLLM, and FlexInfer on an A100 GPU with 80 GB of memory.
  • Figure 3: Roofline Model for LLM Attention on GPU A100.
  • Figure 4: Overview of FlexInfer serving framework.
  • Figure 5: The design of vTensor.
  • ...and 7 more figures