Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching

Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Jiesheng Wu, Feng Lyu

TL;DR

Memory-bound bottlenecks in LLM inference arise from loading the KV Cache from off-chip memory. The authors present an L2-cache-oriented asynchronous KV Cache prefetching strategy that overlaps computation with memory access: on Hopper-based GPUs, KV data is prefetched into L2 so that later accesses hit in cache. Experiments on NVIDIA H20 show up to 2.15x attention kernel speedup and up to 1.97x end-to-end throughput, outperforming FlashAttention-3 and other baselines while remaining compatible with current inference frameworks. The approach is a scalable latency-hiding technique for next-generation LLM inference engines and can be combined with other optimizations for broader gains.

Abstract

Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method that breaks through the memory bandwidth bottleneck in LLM inference by overlapping computation with memory loads. By scheduling otherwise idle memory bandwidth during active computation windows, our method proactively prefetches the required KV Cache into the GPU L2 cache, enabling high-speed L2 cache hits for subsequent accesses and effectively hiding HBM access latency within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves a 2.15x improvement in attention kernel efficiency and up to a 1.97x improvement in end-to-end throughput, surpassing the state-of-the-art baseline FlashAttention-3. Notably, our solution is orthogonal to existing optimization techniques and can be integrated with current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.
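
The computation-load overlap described in the abstract can be illustrated with a toy timing model. The sketch below is not the authors' kernel and is not meant to reproduce the reported figures. It assumes the KV Cache is processed in equal-sized blocks, that the prefetch of block i+1 runs fully in parallel with the computation on block i, and that L2 capacity and bandwidth contention can be ignored. The function names and the parameters `t_load`, `t_compute`, and `t_l2_hit` are hypothetical and chosen only for illustration.

```python
def baseline_time(n_blocks: int, t_load: float, t_compute: float) -> float:
    """No overlap: every KV block is fetched from HBM, then computed on."""
    if n_blocks <= 0:
        return 0.0
    return n_blocks * (t_load + t_compute)


def prefetch_time(n_blocks: int, t_load: float, t_compute: float,
                  t_l2_hit: float = 0.0) -> float:
    """Overlap: while block i is computed, block i+1 is prefetched into L2.

    Toy model (assumption): the prefetch and the computation proceed fully
    in parallel, and a prefetched block costs t_l2_hit to read from L2.
    """
    if n_blocks <= 0:
        return 0.0
    total = t_load  # the first block's load cannot be hidden behind compute
    for i in range(n_blocks):
        # Blocks after the first are read from L2 instead of HBM.
        work = t_compute + (t_l2_hit if i > 0 else 0.0)
        if i < n_blocks - 1:
            # The next iteration cannot start until its prefetch has landed.
            total += max(work, t_load)
        else:
            total += work
    return total


def speedup(n_blocks: int, t_load: float, t_compute: float,
            t_l2_hit: float = 0.0) -> float:
    """Ratio of baseline time to overlapped time in this toy model."""
    if n_blocks <= 0:
        raise ValueError("n_blocks must be at least 1")
    return (baseline_time(n_blocks, t_load, t_compute)
            / prefetch_time(n_blocks, t_load, t_compute, t_l2_hit))
```

For example, with four blocks and equal load and compute times of one unit each, the baseline takes 8 units and the overlapped schedule takes 5, a 1.6x speedup in this model. When `t_load` exceeds the per-block work, the load time still dominates each iteration, which is why this kind of overlap hides latency but does not add bandwidth.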

Paper Structure

This paper contains 23 sections, 2 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Schematic illustration of a Stall Long Scoreboard event on an NVIDIA GPU.
  • Figure 2: $Q \cdot K^T$ computation flow in a single iteration for native XFormers and the proposed method, illustrated with a thread block configuration containing 4 warps.
  • Figure 3: Single-GPU end-to-end inference throughput comparison across backends with fixed 2048 output tokens on H20.
  • Figure 4: Speedups of the proposed method over native XFormers across varying batch sizes and output sequence lengths, evaluated on a single NVIDIA H20 GPU.
  • Figure 5: Multi-GPU end-to-end inference throughput comparison across backends under fixed 4096 output tokens with batch size 64, benchmarked on NVIDIA H20 GPUs.