POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar

TL;DR

POD-Attention targets the attention bottleneck in hybrid-batching LLM inference with a single fused GPU kernel that computes prefill and decode attention concurrently, so that both compute and memory bandwidth are kept busy. The method relies on SM-aware CTA scheduling to co-locate prefill and decode work on the same streaming multiprocessor (SM), together with targeted optimizations (tile sizes, concurrent CTAs per SM, virtual decode CTAs, and limiting prefill splits) that balance resource contention. Empirical results show attention speedups of up to 59% (mean 28%), end-to-end throughput gains of up to 22%, and notable energy savings, alongside lower time-to-first-token (TTFT) and tail time-between-tokens (TBT) in online inference. The approach improves interactivity and efficiency for long-context LLM workloads and outperforms state-of-the-art baselines such as Sarathi-Serve and vLLM across multiple models and workloads.

Abstract

Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. This approach optimizes linear operations but remains inefficient for attention computation because existing attention kernels specialize execution independently for the prefill and decode phases. In this paper, we present POD-Attention - the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. POD-Attention speeds up attention computation by up to $59\%$ (mean $28\%$), enabling higher throughput and lower latency LLM inference compared to the use of independently optimized prefill and decode attention kernels.
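The intuition behind overlapping a compute-bound phase with a memory-bandwidth-bound phase can be illustrated with a simple roofline-style estimate. The sketch below is not the paper's cost model: the `KernelCost` type, the helper functions, and all timing numbers are hypothetical and chosen only to show why running the two kernels back to back leaves resources idle, while an idealized fused kernel is bounded by whichever resource is busier overall.

```python
from dataclasses import dataclass


@dataclass
class KernelCost:
    """Hypothetical per-kernel cost, split by the resource that limits it."""
    compute_ms: float  # time if the kernel were limited only by compute
    memory_ms: float   # time if the kernel were limited only by memory bandwidth

    @property
    def runtime_ms(self) -> float:
        # Run alone, a kernel is bound by its slower resource.
        return max(self.compute_ms, self.memory_ms)


def serial_runtime(prefill: KernelCost, decode: KernelCost) -> float:
    """Separate kernels run one after the other: runtimes add up."""
    return prefill.runtime_ms + decode.runtime_ms


def fused_lower_bound(prefill: KernelCost, decode: KernelCost) -> float:
    """Idealized perfect overlap: each resource serves both phases,
    and the busier resource determines the runtime."""
    total_compute = prefill.compute_ms + decode.compute_ms
    total_memory = prefill.memory_ms + decode.memory_ms
    return max(total_compute, total_memory)


# Made-up numbers: prefill is compute-bound, decode is memory-bound.
prefill = KernelCost(compute_ms=10.0, memory_ms=2.0)
decode = KernelCost(compute_ms=1.0, memory_ms=8.0)

serial = serial_runtime(prefill, decode)    # 10.0 + 8.0
fused = fused_lower_bound(prefill, decode)  # max(11.0, 10.0)
print(f"serial: {serial} ms, idealized fused: {fused} ms")
```

With these illustrative inputs the serial schedule takes 18 ms while the idealized overlap bound is 11 ms. A real fused kernel will fall short of this bound because of contention for shared memory, registers, and scheduling slots, which is the problem the resource allocation described in the abstract is meant to manage.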

Paper Structure

This paper contains 44 sections, 1 equation, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Impact of scheduling strategies on TTFT and TBT.
  • Figure 2: Computation in hybrid batches. Current systems compute prefill inputs ($e_1...e_p$) and decode inputs ($e_{p + 1}...e_{p+d}$) together for linear operations. However, they compute prefill and decode attention separately using specialized kernels.
  • Figure 3: Contribution of different operations in iteration runtime with hybrid batching (model: Llama-3-8B, batch size: 60, chunk size: 1K). For each context length, we show runtime of iteration that processes the last chunk of a prompt.
  • Figure 4: GPU execution model.
  • Figure 5: Per layer attention runtime of 32 hybrid batches corresponding to chunked prefills of a request of 16K tokens (chunk size: 512, model: Yi-6B, d_bs: decode batch size).
  • ...and 9 more figures