Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects

Mansi Choudhary, Karthik Sangaiah, Sonali Singh, Muhammad Osama, Lisa Wu Wills, Ganesh Dasika

TL;DR

The paper addresses the performance bottlenecks of large-scale attention on disaggregated, chiplet-based GPUs by analyzing NUMA-induced memory locality issues. It introduces Swizzled Head-first Mapping, a spatially-aware scheduling strategy that aligns attention heads with NUMA domains to maximize cache reuse within each Accelerator Complex Die (XCD), particularly leveraging the data-sharing patterns of FlashAttention2 (FA2). On AMD's MI300X, this approach yields up to 50% higher attention performance and sustains high L2 cache hit rates (80-97%) with minimal code changes. The findings argue that NUMA-aware kernel design is essential for scalable AI training and inference on next-generation disaggregated GPUs, offering a practical path to improved efficiency in large models and long-context workloads.

Abstract

The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA effects distort locality in multi-head attention (MHA) and present Swizzled Head-first Mapping, a spatially-aware scheduling strategy that aligns attention heads with GPU NUMA domains to exploit intra-chiplet cache reuse. On AMD's MI300X architecture, our method achieves up to 50% higher performance over state-of-the-art attention algorithms using conventional scheduling techniques and sustains consistently high L2 cache hit rates of 80-97%. These results demonstrate that NUMA-aware scheduling is now fundamental to achieving full efficiency on next-generation disaggregated GPUs, offering a path forward for scalable AI training and inference.
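
The idea of aligning attention heads with NUMA domains can be illustrated with a small index-mapping sketch. This is not the authors' kernel code. It assumes, purely for illustration, that the hardware hands out consecutive workgroup IDs to XCDs in round-robin order, that there are 8 XCDs, and that the number of heads divides evenly across them. The function names (`swizzled_head_first`, `naive_head_first`, `xcds_per_head`) are hypothetical.

```python
NUM_XCDS = 8  # assumption: 8 XCDs, workgroup `wid` is dispatched to XCD `wid % NUM_XCDS`


def naive_head_first(wid, num_heads, blocks_per_head):
    """Conventional mapping: consecutive workgroup IDs walk through one head's
    row blocks before moving to the next head."""
    head, block = divmod(wid, blocks_per_head)
    return head, block


def swizzled_head_first(wid, num_heads, blocks_per_head, num_xcds=NUM_XCDS):
    """Hypothetical swizzle: choose (head, block) so that every block of a given
    head is executed by workgroups that land on the same XCD."""
    if num_heads % num_xcds != 0:
        raise ValueError("this sketch assumes num_heads is a multiple of num_xcds")
    xcd = wid % num_xcds             # XCD that receives this workgroup (assumed round-robin)
    slot = wid // num_xcds           # position of this workgroup within its XCD
    heads_per_xcd = num_heads // num_xcds
    local_head, block = divmod(slot, blocks_per_head)
    head = xcd * heads_per_xcd + local_head
    return head, block


def xcds_per_head(mapping, num_heads, blocks_per_head, num_xcds=NUM_XCDS):
    """Count how many distinct XCDs end up processing each head under `mapping`."""
    touched = {}
    for wid in range(num_heads * blocks_per_head):
        head, _ = mapping(wid, num_heads, blocks_per_head)
        touched.setdefault(head, set()).add(wid % num_xcds)
    return {head: len(xcds) for head, xcds in touched.items()}


if __name__ == "__main__":
    print("naive   :", xcds_per_head(naive_head_first, 16, 32))
    print("swizzled:", xcds_per_head(swizzled_head_first, 16, 32))
```

Under these assumptions the naive mapping spreads each head's row blocks over every XCD, so each private L2 has to fetch that head's $K$ and $V$ separately, while the swizzled mapping keeps each head on a single XCD.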

Paper Structure

This paper contains 21 sections, 2 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Evolution of GPU architectures toward disaggregated memory hierarchies. (a) Traditional single-die GPU (e.g., NVIDIA A100 [nvidia_a100], H100 [nvidia_h100]; AMD MI200 series [amd_mi200]) with unified L2 cache shared across all compute units (CUs), providing uniform memory access. (b) Dual-die chiplet architecture (e.g., NVIDIA Blackwell [nvidia_blackwell], Rubin [nvidia_rubin]) with interconnects between dies. (c) Quad-die chiplet architecture (e.g., NVIDIA Rubin Ultra [nvidia_rubin_ultra], AMD MI300 series [amd_mi300]) with further disaggregation. Each die has dedicated compute units, L2 cache, and memory controllers connected to HBM stacks. While AMD and NVIDIA both employ multi-die designs, the degree to which NUMA effects are exposed to software varies by implementation. NVIDIA's Blackwell maintains full cache coherency between the dies, abstracting the NUMA effects at the hardware level, whereas AMD's MI300X explicitly exposes NUMA characteristics, enabling architecture-aware optimizations.
  • Figure 2: Impact of workgroup scheduling on cache reuse in the multi-die chiplet architecture of AMD MI300X™. When workgroups processing related tiles that share input data are scheduled to different dies (left: CU0 on Die 0, right: CU2 on Die 1), they cannot benefit from each other's cached data. This cross-die scheduling forces redundant memory fetches from HBM through the shared last-level cache (LLC), as L2 caches are private to each Accelerator Complex Die (XCD). The NUMA effects inherent to this disaggregated architecture make spatially aware workgroup placement critical for maximizing cache efficiency and minimizing memory bandwidth consumption.
  • Figure 3: Chiplet-aware workgroup ID remapping in Triton. The function transforms the original workgroup ID by determining the target XCD and calculating the new global workgroup ID to achieve spatial locality within chiplet memory domains.
  • Figure 4: FlashAttention2 Tiled Compute Partitioning Across Workgroups for a Single Attention Head. Query matrix $Q$ is partitioned into row blocks (BLOCK_M), with each workgroup (pid0-pid3) processing one row block. Each workgroup accesses the complete key matrix $K^T$ and value matrix $V$ to compute attention scores $S = QK^T$, apply softmax to obtain attention weights $P$, and produce output $O = PV$. All workgroups within the same attention head share access to the same $K$ and $V$ tensors, creating natural spatial locality patterns. Matrix dimensions are shown in parentheses, where N_CTX is the context length and HEAD_DIM is the head dimension.
  • Figure 5: Attention Algorithm Grid for $Q$, $K$, $V$, $O$ tensors. $Z$ = batch size, $H$ = number of attention heads; tensors are sized by query context length (N_CTX) and head dimension (HEAD_DIM).
  • ...and 11 more figures
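
The partitioning described in the Figure 4 caption can be sketched in NumPy for a single attention head. This is an illustrative sketch rather than the paper's kernel: each loop iteration stands in for one workgroup that owns a BLOCK_M slice of $Q$ and streams over the whole of $K$ and $V$. The $1/\sqrt{\text{HEAD\_DIM}}$ scaling, the non-causal setting, the tile size `block_n`, and all function names are assumptions added here; the running-max accumulation is the standard online-softmax technique used by FlashAttention-style kernels.

```python
import numpy as np


def attention_reference(q, k, v):
    """Plain softmax(Q K^T / sqrt(d)) V for one head, used as a correctness check."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v


def attention_row_block(q_block, k, v, block_n):
    """Work of one workgroup: a BLOCK_M slice of Q against all of K and V,
    accumulated tile by tile with an online softmax."""
    block_m, head_dim = q_block.shape
    scale = 1.0 / np.sqrt(head_dim)
    m = np.full(block_m, -np.inf)        # running row-wise maximum of the scores
    l = np.zeros(block_m)                # running softmax denominator
    acc = np.zeros((block_m, head_dim))  # running unnormalized output
    for start in range(0, k.shape[0], block_n):
        k_tile = k[start:start + block_n]
        v_tile = v[start:start + block_n]
        s = (q_block @ k_tile.T) * scale
        m_new = np.maximum(m, s.max(axis=1))
        alpha = np.exp(m - m_new)        # rescales earlier partial results
        p = np.exp(s - m_new[:, None])
        l = alpha * l + p.sum(axis=1)
        acc = alpha[:, None] * acc + p @ v_tile
        m = m_new
    return acc / l[:, None]


def attention_tiled(q, k, v, block_m=16, block_n=16):
    """Each row block of Q is independent; every block reads the same K and V."""
    out = np.empty_like(q)
    for start in range(0, q.shape[0], block_m):
        out[start:start + block_m] = attention_row_block(
            q[start:start + block_m], k, v, block_n
        )
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 64, 32))  # N_CTX = 64, HEAD_DIM = 32
    print(np.allclose(attention_tiled(q, k, v), attention_reference(q, k, v)))
```

Because every call to `attention_row_block` for a given head receives the same `k` and `v`, workgroups of one head that share an L2 cache can reuse each other's fetched tiles, which is the locality pattern the caption points to.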