Table of Contents
Fetching ...

HipKittens: Fast and Furious AMD Kernels

William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora

TL;DR

This paper tackles the challenge of delivering high-performance AI kernels on AMD GPUs by introducing HipKittens (HK), a minimal tile-based programming framework that portably expresses kernels across vendors. HK preserves the tile-based abstractions of prior DSLs while adapting memory access, register management, and cache-aware tiling to AMD CDNA3/CDNA4 architectures, including explicit register pinning and chiplet-aware swizzling. Across GEMM, MHA, and attention workloads, HK achieves performance competitive with AMD's hand-optimized assembly and outperforms compiler-based baselines, underscoring that a small, principled set of abstractions can yield high performance on diverse hardware. The work also identifies robust scheduling patterns, notably 8-wave ping-pong and 4-wave interleave, and demonstrates how cache and memory layouts interact with chiplet architectures to unlock throughput, paving the way for cross-vendor tile-based AI kernels.

Abstract

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.

HipKittens: Fast and Furious AMD Kernels

TL;DR

This paper tackles the challenge of delivering high-performance AI kernels on AMD GPUs by introducing HipKittens (HK), a minimal tile-based programming framework that portably expresses kernels across vendors. HK preserves the tile-based abstractions of prior DSLs while adapting memory access, register management, and cache-aware tiling to AMD CDNA3/CDNA4 architectures, including explicit register pinning and chiplet-aware swizzling. Across GEMM, MHA, and attention workloads, HK achieves performance competitive with AMD's hand-optimized assembly and outperforms compiler-based baselines, underscoring that a small, principled set of abstractions can yield high performance on diverse hardware. The work also identifies robust scheduling patterns, notably 8-wave ping-pong and 4-wave interleave, and demonstrates how cache and memory layouts interact with chiplet architectures to unlock throughput, paving the way for cross-vendor tile-based AI kernels.

Abstract

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by (e.g., attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.

Paper Structure

This paper contains 59 sections, 1 equation, 24 figures, 5 tables.

Figures (24)

  • Figure 1: We study whether existing tile based programming primitives suffice for AMD kernels, or whether entirely new primitives are needed. Our study led to HipKittens: a minimal and opinionated set of primitives for fast and furious AMD kernels. HK introduces a general 8-wave ping-pong schedule to overlap compute and memory, programmer controlled register allocation, and efficient shared memory and chiplet-aware swizzling algorithms to enable a suite of high performance AMD AI kernels.
  • Figure 2: Hardware overview. (Left) Peak memory and compute speeds for the latest generation GPU platforms amd2025mi355x25-blackwell. (Right) Diagram of the AMD GPU software and hardware hierarchy.
  • Figure 3: Matrix layouts on NVIDIA and AMD GPUs. The shaded cells in each matrix represent elements owned by thread 0.
  • Figure 4: Swizzle pattern for a 16x32 tile of BF16s. Shared memory on AMD CDNA4 GPUs have different banking behavior depending on the instruction. ds_read_b128 accesses shared memory through 64 banks, each 32-bits wide, and correspond the individual cells and numbers in the figure. The shaded cells represent banks that are accessed by the first phase of a ds_read_b128 instruction for a 16x32 row layout register tile. On the left is an unswizzled layout suffering from 2-way bank conflicts. On the right is a swizzled layout with no bank conflicts. The swizzle applied here swaps the first 8 columns with the last 8 starting from the 8th row. This swizzling strategy simultaneously enables bank-conflict free accesses from column-major reads using ds_read_b64_tr_b16. Details can be found in \ref{['app:layouts']}.
  • Figure 5: GEMM. We compare HK BF16 and FP8 GEMMs to the strongest available baselines.
  • ...and 19 more figures