ThunderKittens: Simple, Fast, and Adorable AI Kernels

Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré

TL;DR

AI kernel development faces a GPU-mapping bottleneck: even hand-tuned kernels fall short of their theoretical peak performance. ThunderKittens (TK) introduces three abstractions (tile-based data structures, a generalized asynchronous load-compute-store-finish (LCSF) template, and grid-level scheduling) that simplify kernel development while maintaining high performance. TK kernels match or surpass CuBLAS and FlashAttention-3 on key tasks (GEMM, attention) and exceed baselines on attention backward passes, state space models, and linear attention, all within a small, embeddable C++ framework. The work demonstrates that a compact, well-chosen set of abstractions can support a broad range of high-performance GPU kernels, and it provides an open-source path for adoption and extension.

Abstract

The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-established operations like linear attention. The diverse hardware capabilities of GPUs might suggest that we need a wide variety of techniques to achieve high performance. However, our work explores whether a small number of key abstractions can drastically simplify the process. We present ThunderKittens (TK), a framework for writing performant AI kernels that remains easy to use and maintain. Our abstractions map to the three levels of the GPU hierarchy: (1) at the warp level, we provide 16x16 matrix tiles as basic data structures and PyTorch-like parallel compute operations over tiles, (2) at the thread-block level, we provide a template for overlapping asynchronous operations across parallel warps, and (3) at the grid level, we provide support to help hide block launch, tear-down, and memory costs. We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations. We match CuBLAS and FlashAttention-3 on GEMM and attention inference performance and outperform the strongest baselines by $10$-$40\%$ on attention backwards, $8\times$ on state space models, and $14\times$ on linear attention.
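
To make the warp-level abstraction concrete, the following is a minimal sketch of a matrix multiply decomposed into 16x16 tiles. It is plain Python rather than TK's C++ interface: the function name `tile_matmul` and the list-of-lists matrix representation are assumptions made for illustration, and the sketch shows only the tile decomposition, not the asynchronous or grid-level machinery.

```python
TILE = 16  # tile edge length, matching the 16x16 tiles named in the abstract


def tile_matmul(a, b, tile=TILE):
    """Multiply a (m x k) by b (k x n), accumulating one tile product at a time.

    Illustrative only: `tile_matmul` is a hypothetical helper, not a TK function.
    """
    m, k, n = len(a), len(b), len(b[0])
    if len(a[0]) != k:
        raise ValueError("inner dimensions do not match")
    if any(dim % tile for dim in (m, k, n)):
        raise ValueError("all dimensions must be multiples of the tile size")

    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile row of the output
        for j0 in range(0, n, tile):      # tile column of the output
            for k0 in range(0, k, tile):  # walk tiles along the shared dimension
                # Accumulate the product of one tile of a with one tile of b
                # into the corresponding output tile.
                for i in range(i0, i0 + tile):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, k0 + tile):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, j0 + tile):
                            row_c[j] += a_ik * row_b[j]
    return c
```

Each innermost tile product is the unit of work that a tile-level operation would express as a single call; the three outer loops only decide which tiles meet.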

Paper Structure

This paper contains 43 sections, 1 equation, 19 figures, 5 tables.

Figures (19)

  • Figure 1: ThunderKittens explores whether a small set of abstractions can support efficient and simple AI kernels. Inspired by PyTorch, our abstractions include tiles with managed layouts and operations over tiles. We provide a general program template for coordinating asynchronous parallel workers -- e.g., workers that load from and store data to HBM, while other workers perform computation in fast memory.
  • Figure 2: Attention implemented in TK using familiar PyTorch-like operations on tiles.
  • Figure 3: The software (and physical) GPU hierarchy.
  • Figure 4: Shared memory bank layouts, illustrated for a 16x64 16-bit tile. Top left: A naive, row-major layout. Although loading rows is efficient, loading into a tensor core layout suffers 8-way bank conflicts. Top right: A padded layout, which has no bank conflicts but consumes additional memory and has poor hardware support. Bottom: Two of TK's three chosen layouts, with compile-time selection based on width. (Bank conflicts are unavoidable for some tile sizes while maintaining good hardware support.) These layouts have 2-way and no bank conflicts, respectively.
  • Figure 5: An attention kernel within the LCSF template. The left shows the functionality for workers focused on managing HBM to SRAM memory movement, and the right shows the functionality for parallel compute workers that operate in fast memory, registers and SRAM.
  • ...and 14 more figures
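
The 8-way conflict described in the Figure 4 caption can be reproduced with a small bank-arithmetic sketch. It assumes the usual NVIDIA shared memory model of 32 banks, each 4 bytes wide, and models a tensor-core-style load as 8 rows of 16 bytes fetched together; that access pattern, the helper name `conflict_degree`, and the 16-byte padding are illustrative assumptions, not details taken from the paper. TK's own layouts are not modeled here.

```python
from collections import defaultdict

NUM_BANKS = 32        # assumed: shared memory is split into 32 banks
BANK_WIDTH_BYTES = 4  # assumed: each bank serves one 4-byte word at a time


def conflict_degree(row_stride_bytes, rows=8, bytes_per_row=16):
    """Worst-case count of distinct words that land in the same bank.

    Models `rows` row segments of `bytes_per_row` bytes each, all starting at
    the same column and separated by `row_stride_bytes` (hypothetical helper).
    """
    words_per_bank = defaultdict(set)
    for r in range(rows):
        for offset in range(0, bytes_per_row, BANK_WIDTH_BYTES):
            word = (r * row_stride_bytes + offset) // BANK_WIDTH_BYTES
            words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


naive_stride = 64 * 2               # 64 columns of 16-bit elements = 128 bytes
padded_stride = naive_stride + 16   # hypothetical padding of 16 bytes per row

print(conflict_degree(naive_stride))   # 8: every row starts at bank 0
print(conflict_degree(padded_stride))  # 1: rows are staggered across all banks
```

In the row-major case each row spans exactly 128 bytes, so every row begins at bank 0 and the 8 row segments collide on the same four banks. Padding staggers the rows and removes the conflict, at the cost of the extra memory the caption mentions.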