ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Stuart H. Sul, Simran Arora, Benjamin F. Spector, Christopher Ré

TL;DR

ParallelKittens reframes multi-GPU kernel design as a small, principled space defined by transfer mechanisms, scheduling, and overheads, enabling high-performance overlap without operator-specific hacks. By introducing a minimal CUDA framework and the LCSC program template, PK achieves competitive or superior throughput compared with hand-optimized kernels and compiler-based approaches across data/tensor, sequence, and expert parallelism on Hopper and Blackwell GPUs. The work demonstrates concrete gains, including reductions in non-overlapped communication to single-digit percentages for certain workloads and up to 4.08× speedups for sequence-parallel tasks, while maintaining low code complexity (often <50 lines of device code). PK’s abstractions—tile-based data structures, targeted transfer primitives, and a unified runtime template—offer a practical path to scalable intra-node multi-GPU performance, with open-source availability for broader adoption and further inter-node extension as future work.

Abstract

Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to $2.33 \times$ speedup for data- and tensor-parallel workloads, $4.08 \times$ for sequence-parallel workloads, and $1.22 \times$ for expert-parallel workloads.

Paper Structure

This paper contains 47 sections, 4 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: We study the principles for high performance multi-GPU kernels and introduce ParallelKittens (PK), an opinionated collection of programming primitives to encapsulate these principles. The GPU memory hierarchy and corresponding PK abstractions are shown on the left (Section \ref{sec:pk-data-structure}), and the PK program template with its key multi-GPU kernel components is shown on the right (Section \ref{sec:pk-program-template}).
  • Figure 2: Observed memory bandwidth utilization for a 1 GB peer-to-peer transfer over NVLink. For device-initiated (TMA) transfers, the maximum supported message size is 227 KB; throughput values beyond this limit are held constant for visual comparison.
  • Figure 3: The number of SMs required to saturate NVLink bandwidth using different communication mechanisms.
  • Figure 4: GEMM reduce-scatter (RS) and all-reduce (AR) performance across overlapping schedules. Measured on $8\times$H100 GPUs with local GEMM shape $N \times N \times N/8$ ($N = 32768$) and element type BF16.
  • Figure 5: Performance comparison of different inter-SM scheduling approaches on all-gather (AG) GEMM ($N \times N/8 \times N$).
  • ...and 17 more figures