ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Stuart H. Sul, Simran Arora, Benjamin F. Spector, Christopher Ré
TL;DR
ParallelKittens (PK) reframes multi-GPU kernel design as a small, principled space defined by transfer mechanisms, scheduling, and overheads, enabling high-performance overlap without operator-specific hacks. With a minimal CUDA framework and the LCSC program template, PK achieves competitive or superior throughput compared with hand-optimized kernels and compiler-based approaches across data/tensor, sequence, and expert parallelism on Hopper and Blackwell GPUs. The work demonstrates concrete gains, including reductions in non-overlapped communication to single-digit percentages for certain workloads and up to 4.08× speedups for sequence-parallel tasks, while maintaining low code complexity (often <50 lines of device code). PK's abstractions (tile-based data structures, targeted transfer primitives, and a unified runtime template) offer a practical path to scalable intra-node multi-GPU performance. The framework is open source, and inter-node extension is left as future work.
Abstract
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to $2.33\times$ speedup for data- and tensor-parallel workloads, $4.08\times$ for sequence-parallel workloads, and $1.22\times$ for expert-parallel workloads.
