Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Rupanshu Soi, Rohan Yadav, Fredrik Kjolstad, Alex Aiken, Maryam Mehri Dehnavi, Michael Garland, Michael Bauer

TL;DR

Twill introduces a constraint-based joint optimization for software pipelining and warp specialization to automatically derive optimal schedules for Tensor Core GPUs. By recasting SWP as modulo scheduling and integrating WS through an SMT-based framework, Twill delivers optimal schedules across Hopper and Blackwell without relying on hand-tuned heuristics. The approach achieves competitive performance with state-of-the-art hand-optimized implementations (e.g., FA3/FA4) on forward and backward passes of attention, validating the practicality of automated joint optimization. This work provides a scalable, architecture-agnostic foundation for high-performance tile-based GPU programming and highlights the importance of holistic scheduling in leveraging fixed-function units.

Abstract

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.
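
To make the idea of modulo scheduling concrete, the following is a minimal sketch, not Twill's implementation. It assumes a hypothetical three-operation loop body loosely modeled on simplified attention (two GEMMs sharing one unit and an EXP on another), with made-up unit names and unit latencies. A brute-force search stands in for the off-the-shelf constraint solver the abstract mentions. It looks for the smallest initiation interval (II) for which start times satisfy both the dependence constraints and the modulo resource constraints.

```python
from itertools import product

# Hypothetical toy loop body: name -> (functional unit, latency in cycles).
# Unit names and latencies are illustrative assumptions, not measured costs.
OPS = {
    "gemm_qk": ("tensor_core", 1),
    "exp": ("cuda_core", 1),
    "gemm_pv": ("tensor_core", 1),
}

# Dependences as (producer, consumer, iteration distance).
DEPS = [
    ("gemm_qk", "exp", 0),
    ("exp", "gemm_pv", 0),
    ("gemm_pv", "gemm_pv", 1),  # accumulator carried into the next iteration
]


def is_valid(start, ii):
    """Return True if `start` is a legal modulo schedule for interval `ii`."""
    # Dependence constraint: the consumer, `dist` iterations later, must not
    # begin before the producer has finished.
    for src, dst, dist in DEPS:
        if start[dst] + dist * ii < start[src] + OPS[src][1]:
            return False
    # Resource constraint: each unit serves at most one op per slot modulo ii.
    busy = set()
    for name, (unit, latency) in OPS.items():
        for k in range(latency):
            slot = (unit, (start[name] + k) % ii)
            if slot in busy:
                return False
            busy.add(slot)
    return True


def modulo_schedule(max_ii=8):
    """Exhaustively search for the smallest feasible initiation interval."""
    names = list(OPS)
    # Search window for start times; an assumption that is ample for this toy.
    horizon = sum(latency for _, latency in OPS.values()) + max_ii
    for ii in range(1, max_ii + 1):
        for times in product(range(horizon), repeat=len(names)):
            start = dict(zip(names, times))
            if is_valid(start, ii):
                return ii, start
    return None


if __name__ == "__main__":
    print(modulo_schedule())
```

For this toy input the search cannot succeed with an initiation interval of 1, because the two GEMMs compete for the same unit, so the smallest feasible interval is 2. A real formulation would hand the same constraints to a solver rather than enumerate start times.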

Paper Structure

This paper contains 32 sections, 5 equations, 11 figures, 1 algorithm.

Figures (11)

  • Figure 1: Modulo scheduling a simplified Flash Attention expressed in a tile-based manner. The machine costs are for Hopper, where GEMM and EXP on a tile have roughly the same cost. Modulo scheduling recovers the Flash Attention 3 pipeline.
  • Figure 2: Visualization of three operations using different functional units scheduled on the same warp. The blocking sync after GEMM interrupts the concurrent issue of EXP.
  • Figure 3: Straight-line code analyzed by Twill's joint formulation. Sample $\operatorname{op}$ table entries are to the left.
  • Figure 4: Constraints enforcing a valid modulo schedule.
  • Figure 5: Memory Allocation Constraints.
  • ...and 6 more figures