Task-Based Tensor Computations on Modern GPUs
Rohan Yadav, Michael Garland, Alex Aiken, Michael Bauer
TL;DR
This paper presents Cypress, a task-based programming model with sequential semantics designed to exploit the asynchronous fixed-function units on modern GPUs such as NVIDIA's Hopper. Cypress separates a computation's logical description from a machine-specific mapping: the programmer writes tasks without low-level data movement or synchronization, and the compiler derives both from the mapping, generating warp-specialized code that drives the Tensor Core and the Tensor Memory Accelerator (TMA). The compiler pipeline performs dependence analysis, vectorization, copy elimination, resource allocation, and warp specialization, then lowers to CUDA C++ with static scheduling. Evaluated on GEMM, fused GEMM+Reduction, and Flash Attention, Cypress performs close to cuBLAS/cuDNN (0.88x-1.06x the performance of cuBLAS on GEMM and 0.80x-0.98x the performance of the best-known Flash Attention implementation) and outperforms Triton in several cases, demonstrating the practical viability of automated asynchrony management and mapping-driven optimization for Hopper-style GPUs.
Abstract
Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than on previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called tasks that operate on tensors and are free of communication and synchronization. Cypress programs are bound to the target machine through a mapping specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x and 0.98x the performance of the currently best-known Flash Attention implementation, while eliminating all aspects of explicit data movement and asynchronous computation from application code.
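The separation between tasks and mapping described in the abstract can be illustrated with a small toy in plain Python. This is only a sketch of the idea, not Cypress's actual interface: the names `Mapping`, `gemm_task`, `gemm_tile_task`, `run_gemm`, and `reference_gemm`, as well as the memory labels in `placement`, are hypothetical. The tasks contain no copies or synchronization and run in sequential program order, while tile sizes and tensor placement live in a separate mapping object.

```python
# Hypothetical illustration only: none of these names are Cypress's real API.
from dataclasses import dataclass


@dataclass
class Mapping:
    """Machine-specific decisions, kept apart from the task logic."""
    tile_m: int
    tile_n: int
    tile_k: int
    placement: dict  # tensor name -> memory name (descriptive only in this toy)


def gemm_tile_task(A, B, C, i0, j0, k0, m):
    """Leaf task: accumulate one tile of C += A @ B, with no copies or sync."""
    M, K, N = len(A), len(B), len(B[0])
    for i in range(i0, min(i0 + m.tile_m, M)):
        for j in range(j0, min(j0 + m.tile_n, N)):
            acc = C[i][j]
            for k in range(k0, min(k0 + m.tile_k, K)):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def gemm_task(A, B, C, m):
    """Top-level task: launches tile tasks in plain sequential order."""
    M, K, N = len(A), len(B), len(B[0])
    for i0 in range(0, M, m.tile_m):
        for j0 in range(0, N, m.tile_n):
            for k0 in range(0, K, m.tile_k):
                gemm_tile_task(A, B, C, i0, j0, k0, m)


def run_gemm(A, B, m):
    """Allocate a zeroed C and run the task tree under mapping m."""
    C = [[0] * len(B[0]) for _ in range(len(A))]
    gemm_task(A, B, C, m)
    return C


def reference_gemm(A, B):
    """Untiled matrix product used to check the tiled tasks."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]
```

In this toy, swapping one `Mapping` for another changes only how the work is tiled, never the result, which mirrors the claim that the logical description is independent of the machine binding. A real compiler would additionally use the mapping to generate the asynchronous copies and synchronization that this sketch omits.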
