Task-Based Tensor Computations on Modern GPUs

Rohan Yadav, Michael Garland, Alex Aiken, Michael Bauer

TL;DR

This paper presents Cypress, a task-based programming model with sequential semantics designed to exploit asynchronous fixed-function units on modern GPUs like NVIDIA's Hopper. Cypress separates a computation’s logical description from a machine-specific mapping, and its compiler derives data movement and synchronization from the two. This enables warp-specialized code generation that uses the Tensor Core and Tensor Memory Accelerator without burdening the programmer with low-level details. The Cypress compiler pipeline includes dependence analysis, vectorization, copy elimination, resource allocation, and warp specialization, ultimately lowering to CUDA C++ with static scheduling. Evaluation across GEMM, fused GEMM+Reduction, and Flash Attention shows Cypress achieving performance close to that of cuBLAS/cuDNN and outperforming Triton in several cases, demonstrating the practical viability of automated asynchrony management and mapping-driven optimization for Hopper-style GPUs.

Abstract

Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called tasks that operate on tensors and are free of communication and synchronization. Cypress programs are bound to the target machine through a mapping specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written code. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and 0.80x-0.98x the performance of the currently best-known Flash Attention implementation, while eliminating all aspects of explicit data movement and asynchronous computation from application code.
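
The split between a logical description and a mapping can be illustrated with a toy sketch. The Python below is not Cypress syntax and none of its names (`gemm_tile_task`, `gemm`, the `tile_m`/`tile_n`/`tile_k` keys) come from the paper; it only assumes that a mapping can be modeled as a small dictionary of tile sizes, which stands in for the real decisions about processors and memories. The point it tries to show is that the task code contains no communication or synchronization, and that changing the mapping changes how work is decomposed but not the result.

```python
# Hypothetical illustration only: these names are not part of Cypress.

def gemm_tile_task(A, B, C, rows, cols, ks):
    """Leaf task: accumulate one tile of C += A @ B.

    The task only reads and writes its tensor arguments; it contains
    no explicit data movement or synchronization.
    """
    for i in rows:
        for j in cols:
            acc = C[i][j]
            for k in ks:
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def gemm(A, B, mapping):
    """Logical description: decompose GEMM into tile tasks.

    Tasks are launched in program order (sequential semantics). The
    `mapping` dict (an assumption of this sketch) only supplies tile sizes.
    """
    M, K, N = len(A), len(B), len(B[0])
    tm, tn, tk = mapping["tile_m"], mapping["tile_n"], mapping["tile_k"]
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            for k0 in range(0, K, tk):
                gemm_tile_task(
                    A, B, C,
                    range(i0, min(i0 + tm, M)),
                    range(j0, min(j0 + tn, N)),
                    range(k0, min(k0 + tk, K)),
                )
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]

# Two different mappings for the same logical program.
mapping_small = {"tile_m": 1, "tile_n": 1, "tile_k": 1}
mapping_large = {"tile_m": 2, "tile_n": 2, "tile_k": 3}

result_small = gemm(A, B, mapping_small)
result_large = gemm(A, B, mapping_large)
print(result_small == result_large)
```

In this sketch the two mappings produce identical outputs because the logical program is unchanged. A real compiler in this style would additionally use the mapping to decide where each task runs and where each tensor lives, which this toy does not model.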

Paper Structure

This paper contains 37 sections, 2 equations, 14 figures.

Figures (14)

  • Figure 1: High-level GEMM computation structure on Ampere and Hopper GPUs.
  • Figure 2: H100 Machine Model
  • Figure 3: Abstract syntax for Cypress programs.
  • Figure 4: Output matrix layout in registers for M=64,N=n*8 warpgroup matrix multiplication instruction (adapted from NVIDIA PTX documentation).
  • Figure 5: H100 GEMM implementation developed in Cypress. The logical description expresses the decomposition of computation and data, and the mapping binds tasks and tensors to physical processors and memories. Numbered bracket annotations indicate components of the logical description controlled by the mapping specification. Communication and synchronization are notably absent from the logical description.
  • ...and 9 more figures