Table of Contents
Fetching ...

Modular GPU Programming with Typed Perspectives

Manya Bansal, Daniel Sainati, Joseph W. Cutler, Saman Amarasinghe, Jonathan Ragan-Kelley

TL;DR

Modern GPU programming suffers from a tension between per-thread control and collective execution, which undermines modularity. Prism tackles this by introducing typed perspectives that statically enforce correct resource usage while preserving low-level control, with Bundl providing a formal foundation and proofs of safety. The implementation demonstrates that Prism can express CUDA-like patterns and achieve performance comparable to hand-optimized CUDA kernels on devices such as the H100 and 4070 Ti, including advanced features like Tensor Cores and asynchronous data movement. Together, Prism and Bundl offer a practical path to safe, composable, high-performance GPU kernels, with clear potential for broader architectural support and data-race guarantees in future work.

Abstract

To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to individual threads to control their behavior, while simultaneously tracking the convergence of many threads acting in concert to perform collective operations like Tensor Core instructions. The tension between these two mindsets makes modular programming error prone. Functions that encapsulate collective operations, despite being called per-thread, must be executed cooperatively by groups of threads. In this work, we introduce Prism, a new GPU language that restores modularity while still giving programmers the low-level control over collective operations necessary for high performance. Our core idea is typed perspectives, which materialize, at the type level, the granularity at which the programmer is controlling the behavior of threads. We describe the design of Prism, implement a compiler for it, and lay its theoretical foundations in a core calculus called Bundl. We implement state-of-the-art GPU kernels in Prism and find that it offers programmers the safety guarantees needed to confidently write modular code without sacrificing performance.

Modular GPU Programming with Typed Perspectives

TL;DR

Modern GPU programming suffers from a tension between per-thread control and collective execution, which undermines modularity. Prism tackles this by introducing typed perspectives that statically enforce correct resource usage while preserving low-level control, with Bundl providing a formal foundation and proofs of safety. The implementation demonstrates that Prism can express CUDA-like patterns and achieve performance comparable to hand-optimized CUDA kernels on devices such as the H100 and 4070 Ti, including advanced features like Tensor Cores and asynchronous data movement. Together, Prism and Bundl offer a practical path to safe, composable, high-performance GPU kernels, with clear potential for broader architectural support and data-race guarantees in future work.

Abstract

To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to individual threads to control their behavior, while simultaneously tracking the convergence of many threads acting in concert to perform collective operations like Tensor Core instructions. The tension between these two mindsets makes modular programming error prone. Functions that encapsulate collective operations, despite being called per-thread, must be executed cooperatively by groups of threads. In this work, we introduce Prism, a new GPU language that restores modularity while still giving programmers the low-level control over collective operations necessary for high performance. Our core idea is typed perspectives, which materialize, at the type level, the granularity at which the programmer is controlling the behavior of threads. We describe the design of Prism, implement a compiler for it, and lay its theoretical foundations in a core calculus called Bundl. We implement state-of-the-art GPU kernels in Prism and find that it offers programmers the safety guarantees needed to confidently write modular code without sacrificing performance.

Paper Structure

This paper contains 69 sections, 9 theorems, 94 equations, 18 figures.

Key Result

theorem 1

(Type-and-Perspective Safety). For any program $s$ such that $\Gamma \vdash^\pi s$, either:

Figures (18)

  • Figure 1: Warp-level Tensor Core instruction in CUDA.
  • Figure 2: Warp-level Tensor Core instruction in Prism.
  • Figure 3: The @ syntax represents the frequency at which a variable varies across threads T1-T8. Each @ denotes a coloring of the variable; threads that "view" the variable with the same color must observe the same value.
  • Figure 4: Invoking a warp-level Tensor Core instruction in CUDA after loading data from memory.
  • Figure 6: Naive tensor float 32 matrix multiplication in Prism (full program can be found in Appendix \ref{['sec:tf32-func']}).
  • ...and 13 more figures

Theorems & Definitions (9)

  • theorem 1
  • lemma 1
  • lemma 2
  • lemma 3
  • lemma 4
  • lemma 5
  • lemma 6
  • lemma 7
  • lemma 8