Modular GPU Programming with Typed Perspectives
Manya Bansal, Daniel Sainati, Joseph W. Cutler, Saman Amarasinghe, Jonathan Ragan-Kelley
TL;DR
Modern GPU programming suffers from a tension between per-thread control and collective execution, which undermines modularity. Prism tackles this by introducing typed perspectives that statically enforce correct resource usage while preserving low-level control, with Bundl providing a formal foundation and proofs of safety. The implementation demonstrates that Prism can express CUDA-like patterns and achieve performance comparable to hand-optimized CUDA kernels on devices such as the H100 and 4070 Ti, including advanced features like Tensor Cores and asynchronous data movement. Together, Prism and Bundl offer a practical path to safe, composable, high-performance GPU kernels, with clear potential for broader architectural support and data-race guarantees in future work.
Abstract
To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to individual threads to control their behavior, while simultaneously tracking the convergence of many threads acting in concert to perform collective operations like Tensor Core instructions. The tension between these two mindsets makes modular programming error prone. Functions that encapsulate collective operations, despite being called per-thread, must be executed cooperatively by groups of threads. In this work, we introduce Prism, a new GPU language that restores modularity while still giving programmers the low-level control over collective operations necessary for high performance. Our core idea is typed perspectives, which materialize, at the type level, the granularity at which the programmer is controlling the behavior of threads. We describe the design of Prism, implement a compiler for it, and lay its theoretical foundations in a core calculus called Bundl. We implement state-of-the-art GPU kernels in Prism and find that it offers programmers the safety guarantees needed to confidently write modular code without sacrificing performance.
