PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch
Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu
TL;DR
PyGraph addresses the CPU launch latency bottleneck in ML workloads by making CUDA Graphs (CGs) more accessible and effective through a compiler-based approach. It introduces three optimizations: CUDA Graph-aware Code Transformation (CGCT), which widens graph capture; Parameter Indirection (PI), which eliminates data-copy overheads; and Selective CUDA Graphs (SCG), which gates CG deployment on end-to-end benefit. Across 25 ML workloads, as well as in distributed settings, PyGraph delivers substantial end-to-end speedups over the PyTorch2 CG baseline while avoiding regressions in cases where CGs are detrimental. The work demonstrates that compiler-driven integration, requiring no changes to application code, can significantly improve GPU utilization and performance for modern ML workloads.
Abstract
Machine learning (ML) workloads launch hundreds to thousands of short-running GPU kernels per iteration. With GPU compute throughput growing rapidly, the CPU-side latency of launching kernels is emerging as a bottleneck. CUDA Graphs promise to address this by replaying a set of kernels with a single dispatch of the graph, removing per-kernel launch costs. However, CUDA Graphs remain surprisingly difficult to deploy correctly and efficiently. We present PyGraph, a compiler framework that maximizes the coverage and benefits of CUDA Graphs for ML workloads. It introduces three novel optimizations: it applies automatic code transformations to make ML applications amenable to CUDA Graphs; it eliminates the parameter copy overheads for kernels executing in CUDA Graphs; and it selectively deploys CUDA Graphs guided by a cost-benefit analysis. For 25 ML workloads from TorchBench, HuggingFace, and TIMM, PyGraph more than doubles the benefit of deploying CUDA Graphs compared to PyTorch2, the most widely used ML compiler. PyGraph is built atop PyTorch2's compilation framework and requires no programmer intervention.
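To make the cost-benefit idea concrete, the following is a minimal sketch of how a selective-deployment decision could be modeled. It is a toy analytical model written for illustration, not PyGraph's actual mechanism: the `Region` fields, the `max(launch, exec)` pipelining assumption, and all timing values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A candidate kernel sequence that could be captured as one CUDA Graph.

    All fields are hypothetical inputs to a toy model (times in microseconds).
    """
    num_kernels: int
    kernel_time_us: float      # GPU execution time of each kernel
    launch_latency_us: float   # CPU-side cost to launch one kernel eagerly
    graph_launch_us: float     # cost of a single dispatch of the whole graph
    param_copy_us: float       # per-replay cost of copying inputs into graph buffers


def eager_time_us(r: Region) -> float:
    # Assumption: launches overlap with execution, so each kernel costs
    # whichever is larger, its launch latency or its execution time.
    return r.num_kernels * max(r.launch_latency_us, r.kernel_time_us)


def graph_time_us(r: Region) -> float:
    # One dispatch for the whole graph, back-to-back kernel execution,
    # plus the parameter copies required before each replay.
    return r.graph_launch_us + r.num_kernels * r.kernel_time_us + r.param_copy_us


def should_use_graph(r: Region) -> bool:
    # Deploy the graph only when the modeled replay time beats eager execution.
    return graph_time_us(r) < eager_time_us(r)


# Many short kernels: launch-bound when run eagerly, so the graph wins.
launch_bound = Region(100, 5.0, 10.0, 20.0, 50.0)
# Few long kernels with costly parameter copies: the graph is a net loss.
compute_bound = Region(4, 200.0, 10.0, 20.0, 400.0)

print(should_use_graph(launch_bound), should_use_graph(compute_bound))
```

Under these assumed numbers, the launch-bound region benefits from graph replay, while the compute-bound region does not because the parameter copies outweigh the saved launch latency. In this toy model, reducing `param_copy_us` (the overhead that Parameter Indirection targets) shifts more regions toward being profitable.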
