PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu

TL;DR

PyGraph addresses the CPU launch latency bottleneck in ML workloads by making CUDA Graphs (CGs) easier to deploy and more effective through a compiler-based approach. It introduces three optimizations: CUDA Graph-aware Code Transformation (CGCT), which widens graph capture; Parameter Indirection (PI), which eliminates data-copy overheads; and Selective CUDA Graphs (SCG), which gates CG deployment on end-to-end benefit. Across 25 ML workloads and distributed settings, PyGraph delivers substantial end-to-end speedups over the PyTorch2 CG baseline while avoiding regressions in cases where CGs are detrimental. The work shows that a compiler-driven approach requiring no changes to application code can significantly improve GPU utilization and performance for modern ML workloads.

Abstract

Machine learning (ML) workloads launch hundreds to thousands of short-running GPU kernels per iteration. With GPU compute throughput growing rapidly, CPU-side launch latency of kernels is emerging as a bottleneck. CUDA Graphs promise to address this by replaying a set of kernels with a single dispatch of the graph, removing per-kernel launch costs. However, CUDA Graphs remain surprisingly difficult to deploy correctly and efficiently. We present PyGraph, a compiler framework to maximize the coverage and benefits of CUDA Graphs for ML workloads. It introduces three novel optimizations: it applies automatic code transformations to make ML applications amenable to CUDA Graphs; it eliminates the parameter copy overheads for kernels executing in CUDA Graphs; and it selectively deploys CUDA Graphs guided by a cost-benefit analysis. For 25 ML workloads from TorchBench, HuggingFace, and TIMM, PyGraph more than doubles the benefit from deploying CUDA Graphs compared to the most popular and widely used ML compiler, PyTorch2. PyGraph is built atop PyTorch2's compilation framework and requires no programmer intervention.
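
The cost-benefit idea behind selective deployment can be illustrated with a minimal sketch. This is not PyGraph's actual cost model, which this summary does not describe. It assumes a simple rule: use a CUDA Graph for a code region only when the measured replay time plus any extra copy overhead is lower than the eager (per-kernel launch) time. The names `RegionProfile`, `graph_benefit_ms`, and `select_regions`, and all timing values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegionProfile:
    """Hypothetical per-iteration timings for one graph-capturable region."""
    name: str
    eager_ms: float   # time with ordinary per-kernel launches
    replay_ms: float  # time to replay the captured CUDA Graph
    copy_ms: float    # extra copy overhead introduced by using the graph


def graph_benefit_ms(p: RegionProfile) -> float:
    # Assumed model: benefit = eager time minus total time with the graph.
    return p.eager_ms - (p.replay_ms + p.copy_ms)


def select_regions(profiles, min_gain_ms=0.0):
    # Keep only regions where the graph is a net win under the assumed model.
    return [p.name for p in profiles if graph_benefit_ms(p) > min_gain_ms]


# Made-up measurements, chosen only to show both outcomes.
profiles = [
    RegionProfile("many_small_kernels", eager_ms=4.0, replay_ms=1.5, copy_ms=0.25),
    RegionProfile("large_input_copies", eager_ms=2.0, replay_ms=1.0, copy_ms=1.5),
]
selected = select_regions(profiles)
print(selected)
```

In this toy input, the first region saves launch time that outweighs its copy cost, while the second loses more to copies than it gains from replay, so only the first is selected. A real system would need measured timings rather than constants.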

Paper Structure

This paper contains 22 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: CUDA Graph reducing CPU launch overheads.
  • Figure 2: Simplified code snippet from speech transformer model with CPU-resident scalar (marked red).
  • Figure 3: Performance impact of moving tensors from CPU (unoptimized programs) to GPU (optimized programs).
  • Figure 4: Cost-benefit analysis of CUDA Graphs.
  • Figure 5: Converting data (parameter value) copy to pointer copy through indirection in Parameter Indirection.
  • ...and 6 more figures