Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Jonah Ekelund, Stefano Markidis, Ivy Peng

TL;DR

This paper tackles the overhead of repeatedly launching fine-grained kernels in iterative GPU applications by grouping kernel launches into iteration batches and unrolling each batch into a static CUDA Graph. A performance model predicts the optimal batch size, which balances graph-creation overhead against execution gains, and a skeleton iterative application validates the approach alongside real workloads (Hotspot, Hotspot3D, and an FDTD Maxwell solver). The study shows that, with an appropriate batch size, speed-ups of more than 1.4x are achievable in the skeleton application, and that gains persist across different NVIDIA architectures, although larger workloads reduce the relative benefit. The findings provide a practical methodology and model for porting iterative GPU workloads to CUDA Graph, with guidance on batching and measurement, and point to possible future automation of graph construction.

Abstract

Graphics Processing Units (GPUs) have become the standard for accelerating scientific applications on heterogeneous systems. However, as GPUs get faster, one potential performance bottleneck in GPU-accelerated applications is the overhead of launching several fine-grained kernels. CUDA Graph addresses this challenge with a graph-based execution model that captures operations as nodes and dependencies as edges in a static graph, thereby consolidating several kernel launches into one graph launch. We propose a performance optimization strategy for iteratively launched kernels: by grouping kernel launches into iteration batches and then unrolling these batches into a CUDA Graph, iterative applications can benefit from CUDA Graph. We analyze the performance gain and overhead of this approach by designing a skeleton application, which also serves as a generalized example of converting an iterative solver to CUDA Graph and as the basis for deriving a performance model. Using the skeleton application, we show that when unrolling iteration batches on a given platform, there is an optimal iteration batch size, independent of workload, that balances the extra overhead of graph creation against the performance gain of graph execution. Depending on the workload, the optimal iteration batch size gives a speed-up of more than 1.4x in the skeleton application. Furthermore, we show that similar speed-ups can be gained in Hotspot and Hotspot3D from the Rodinia benchmark suite and in a Finite-Difference Time-Domain (FDTD) Maxwell solver.
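
The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's performance model. It assumes, purely for illustration, that graph creation costs a fixed amount per unrolled kernel node, that the graph is created once and relaunched, and that each graph launch has a fixed cost. All function names and timing constants are hypothetical; only the 10,000 kernel iterations are taken from the caption of Figure 4.

```python
def total_time_ns(n_iters, batch, t_kernel_ns, t_graph_launch_ns, t_create_per_node_ns):
    """Toy cost of running n_iters kernels as graphs of `batch` unrolled iterations.

    Assumptions (hypothetical, not from the paper): creation cost grows linearly
    with the number of nodes, and the graph is built once and then reused.
    """
    n_launches = -(-n_iters // batch)           # ceiling division
    creation = t_create_per_node_ns * batch     # one-off graph creation overhead
    execution = n_launches * t_graph_launch_ns + n_iters * t_kernel_ns
    return creation + execution


def best_batch(n_iters, t_kernel_ns, t_graph_launch_ns, t_create_per_node_ns):
    """Brute-force search for the batch size with the lowest toy cost."""
    return min(
        range(1, n_iters + 1),
        key=lambda b: total_time_ns(
            n_iters, b, t_kernel_ns, t_graph_launch_ns, t_create_per_node_ns
        ),
    )


N_ITERS = 10_000               # kernel iterations, as in the Figure 4 caption
T_GRAPH_LAUNCH_NS = 10_000     # hypothetical cost of one graph launch
T_CREATE_PER_NODE_NS = 4_000   # hypothetical creation cost per kernel node

# Two hypothetical workloads that differ only in per-kernel execution time.
small = best_batch(N_ITERS, 5_000, T_GRAPH_LAUNCH_NS, T_CREATE_PER_NODE_NS)
large = best_batch(N_ITERS, 500_000, T_GRAPH_LAUNCH_NS, T_CREATE_PER_NODE_NS)

# Continuous optimum of a*B + N*l/B under the same assumptions.
closed_form = (N_ITERS * T_GRAPH_LAUNCH_NS / T_CREATE_PER_NODE_NS) ** 0.5

print(small, large, round(closed_form, 1))
```

Under these assumptions the kernel execution time adds the same constant to every candidate batch size, so the best batch size is the same for both workloads. This mirrors the abstract's claim that the optimum is independent of workload, while the achievable relative gain still depends on how large the kernels are.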

Paper Structure

This paper contains 11 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: A schematic representation of the iteration batch unrolling strategy presented in this paper.
  • Figure 2: Graph creation overhead and memory usage in the skeleton application as the size of the iteration batch increases. The creation overhead is the line plot with the y-axis on the left side and the memory usage is the bar plot with the y-axis on the right.
  • Figure 3: Launch and execution time for the graph version of the skeleton application as the size of the iteration batch increases. Note the logarithmic scale of the x-axis, and that each workload has a separate y-axis.
  • Figure 4: Skeleton application performance for different iteration batch sizes and 10,000 kernel iterations. The ratio between the graph creation time plus the total execution time for each graph size and the graph size with the lowest mean execution time. Note the logarithmic scale of the x-axis.
  • Figure 5: Execution trace from Nsight Systems. The first two graphs executed on the GPU are enclosed in boxes.
  • ...and 4 more figures