Static Generation of Efficient OpenMP Offload Data Mappings

Luke Marzen, Akash Dutta, Ali Jannesari

TL;DR

OpenMP offload data movement is a critical bottleneck on heterogeneous HPC systems. The paper presents OMPDart, a static, interprocedural, context-sensitive data-flow analysis that uses a hybrid AST-CFG representation to identify host–device data dependencies and automatically insert OpenMP data-mapping directives. Evaluated on nine HPC benchmarks from Rodinia and HeCBench, OMPDart achieves substantial reductions in host-to-device transfers and performance that is comparable to or better than expert mappings, including a notable $1.6\times$ speedup on lulesh and a geometric mean speedup of $2.8\times$ over default mappings. The approach is compiler-agnostic and operates as a source-to-source transformation, demonstrating practical impact by automating a long-standing optimization in OpenMP offloading.

Abstract

Increasing heterogeneity in HPC architectures and compiler advancements have led to OpenMP being frequently used to enable computations on heterogeneous devices. However, the efficient movement of data on heterogeneous computing platforms is crucial for achieving high utilization. Programmers must explicitly map data between the host and connected accelerator devices to achieve efficient data movement. Ensuring efficient data transfer requires programmers to reason about complex data flow. This can be a laborious and error-prone process since the programmer must keep a mental model of data validity and lifetime spanning multiple data environments. We present a static analysis tool, OMPDart (OpenMP Data Reduction Tool), for OpenMP programs that models data dependencies between host and device regions and applies source code transformations to achieve efficient data transfer. Our evaluations on nine HPC benchmarks demonstrate that OMPDart is capable of generating effective data mapping constructs that substantially reduce data transfer between host and device.
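To make the cost of redundant transfers concrete, the sketch below contrasts two hand-written mappings of the same loop nest. It is only an illustration of the kind of data-mapping directive the abstract refers to, not output produced by OMPDart; the function names `relax_naive` and `relax_hoisted` and the update formula are hypothetical. Built without OpenMP offload support, the pragmas are ignored and both functions run on the host with identical results.

```c
/* Naive mapping: the array is mapped on every kernel launch, so with an
 * offloading compiler it is copied to the device and back once per step. */
void relax_naive(double *a, int n, int steps) {
    for (int s = 0; s < steps; ++s) {
        #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
        for (int i = 0; i < n; ++i) {
            a[i] = 0.5 * a[i] + 1.0;
        }
    }
}

/* Hoisted mapping: one enclosing target data region keeps the array on the
 * device for the whole time loop, so it is copied in once before the first
 * kernel and copied out once after the last one. */
void relax_hoisted(double *a, int n, int steps) {
    #pragma omp target data map(tofrom: a[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            /* No map clause here: the pointer a resolves to the copy that
             * is already present in the enclosing device data environment. */
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; ++i) {
                a[i] = 0.5 * a[i] + 1.0;
            }
        }
    }
}
```

Both functions compute the same values; they differ only in how often the array crosses the host-device boundary. Finding a safe place for the hoisted mapping requires knowing that the host neither reads nor writes `a` between kernels, which is the kind of data-flow reasoning the abstract describes as laborious and error-prone when done by hand.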

Paper Structure (17 sections, 1 equation, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: Workflow used by OMPDart for identifying data dependencies and transforming source code for C and C++ OpenMP GPU offloading applications.
  • Figure 2: Example of AST-CFG representation. The graphical figure on the right is the AST-CFG representation of the code on the left.
  • Figure 3: Comparison of GPU data transfer activity (bytes) (lower is better). HtoD: Data transfer from CPU to GPU. DtoH: Data transfer from GPU to CPU.
  • Figure 4: Comparison of GPU data transfer activity (# calls) (lower is better). HtoD: Data transfer from CPU to GPU. DtoH: Data transfer from GPU to CPU.
  • Figure 5: Speedups over unoptimized OpenMP offload code. (Higher is better)
  • ...and 1 more figure