A Microbenchmark Framework for Performance Evaluation of OpenMP Target Offloading

Mohammad Atif, Tianle Wang, Zhihua Dong, Charles Leggett, Meifeng Lin

TL;DR

The paper addresses the challenge of evaluating OpenMP target offload performance amid rapid compiler and backend evolution on heterogeneous HPC GPUs. It introduces a Catch2-based microbenchmark framework with dynamic iteration counting and statistical bootstrapping, plus a Tabular Reporter, to produce robust, comparable timing metrics while avoiding kernel self-optimization. Validation against cuBLAS on Nvidia hardware demonstrates accurate timing measurements, and the authors showcase experiments across Perlmutter, Frontier, and local GPUs to compare OpenMP offload with native CUDA/HIP backends. The work offers a practical, reproducible tool for compiler developers and HPC users to quantify small performance changes, guide compiler and flag choices, and inform performance-portability studies.

Abstract

We present a framework based on Catch2 to evaluate the performance of OpenMP's target offload model via micro-benchmarks. The compilers supporting OpenMP's target offload model for heterogeneous architectures are currently undergoing rapid development. These developments influence the performance of various complex applications in different ways. This framework can be employed to track the impact of compiler upgrades and to compare the resulting performance with that of the native programming models. We use the framework to benchmark the performance of a few commonly used operations on leadership-class supercomputers such as Perlmutter at the National Energy Research Scientific Computing Center (NERSC) and Frontier at the Oak Ridge Leadership Computing Facility (OLCF). Such a framework will be useful for compiler developers to gain insights into the overall impact of many small changes, as well as for users to decide which compilers and versions are expected to yield the best performance for their applications.
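To make the statistical bootstrapping mentioned in the summary concrete, the following is a minimal sketch of a percentile bootstrap over a set of timing samples. It is not the paper's code and not Catch2's internal implementation. The names `BootstrapEstimate` and `bootstrap_mean`, the default resample count, the confidence level, and the fixed seed are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical summary type for illustration; not taken from the paper or Catch2.
struct BootstrapEstimate {
    double mean;   // mean of the original timing samples
    double lower;  // lower percentile bound of the resampled means
    double upper;  // upper percentile bound of the resampled means
};

// Percentile bootstrap of the mean: resample the timings with replacement,
// compute the mean of each resample, and read the interval bounds off the
// sorted distribution of those means.
inline BootstrapEstimate bootstrap_mean(const std::vector<double>& samples,
                                        std::size_t resamples = 1000,
                                        double confidence = 0.95,
                                        unsigned seed = 42) {
    assert(!samples.empty() && resamples > 0);
    assert(confidence > 0.0 && confidence < 1.0);

    const std::size_t n = samples.size();
    std::mt19937 rng(seed);  // fixed seed so repeated runs give the same bounds
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);

    std::vector<double> means(resamples);
    for (double& m : means) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            sum += samples[pick(rng)];  // draw with replacement
        }
        m = sum / static_cast<double>(n);
    }
    std::sort(means.begin(), means.end());

    // Cut an equal tail from each end of the sorted resampled means.
    const double tail = (1.0 - confidence) / 2.0;
    const double last = static_cast<double>(resamples - 1);
    const auto lo_idx = static_cast<std::size_t>(tail * last);
    const auto hi_idx = static_cast<std::size_t>((1.0 - tail) * last);

    const double mean =
        std::accumulate(samples.begin(), samples.end(), 0.0) /
        static_cast<double>(n);
    return {mean, means[lo_idx], means[hi_idx]};
}
```

In a benchmarking setting, `samples` would hold the measured wall-clock times of repeated kernel launches, and the width of the `[lower, upper]` interval indicates whether a small difference between two compiler versions is distinguishable from run-to-run noise.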

Paper Structure

This paper contains 14 sections, 13 figures, and 2 tables.

Figures (13)

  • Figure 1: A sketch of the Catch2-based microbenchmarking suite's workflow.
  • Figure 2: Comparing array initialization time on Nvidia V100 and A6000: Clang-15 vs CUDA-11.7 across datatypes and threads per block.
  • Figure 3: Comparing array initialization time on Perlmutter: varying threads per block and datatype.
  • Figure 4: Comparison of zaxpy execution time with the native programming models on Frontier and Perlmutter.
  • Figure 5: Zaxpy execution time on Perlmutter: varying threads per block and datatype.
  • ...and 8 more figures