Unified schemes for directive-based GPU offloading

Yohei Miki, Toshihiro Hanawa

TL;DR

The paper tackles the challenge of porting CPU-originated codes to GPUs from multiple vendors by introducing Solomon, a header-only macro library that unifies the OpenACC and OpenMP target interfaces. It offers three notations (an intuitive form plus OpenACC-like and OpenMP-like styles) to ease adoption for both novices and experts, and demonstrates the approach on an $N$-body simulation and a 3D diffusion equation. Solomon lets a single codebase run with OpenACC on NVIDIA GPUs or with OpenMP target on NVIDIA/AMD/Intel GPUs, so the two backends can be compared from the same source. The results show cross-vendor offloading with reasonable performance across architectures, which reduces vendor lock-in and learning costs in directive-based GPU programming.

Abstract

GPUs are the dominant accelerator devices owing to their high performance and energy efficiency. Directive-based GPU offloading using OpenACC or OpenMP target is a convenient way to port existing codes originally developed for multicore CPUs. Although OpenACC and OpenMP target provide similar features, each has pros and cons. OpenACC offers richer functionality and abundant documentation, but in practice it is supported almost exclusively on NVIDIA GPUs. OpenMP target supports NVIDIA, AMD, and Intel GPUs but offers fewer features than OpenACC. Here, we have developed a header-only library, Solomon (Simple Off-LOading Macros Orchestrating multiple Notations), that unifies the interface for GPU offloading and supports both OpenACC and OpenMP target. Solomon provides three types of notation to reduce users' implementation and learning costs: an intuitive notation for beginners and OpenACC-like and OpenMP-like notations for experienced developers. This manuscript describes Solomon's implementation and usage and demonstrates GPU offloading in an $N$-body simulation and the three-dimensional diffusion equation. The library and sample codes are open-source software, freely available at https://github.com/ymiki-repo/solomon.
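The general idea of a header-only macro layer over both directive families can be sketched as follows. This is a minimal illustration, not Solomon's actual interface: every `DEMO_*` macro and the function `demo_saxpy` are hypothetical names introduced here, and the directive chosen for each backend is only one plausible mapping. The sketch relies on the standard C99 `_Pragma` operator and on the `_OPENACC` and `_OPENMP` macros that compilers predefine when the corresponding mode is enabled.

```c
#include <assert.h>

/* Hypothetical macros for illustration only; these are NOT Solomon's names. */

/* Two-step stringification so that macro arguments are expanded first. */
#define DEMO_PRAGMA_(...) _Pragma(#__VA_ARGS__)
#define DEMO_PRAGMA(...) DEMO_PRAGMA_(__VA_ARGS__)

#if defined(_OPENACC)
/* OpenACC backend */
#define DEMO_OFFLOAD_LOOP(...) DEMO_PRAGMA(acc parallel loop __VA_ARGS__)
#define DEMO_COPYIN(...) copyin(__VA_ARGS__)
#define DEMO_COPY(...) copy(__VA_ARGS__)
#elif defined(_OPENMP)
/* OpenMP target backend */
#define DEMO_OFFLOAD_LOOP(...) \
    DEMO_PRAGMA(omp target teams distribute parallel for __VA_ARGS__)
#define DEMO_COPYIN(...) map(to: __VA_ARGS__)
#define DEMO_COPY(...) map(tofrom: __VA_ARGS__)
#else
/* No directive support enabled: the loop runs serially on the host. */
#define DEMO_OFFLOAD_LOOP(...)
#define DEMO_COPYIN(...)
#define DEMO_COPY(...)
#endif

/* y[i] = a * x[i] + y[i], written once for all three cases above. */
void demo_saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    DEMO_OFFLOAD_LOOP(DEMO_COPYIN(x[0:n]) DEMO_COPY(y[0:n]))
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Compiled without any directive flags, the macros expand to nothing and `demo_saxpy` is an ordinary serial loop; enabling OpenACC or OpenMP in the compiler selects the matching branch without touching the loop body. Which flags enable each mode depends on the compiler.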

Paper Structure

This paper contains 13 sections, 2 equations, 2 figures, 20 tables.

Figures (2)

  • Figure 1: Measured performance of $N$-body simulations. The upper panels show the number of processed interaction pairs per second (the best performance in ten measurements) as a function of the number of $N$-body particles $N$. The open diamonds with a dotted line indicate the measured performance of the fastest implementation in each environment (Miki & Hanawa 2024). The lower panels show the performance of the OpenACC/OpenMP-offloaded implementations relative to the fastest implementation in Miki & Hanawa (2024). From left to right, the panels show the measured performance on NVIDIA H100 SXM 80GB, NVIDIA GH200 480GB, AMD Instinct MI210, and Intel Data Center GPU Max 1100.
  • Figure 2: Measured performance of the 3D diffusion equation. The panels display the best performance in ten measurements as a function of the total number of meshes $N_x N_y N_z$. They compare OpenMP target with loop (the red squares with a dot-dashed line), OpenMP target with distribute (the red circles with a dashed line), OpenACC with kernels (the blue lower triangles with a dot-dashed line), and OpenACC with parallel (the blue upper triangles with a dashed line). From left to right, the panels show the measured performance on NVIDIA H100 SXM 80GB, NVIDIA GH200 480GB, AMD Instinct MI210, and Intel Data Center GPU Max 1100.