
DiOMP-Offloading: Toward Portable Distributed Heterogeneous OpenMP

Baodi Shan, Mauricio Araya-Polo, Barbara Chapman

TL;DR

This work tackles portability and performance challenges in distributed heterogeneous HPC by unifying PGAS-style global memory with OpenMP target offloading. The proposed DiOMP-Offloading framework builds a unified runtime atop LLVM/OpenMP and GASNet-EX (or GPI-2), enabling transparent remote memory access across GPUs with symmetric or asymmetric allocations. A key contribution is OMPCCL, a portable device-side collective layer that interoperates with vendor libraries such as NCCL and RCCL, coupled with a DiOMP Group abstraction that flexibly scopes collectives. Evaluations on NVIDIA A100, Grace Hopper, and AMD MI250X show that DiOMP improves point-to-point and collective communication performance, accelerates matrix multiplication by overlapping computation with communication, and reduces programming effort in distributed scenarios. Together, these results indicate that DiOMP-Offloading delivers scalable, portable performance for heterogeneous HPC workloads and offers a blueprint for integrating PGAS with directive-based models such as OpenMP.

Abstract

As core counts and heterogeneity rise in HPC, traditional hybrid programming models face challenges in managing distributed GPU memory and ensuring portability. This paper presents DiOMP, a distributed OpenMP framework that unifies OpenMP target offloading with the Partitioned Global Address Space (PGAS) model. Built atop LLVM/OpenMP and using GASNet-EX or GPI-2 for communication, DiOMP transparently handles global memory, supporting both symmetric and asymmetric GPU allocations. It leverages OMPCCL, a portable collective communication layer compatible with vendor libraries. DiOMP simplifies programming by abstracting device memory and communication, achieving superior scalability and programmability over traditional approaches. Evaluations on NVIDIA A100, Grace Hopper, and AMD MI250X show improved performance in micro-benchmarks and applications like matrix multiplication and Minimod, highlighting DiOMP's potential for scalable, portable, and efficient heterogeneous computing.
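To make the distinction between symmetric and asymmetric allocations concrete, the following is a minimal sketch of a generic PGAS-style address lookup, not DiOMP's actual runtime or API. All names (`Rank`, `symmetric_alloc`, `put_symmetric`, `put_asymmetric`) are hypothetical, device heaps are simulated with `bytearray`, and the per-rank lookup table for asymmetric buffers is an assumption about how such a scheme could work.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    """One process with a simulated device heap (hypothetical toy model)."""
    heap: bytearray
    table: dict = field(default_factory=dict)  # name -> (offset, size)
    top: int = 0                               # bump-allocator cursor

    def alloc(self, name: str, size: int) -> int:
        """Reserve `size` bytes on this rank only and record the offset."""
        offset = self.top
        if offset + size > len(self.heap):
            raise MemoryError("simulated heap exhausted")
        self.table[name] = (offset, size)
        self.top += size
        return offset


def symmetric_alloc(ranks, name, size):
    """Collective allocation: every rank reserves the same size in lockstep,
    so the resulting offset is identical everywhere."""
    offsets = {r.alloc(name, size) for r in ranks}
    assert len(offsets) == 1, "symmetric allocation requires identical offsets"
    return offsets.pop()


def put_symmetric(ranks, target, offset, data):
    """Remote write that needs only the locally known offset."""
    ranks[target].heap[offset:offset + len(data)] = data


def put_asymmetric(ranks, target, name, data):
    """Remote write that must first look up the target's own offset and size
    (assumed here to be available through a per-rank table)."""
    offset, size = ranks[target].table[name]
    if len(data) > size:
        raise ValueError("write exceeds the target's allocation")
    ranks[target].heap[offset:offset + len(data)] = data


def demo():
    ranks = [Rank(bytearray(64)) for _ in range(2)]
    halo = symmetric_alloc(ranks, "halo", 8)  # same offset on both ranks
    ranks[0].alloc("work", 4)                 # per-rank sizes now diverge
    ranks[1].alloc("work", 16)
    ranks[0].alloc("extra", 4)
    ranks[1].alloc("extra", 4)
    put_symmetric(ranks, 1, halo, b"ABCD")
    put_asymmetric(ranks, 1, "extra", b"WXYZ")
    return ranks
```

In this toy model the symmetric path needs no metadata exchange because the offset is the same on every rank, while the asymmetric path pays for one extra lookup per remote access. How DiOMP actually resolves asymmetric addresses is described in the paper itself, not here.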

Paper Structure

This paper contains 15 sections and 10 figures.

Figures (10)

  • Figure 1: Comparison of data management and communication workflows between OpenMP Target + MPI and DiOMP-Offloading. (a) In the OpenMP Target + MPI approach, libomptarget and MPI manage GPU memory separately, each maintaining its own metadata and performing independent memory registration via distinct APIs (e.g., CUDA Driver and MPI windows). This separation leads to redundant memory handling, inconsistent synchronization (e.g., OpenMP implicit barrier vs. MPI fence), and uncoordinated data lifecycles. (b) DiOMP-Offloading provides a unified runtime that integrates OpenMP target regions and communication functions. It manages a centralized mapping table and coordinates memory registration and synchronization, avoiding duplication and ensuring consistency across layers.
  • Figure 2: Symmetric and asymmetric memory allocation in DiOMP-Offloading.
  • Figure 3: Latency comparison of DiOMP and MPI operations using InfiniBand and HPE Slingshot 11 from 4 bytes to 8KB. Lower is better.
  • Figure 4: Bandwidth comparison of DiOMP and MPI operations using InfiniBand and HPE Slingshot 11 across varying data sizes. *The anomalous behavior of DiOMP Put in Slingshot 11 + A100 has been addressed below. Higher is better.
  • Figure 5: Bandwidth comparison of two DiOMP implementations (GASNet-EX and GPI-2) over NDR InfiniBand.
  • ...and 5 more figures
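
The contrast drawn in the Figure 1 caption can be illustrated with a small bookkeeping model. This is a hedged sketch only: the classes `SeparateLayers` and `UnifiedRuntime` and their methods are hypothetical stand-ins, not the interfaces of libomptarget, MPI, or DiOMP, and "registration" is reduced to a counter.

```python
class SeparateLayers:
    """Figure 1(a) style: the offload runtime and the communication
    library each keep their own metadata for the same buffer."""

    def __init__(self):
        self.offload_table = {}  # buffer id -> size, owned by the offload layer
        self.comm_windows = {}   # buffer id -> size, owned by the comm layer
        self.registrations = 0

    def map_to_device(self, buf, size):
        self.offload_table[buf] = size
        self.registrations += 1

    def expose_for_rma(self, buf, size):
        # Second, independent registration of the same memory.
        self.comm_windows[buf] = size
        self.registrations += 1

    def release(self, buf):
        # Only the offload layer is informed; the comm entry goes stale.
        self.offload_table.pop(buf, None)


class UnifiedRuntime:
    """Figure 1(b) style: one mapping table serves both purposes."""

    def __init__(self):
        self.mapping = {}        # buffer id -> size, shared by both paths
        self.registrations = 0

    def map_to_device(self, buf, size):
        if buf not in self.mapping:
            self.mapping[buf] = size
            self.registrations += 1

    def expose_for_rma(self, buf, size):
        # Reuses the existing entry instead of registering again.
        self.map_to_device(buf, size)

    def release(self, buf):
        # A single release retires the buffer for offload and communication.
        self.mapping.pop(buf, None)


def run(runtime):
    """Map one buffer, expose it for remote access, then release it."""
    runtime.map_to_device("A", 1024)
    runtime.expose_for_rma("A", 1024)
    runtime.release("A")
    return runtime
```

Running `run(SeparateLayers())` leaves two registrations and a stale entry in `comm_windows`, whereas `run(UnifiedRuntime())` performs one registration and leaves no metadata behind. This mirrors the redundancy and lifecycle mismatch the caption attributes to the two-layer approach.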