Portability of Fortran's `do concurrent' on GPUs

Ronald M. Caplan, Miko M. Stulajter, Jon A. Linker, Jeff Larkin, Henry A. Gabb, Shiquan Su, Ivan Rodriguez, Zachary Tschirhart, Nicholas Malaya

TL;DR

The paper investigates how portable Fortran do concurrent (DC) GPU offload is across NVIDIA, Intel, and AMD GPUs by applying it to HipFT, a production solar surface flux transport code. It compares pure Fortran DC code with DC code that adds minimal OpenMP/OpenACC data-management directives, across three vendor toolchains and, where supported, across memory-management modes (separate, managed, unified). The key finding is that DC code can be offloaded with competitive performance on NVIDIA and Intel GPUs, especially with unified memory on architectures like Grace-Hopper; AMD support is emerging and currently more limited. The results demonstrate the viability of standard language parallelism for cross-vendor HPC codes and give compiler developers and hardware vendors practical guidance on the data-management and device-management features needed to maximize portability and performance.

Abstract

There is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the do concurrent (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC GPU offload support to its compiler, as has HPE for AMD GPUs. In this paper, we explore the current portability of using DC across GPU vendors using the in-production solar surface flux evolution code, HipFT. We discuss implementation and compilation details, including when/where using directive APIs for data movement is needed/desired compared to using a unified memory system. The performance achieved on both data center and consumer platforms is shown.
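
As a rough illustration of the distinction drawn above, the sketch below shows a do concurrent loop bracketed by optional data-movement directives. It is not taken from HipFT: the program name, array names, problem size, and the choice of OpenACC (rather than OpenMP) for the data directives are all assumptions made for this example. With unified or managed memory the two directive lines could simply be omitted; a compiler that is not asked to process OpenACC treats them as ordinary comments.

```fortran
program dc_sketch
  ! Hypothetical example: names and sizes are illustrative only.
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: a
  real(8), allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  a = 2.0d0
  x = 1.0d0
  y = 3.0d0

  ! Optional manual data movement (separate host/device memory).
  ! Ignored as a comment when OpenACC is not enabled.
  !$acc enter data copyin(x, y)

  ! Standard-language parallel loop; iterations are independent,
  ! so the compiler is free to offload it.
  do concurrent (i = 1:n)
    y(i) = a*x(i) + y(i)
  end do

  ! Bring the result back to the host and release device copies.
  !$acc exit data copyout(y) delete(x)

  ! Every element should equal 2*1 + 3 = 5.
  print *, 'max error: ', maxval(abs(y - 5.0d0))

  deallocate(x, y)
end program dc_sketch
```

Built with any standard Fortran compiler and no offload flags, this runs on the CPU. Offload is requested through vendor-specific options (for example, `-stdpar=gpu` with NVIDIA's nvfortran); the exact flags for each toolchain should be taken from the paper and the compiler documentation rather than from this sketch.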

Paper Structure (15 sections, 7 figures)

Figures (7)

  • Figure 1: Visualization of the HipFT test case used to evaluate the DC implementations. We show the maps at times 0, 200, 400, and 600 hours (left to right) for four of the eight realizations (top to bottom).
  • Figure 2: Run time comparison (less is better) of the HipFT test case on Intel data center GPUs between the original HipFT code and the modified code (adding !$omp parallel loop to the inner nested DC loops). We see a substantial performance improvement with the modified code (for these results, the file I/O time has been omitted from the "Other" category).
  • Figure 3: Timing results (less is better) of the HipFT test case for server/data center CPUs and GPUs.
  • Figure 4: Run times (less is better) of the HipFT test case on the Intel Arc 750 LE consumer GPU. We show the original code, the slightly modified code (*), and the modified code running the alternative advection algorithm (upwind).
  • Figure 5: Timing results (less is better) of the HipFT test case for consumer CPUs and GPUs. These results use the upwind method for advection; therefore, they cannot be directly compared to the server results in Fig. 3.
  • ...and 2 more figures