Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Chayanon Wichitrnithed, Woo-Sun Yang, Yun He, Brad Richardson, Koichi Sakaguchi, Manuel Arenaz, William I. Gustafson, Jacob Shpund, Ulises Costi Blanco, Alvaro Goldar Dieste

TL;DR

This work accelerates a hotspot in the Weather Research and Forecasting (WRF) model by porting parts of the 33-bin FSBM microphysics routine to NVIDIA GPUs using OpenMP device offloading. A workflow combining runtime profiling with the Codee static analysis tool guides a sequence of refactorings that remove data dependencies and enable deeper loop collapses. The study reports up to a 2.08x overall speedup on a CONUS-12km test case and discusses memory-bound constraints and occupancy considerations, offering a practical blueprint for GPU acceleration of legacy weather codes. The findings underscore the value of integrating static modernization tools with runtime profiling to accelerate and validate large HPC codes, and outline directions for extending GPU offloading to other microphysics components.

Abstract

Currently, the Weather Research and Forecasting model (WRF) uses shared-memory (OpenMP) and distributed-memory (MPI) parallelism. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To facilitate this process, we explore an optimization workflow that uses both runtime profilers and a static code inspection tool, Codee, to refactor the subroutine. We observe a 2.08x overall speedup for the CONUS-12km thunderstorm test case.

Paper Structure (15 sections, 1 equation, 4 figures, 7 tables)

Figures (4)

  • Figure 1: WRF decomposition layer. Diagram from MichalakesUnknown-db.
  • Figure 2: Comparison of bulk and bin microphysics schemes. Image from Morrison2020-zn.
  • Figure 3: The solid lines form rooflines, with the top horizontal line for single precision and the bottom one for double precision. The green and brown circles at the bottom are the observed values in single and double precision, respectively, when collapsing the two outermost loops. The pair of points above them are the observed values when collapsing three loops.
  • Figure 4: Total elapsed time for different versions of the code. For the GPU version, the number of GPUs is fixed at 16. In the rightmost group, the CPU codes run on 256 cores while the GPU code runs on 40 cores and 8 GPUs.