Accelerating the Dutch Atmospheric Large-Eddy Simulation (DALES) model with OpenACC

Lucas Esclapez, Laurent Soucasse, Caspar Jungbacker, Fredrik Jansson, Stephan R. de Roode, Pedro Costa, Gijs van den Oord, Alessio Sclocco

TL;DR

The paper demonstrates a directive-based OpenACC port of the Dutch Atmospheric Large-Eddy Simulation (DALES) model to GPUs, enabling high-resolution atmospheric simulations with minimal code disruption. It details the modelling framework, the porting strategy (data management, loop collapsing, selective refactoring), and the integration of GPU-accelerated libraries such as RRTMGP and cuFFT. Through Cloud Botany reference cases, it validates numerical consistency with CPU runs and characterizes single-node performance across NVIDIA A100 and H100 GPUs, revealing strong speedups but limited weak-scaling due to FFT-based communications. The study also explores Kernel Tuner for auto-tuning stencil kernels, reporting meaningful gains on A100 for select kernels but only modest overall improvements when scaled across the code base, and discusses future work on AMD GPUs, alternative Poisson solvers, and mixed-precision acceleration.

Abstract

This paper presents the GPU port, through OpenACC directives, of the Dutch Atmospheric Large-Eddy Simulation (DALES) application, a high-resolution atmospheric model. The code is written in Fortran 90 and features parallel (distributed) execution through spatial domain decomposition. We assess the performance of the GPU offloading by comparing the time-to-solution on regular and accelerated HPC nodes. A weak scaling analysis is conducted, and portability across NVIDIA A100 and H100 hardware is discussed. Finally, we show how targeted kernels can benefit from further optimization with Kernel Tuner, a GPU kernel auto-tuning package.

Paper Structure

This paper contains 23 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: 3D visualization of the clouds in one Cloud Botany simulation. Clouds are shown in white and rain is shown in gray. The air temperature near the surface is shown in blue, dark blue areas being cold pools associated with rain.
  • Figure 2: Vertical profiles of time- and planar-averaged fields: temperature, resolved turbulent kinetic energy (TKE), liquid water specific humidity $q_l$, cloud fraction, and shortwave and longwave net radiative fluxes (counted positively from Earth to space). Profiles are plotted for both the GPU and CPU implementations.
  • Figure 3: Weak scaling of DALES on the botany case using Snellius A100 GPUs.
  • Figure 4: Run times of the subgrid module with three optimisation strategies (using the cache directive, using kernel fission, and using a collapsed ijk loop with a sequential n loop), relative to the baseline acceleration strategy.
  • Figure 5: Measured kernel timing distribution for the diffcsv kernel, varying the number of scalars $n$. On each subplot, the left and right panels correspond to A100 and H100 GPUs, respectively. Dashed red lines indicate the timing with default parameters.
  • ...and 2 more figures