Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core

Robert Jendersie, Christian Lessig, Thomas Richter

TL;DR

This work evaluates GPU-based parallelization strategies for the neXtSIM-DG sea ice dynamical core to enable kilometer-scale, high-resolution simulations. By porting the CPU code to CUDA, SYCL, Kokkos, and PyTorch and focusing on the dominant stress-update kernel within the mEVP loop, the study benchmarks usability and performance across frameworks. The results show CUDA as the most mature and fastest path, Kokkos offering strong portability with comparable speed, SYCL as currently unreliable, and PyTorch lagging behind though promising with TorchInductor. The authors recommend a full port with Kokkos for robust cross-hardware performance and outline future work on mixed precision, while providing open access to code and experiments for reproducibility and further development.

Abstract

The cryosphere plays a significant role in Earth's climate system. Therefore, an accurate simulation of sea ice is of great importance to improve climate projections. To enable higher resolution simulations, graphics processing units (GPUs) have become increasingly attractive as they offer higher floating-point peak performance and better energy efficiency compared to CPUs. However, making use of this theoretical peak performance, which is based on massive data parallelism, usually requires more care and effort in the implementation. In recent years, a number of frameworks have become available that promise to simplify general-purpose GPU programming. In this work, we compare multiple such frameworks, including CUDA, SYCL, Kokkos and PyTorch, for the parallelization of neXtSIM-DG, a finite-element-based dynamical core for sea ice. We evaluate the different approaches according to their usability and performance.

Paper Structure (15 sections, 3 equations, 2 figures, 2 tables)

This paper contains 15 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Total time spent on the stress computation for the different PyTorch variants on an A100. The products with $M^{-1}$ are implemented either as a batched matrix-matrix product (bmm) or as an element-wise product followed by a sum ($*$, sum).
  • Figure 2: Timings of the stress update using the best-performing version for each framework. The mesh cell size is scaled from 4 km to 0.25 km while the domain size is kept constant, which increases the number of elements. Dashed variants are run with lower floating-point precision.
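
To give a sense of the workload range behind the Figure 2 caption, the following minimal sketch counts elements at the two stated cell sizes. It assumes a uniform two-dimensional mesh of square cells on a square domain; the 512 km domain edge and the helper `element_count` are hypothetical, chosen only for illustration and not taken from the paper.

```python
def element_count(domain_km: float, cell_km: float) -> int:
    """Count square cells covering a square domain.

    Assumes a uniform 2D mesh (an illustrative simplification).
    """
    cells_per_side = round(domain_km / cell_km)
    return cells_per_side * cells_per_side


# Hypothetical domain edge length; the source does not state the domain size.
DOMAIN_KM = 512.0

coarse = element_count(DOMAIN_KM, 4.0)   # coarsest cell size in the caption
fine = element_count(DOMAIN_KM, 0.25)    # finest cell size in the caption

# Under the 2D assumption the count grows with the square of the refinement factor.
growth = fine // coarse

print(f"4 km cells:    {coarse} elements")
print(f"0.25 km cells: {fine} elements")
print(f"growth factor: {growth}")
```

Under these assumptions, refining the cell size by a factor of 16 multiplies the element count by 16² = 256, regardless of the chosen domain size.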