Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core
Robert Jendersie, Christian Lessig, Thomas Richter
TL;DR
This work evaluates GPU-based parallelization strategies for the neXtSIM-DG sea ice dynamical core to enable kilometer-scale, high-resolution simulations. By porting the CPU code to CUDA, SYCL, Kokkos, and PyTorch and focusing on the dominant stress-update kernel within the mEVP loop, the study benchmarks usability and performance across frameworks. The results show CUDA as the most mature and fastest path, Kokkos offering strong portability with comparable speed, SYCL as currently unreliable, and PyTorch lagging behind though promising with TorchInductor. The authors recommend a full port with Kokkos for robust cross-hardware performance and outline future work on mixed precision, while providing open access to code and experiments for reproducibility and further development.
Abstract
The cryosphere plays a significant role in Earth's climate system. Therefore, an accurate simulation of sea ice is of great importance to improve climate projections. To enable higher resolution simulations, graphics processing units (GPUs) have become increasingly attractive as they offer higher floating-point peak performance and better energy efficiency compared to CPUs. However, making use of this theoretical peak performance, which is based on massive data parallelism, usually requires more care and effort in the implementation. In recent years, a number of frameworks have become available that promise to simplify general-purpose GPU programming. In this work, we compare multiple such frameworks, including CUDA, SYCL, Kokkos and PyTorch, for the parallelization of neXtSIM-DG, a finite-element-based dynamical core for sea ice. We evaluate the different approaches according to their usability and performance.
