SYCL compute kernels for ExaHyPE
Chung Ming Loi, Heinrich Bockhorst, Tobias Weinzierl
TL;DR
Three SYCL realizations are explored for a block-structured Finite Volume kernel in ExaHyPE, mapping its compute graph to for-loops, nested parallelism, and a DAG-based task graph. The study evaluates patch-wise, batched, and task-graph realizations on GPUs (NVIDIA A100 and Intel PVC), comparing data layouts and memory movement strategies while using the Rusanov flux for the Euler equations. The results show that a hybrid of task and data parallelism delivers the best performance when it is mapped onto a purely data-parallel SYCL implementation, while dynamic task graphs introduce substantial overhead. The work provides practical guidance on SYCL kernel orchestration for heterogeneous HPC codes and highlights ongoing challenges in nested parallelism and data management.
Abstract
We discuss three SYCL realisations of a simple Finite Volume scheme over multiple Cartesian patches. The realisation flavours differ in how they map the compute steps onto loops and tasks: we compare an implementation that exclusively uses a sequence of for-loops to a version that uses nested parallelism, and finally benchmark both against a version that models the calculations as a task graph. Our work proposes idioms to realise these flavours within SYCL. The results suggest that a mixture of classic task and data parallelism performs best if we map this hybrid onto a solely data-parallel SYCL implementation, taking into account SYCL specifics and the problem size.
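To make the difference between a patch-wise and a batched loop structure concrete, the following is a minimal sketch in plain C++, not SYCL and not the ExaHyPE kernels. It rests on several simplifying assumptions: the patches are one-dimensional, the scalar Burgers flux stands in for the Euler flux, and every name (`N`, `Stride`, `rusanov`, `updatePatchWise`, `updateBatched`) as well as the data layout is hypothetical. In a SYCL port, each flat loop of the batched variant would be a natural candidate for a `parallel_for`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout: every patch stores N cells plus one halo cell per side,
// patch after patch in one contiguous array.
constexpr std::size_t N = 8;           // cells per patch (hypothetical)
constexpr std::size_t Stride = N + 2;  // cells plus two halo cells

// Burgers flux, used here as a scalar stand-in for the Euler flux.
inline double flux(double q) { return 0.5 * q * q; }

// Rusanov (local Lax-Friedrichs) numerical flux between two states.
inline double rusanov(double qL, double qR) {
  const double lambdaMax = std::max(std::abs(qL), std::abs(qR));
  return 0.5 * (flux(qL) + flux(qR)) - 0.5 * lambdaMax * (qR - qL);
}

// Patch-wise flavour: the outer loop runs over patches, and all compute
// steps of one patch finish before the next patch starts.
void updatePatchWise(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t patches, double dtOverDx) {
  for (std::size_t p = 0; p < patches; ++p) {
    double faceFlux[N + 1];
    for (std::size_t f = 0; f <= N; ++f)
      faceFlux[f] = rusanov(in[p * Stride + f], in[p * Stride + f + 1]);
    for (std::size_t c = 0; c < N; ++c)
      out[p * N + c] = in[p * Stride + c + 1]
                     - dtOverDx * (faceFlux[c + 1] - faceFlux[c]);
  }
}

// Batched flavour: one flat loop per compute step, spanning all patches.
void updateBatched(const std::vector<double>& in, std::vector<double>& out,
                   std::size_t patches, double dtOverDx) {
  std::vector<double> faceFlux(patches * (N + 1));
  for (std::size_t i = 0; i < patches * (N + 1); ++i) {
    const std::size_t p = i / (N + 1), f = i % (N + 1);
    faceFlux[i] = rusanov(in[p * Stride + f], in[p * Stride + f + 1]);
  }
  for (std::size_t i = 0; i < patches * N; ++i) {
    const std::size_t p = i / N, c = i % N;
    const std::size_t left = p * (N + 1) + c;  // left face of cell c
    out[i] = in[p * Stride + c + 1]
           - dtOverDx * (faceFlux[left + 1] - faceFlux[left]);
  }
}
```

Both functions perform the same arithmetic per cell and differ only in loop ordering and in where the intermediate face fluxes live: a small per-patch buffer in the patch-wise case, one large array across all patches in the batched case.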
