Compiler-supported reduced precision and AoS-SoA transformations for heterogeneous hardware
Pawel K. Radtke, Tobias Weinzierl
TL;DR
Data movement is the bottleneck in heterogeneous HPC, motivating compiler-supported AoS-to-SoA transformations and reduced-precision storage. The authors introduce C++ annotations and a Clang/LLVM-based FrontendAction to enable on-the-fly data-layout changes and offloading, complemented by a formal operator framework and multiple CPU/GPU pathways. They implement and evaluate these techniques on SPH workloads across four accelerator platforms, revealing substantial GPU speedups for memory-bound kernels and strong vendor-dependent behavior for precision reduction. The work demonstrates the potential of automated, architecture-aware data-layout and precision strategies while highlighting the need for adaptive, hardware-tuned approaches in future research.
Abstract
This study evaluates array-of-structs (AoS) to struct-of-arrays (SoA) transformations over reduced-precision data layouts for a particle simulation code on several GPU platforms. We hypothesize that SoA fits SIMT execution particularly well, while AoS is the preferred storage format for many Lagrangian codes. Reduced precision (below IEEE accuracy) is an established tool to address bandwidth constraints, although it remains unclear whether AoS-to-SoA and precision conversions should execute on the CPU or be deployed to the GPU if the compute kernel itself must run on an accelerator. On modern superchips, where CPUs and GPUs share (logically) one data space, it is also unclear whether it is advantageous to stream data to the accelerator prior to the calculation, or whether we should let the accelerator transform data on demand, i.e. work in-place logically. We therefore introduce compiler annotations that facilitate such conversions and give the programmer the option to orchestrate them in combination with GPU offloading. For some of our compute kernels of interest, Nvidia's G200 platforms yield a speedup of around 2.6, while AMD's MI300A exhibits more robust performance yet profits less. We assume that our compiler-based techniques are applicable to a wide variety of Lagrangian codes and beyond.
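To make the transformation concrete, the following is a minimal hand-written C++ sketch of the kind of conversion that the annotations are meant to automate. It is not the authors' annotation syntax or generated code: the names `Particle`, `ParticleSoA`, `toSoA`, `drift`, and `writeBack` are hypothetical, the kernel is a simple position update chosen for illustration, and single precision (`float`) merely stands in for the reduced-precision storage discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record in AoS form, as a Lagrangian code might store it.
struct Particle {
  double pos[3];
  double vel[3];
  double mass;
};

// Temporary SoA view holding only the attributes the kernel touches.
// Assumption: float stands in for a reduced-precision storage format.
struct ParticleSoA {
  std::vector<float> px, py, pz;
  std::vector<float> vx, vy, vz;
};

// AoS -> SoA conversion with a narrowing precision conversion.
ParticleSoA toSoA(const std::vector<Particle>& aos) {
  const std::size_t n = aos.size();
  ParticleSoA soa;
  soa.px.resize(n); soa.py.resize(n); soa.pz.resize(n);
  soa.vx.resize(n); soa.vy.resize(n); soa.vz.resize(n);
  for (std::size_t i = 0; i < n; ++i) {
    soa.px[i] = static_cast<float>(aos[i].pos[0]);
    soa.py[i] = static_cast<float>(aos[i].pos[1]);
    soa.pz[i] = static_cast<float>(aos[i].pos[2]);
    soa.vx[i] = static_cast<float>(aos[i].vel[0]);
    soa.vy[i] = static_cast<float>(aos[i].vel[1]);
    soa.vz[i] = static_cast<float>(aos[i].vel[2]);
  }
  return soa;
}

// Memory-bound kernel on the SoA view: each attribute is a contiguous
// stream, which suits coalesced (SIMT) or vectorised (SIMD) access.
void drift(ParticleSoA& soa, float dt) {
  const std::size_t n = soa.px.size();
  for (std::size_t i = 0; i < n; ++i) {
    soa.px[i] += dt * soa.vx[i];
    soa.py[i] += dt * soa.vy[i];
    soa.pz[i] += dt * soa.vz[i];
  }
}

// SoA -> AoS write-back of the attributes the kernel modified.
void writeBack(const ParticleSoA& soa, std::vector<Particle>& aos) {
  assert(soa.px.size() == aos.size());
  for (std::size_t i = 0; i < aos.size(); ++i) {
    aos[i].pos[0] = soa.px[i];
    aos[i].pos[1] = soa.py[i];
    aos[i].pos[2] = soa.pz[i];
  }
}
```

In this sketch the three steps (convert, compute, write back) all run on the host. The questions raised in the abstract concern where each step should execute once the kernel is offloaded, and whether the converted data should be streamed ahead of the kernel or produced on demand.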
