Compiler-supported reduced precision and AoS-SoA transformations for heterogeneous hardware

Pawel K. Radtke, Tobias Weinzierl

TL;DR

Data movement is the bottleneck in heterogeneous HPC, motivating compiler-supported AoS-to-SoA transformations and reduced-precision storage. The authors introduce C++ annotations and a Clang/LLVM-based FrontendAction to enable on-the-fly data-layout changes and offloading, complemented by a formal operator framework and multiple CPU/GPU pathways. They implement and evaluate these techniques on SPH workloads across four accelerator platforms, revealing substantial GPU speedups for memory-bound kernels and strong vendor-dependent behavior for precision reduction. The work demonstrates the potential of automated, architecture-aware data-layout and precision strategies while highlighting the need for adaptive, hardware-tuned approaches in future research.

Abstract

This study evaluates AoS-to-SoA transformations over reduced-precision data layouts for a particle simulation code on several GPU platforms. We hypothesize that SoA fits particularly well to SIMT, while AoS is the preferred storage format for many Lagrangian codes. Reduced precision (below IEEE accuracy) is an established tool to address bandwidth constraints, although it remains unclear whether AoS and precision conversions should execute on a CPU or be deployed to a GPU if the compute kernel itself must run on an accelerator. On modern superchips where CPUs and GPUs share (logically) one data space, it is also unclear whether it is advantageous to stream data to the accelerator prior to the calculation, or whether we should let the accelerator transform data on demand, i.e. work in-place logically. We therefore introduce compiler annotations to facilitate such conversions and to give the programmer the option to orchestrate the conversions in combination with GPU offloading. For some of our compute kernels of interest, Nvidia's GH200 platforms yield a speedup of around 2.6, while AMD's MI300A exhibits more robust performance yet profits less. We assume that our compiler-based techniques are applicable to a wide variety of Lagrangian codes and beyond.
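To make the two ideas in the abstract concrete, the following is a minimal hand-written sketch of an AoS-to-SoA conversion with an optional narrower storage type. It is not the authors' compiler-generated code or their annotation syntax: the `Particle` fields, the `ParticlesSoA` template, and the `toSoA`/`toAoS` helpers are hypothetical names chosen for illustration, and `float` storage is only a simple stand-in for the paper's below-IEEE formats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record in AoS form; the field names are illustrative.
struct Particle {
  double x, y, z;
  double density;
};

// SoA buffers: one contiguous array per field. Choosing Storage = float is a
// simple stand-in for reduced-precision storage (an assumption of this
// sketch, not the paper's sub-IEEE encoding).
template <typename Storage>
struct ParticlesSoA {
  std::vector<Storage> x, y, z, density;
};

// Gather: copy each field of the AoS input into its own contiguous array,
// narrowing to the storage type on the way.
template <typename Storage>
ParticlesSoA<Storage> toSoA(const std::vector<Particle>& aos) {
  ParticlesSoA<Storage> soa;
  const std::size_t n = aos.size();
  soa.x.reserve(n);
  soa.y.reserve(n);
  soa.z.reserve(n);
  soa.density.reserve(n);
  for (const Particle& p : aos) {
    soa.x.push_back(static_cast<Storage>(p.x));
    soa.y.push_back(static_cast<Storage>(p.y));
    soa.z.push_back(static_cast<Storage>(p.z));
    soa.density.push_back(static_cast<Storage>(p.density));
  }
  return soa;
}

// Scatter: write the SoA buffers back into an AoS array of the same length,
// widening each value to double again.
template <typename Storage>
void toAoS(const ParticlesSoA<Storage>& soa, std::vector<Particle>& aos) {
  assert(soa.x.size() == aos.size());
  for (std::size_t i = 0; i < aos.size(); ++i) {
    aos[i].x = static_cast<double>(soa.x[i]);
    aos[i].y = static_cast<double>(soa.y[i]);
    aos[i].z = static_cast<double>(soa.z[i]);
    aos[i].density = static_cast<double>(soa.density[i]);
  }
}
```

A kernel would call `toSoA<float>(particles)`, operate on the per-field arrays, and finish with `toAoS(buffers, particles)`. Whether this gather and scatter runs on the CPU or on the GPU is exactly the choice the abstract describes as open.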

Paper Structure

This paper contains 24 sections, 11 equations, 5 figures, 2 tables, 3 algorithms.

Figures (5)

  • Figure 6.1: AoS–SoA layout transformation speedup on GPU relative to the CPU baseline for the H100 node (top left), the MI200 node (top right), the GH200 node (bottom left), and the MI300A node (bottom right). The 16-bit variants use the reduced-precision SoA buffers, which store floating-point data as 16-bit deltas. Values above $1$ indicate that the GPU-based conversion is faster, whereas values below $1$ mean the CPU-based conversion is more performant.
  • Figure 6.2: SPH kernel compute speedup using SoA vs AoS layout for the H100 node (top left), the MI200 node (top right), the GH200 node (bottom left), and the MI300A node (bottom right). The 16-bit variants use the reduced-precision SoA buffers. Values above $1$ indicate that the SoA layout provides a speedup, whereas values below $1$ mean the vanilla AoS layout is more performant.
  • Figure 6.3: End-to-end speedup of GPU-based SoA vs AoS baseline for the H100 node (top left), the MI200 node (top right), the GH200 node (bottom left), and the MI300A node (bottom right). The GPU-based transformation is chosen as the more performant based on the results above. The 16-bit variants use the reduced-precision SoA buffers. Values above $1$ indicate that the AoS–SoA transformation provides an end-to-end speedup, whereas values below $1$ mean that the vanilla computation consuming AoS data is more performant.
  • Figure 7.1: Central region of the Noh problem in 2D at $t=0.1$ for simulation runs using full 64-bit storage precision (left), 32-bit reduced precision (center), and 16-bit reduced precision (right) from our related work [Radtke:2027:CPP]. Truncation below 32-bit precision leads to non-physical symmetry breaking of the experiment.
  • Figure 7.2: Average per-particle acceleration error introduced by a single invocation of the Force kernel over increasingly compressed floating-point data vs a full 64-bit precision reference. Vertical lines at $N=32$ and $N=16$ mark the transition points to the single and half precision IEEE layouts as the base for truncation, respectively.