The PLUTO Code on GPUs: Offloading Lagrangian Particle Methods

Alessio Suriano; Stefano Truzzi; Agnese Costa; Marco Rossazza; Nitin Shukla; Andrea Mignone; Vittoria Berta; Claudio Zanni

TL;DR

This work presents a GPU-compatible C++ redesign of the Lagrangian Particles (LP) module of the PLUTO code that, using the OpenACC programming model and the Message Passing Interface (MPI) library, can target both single commercial GPUs and multi-node (pre-)exascale computing facilities.

Abstract

The Lagrangian Particles (LP) module of the PLUTO code offers a powerful simulation tool to predict the non-thermal emission produced by shock-accelerated particles in large-scale relativistic magnetized astrophysical flows. The LPs represent ensembles of relativistic particles with a given energy distribution, which is updated by solving the relativistic cosmic-ray transport equation. The approach consistently includes the effects of adiabatic expansion, synchrotron emission, and inverse Compton emission. The large-scale nature of such systems creates a substantial computational demand that can only be satisfied by targeting modern computing hardware such as Graphics Processing Units (GPUs). In this work we present a GPU-compatible C++ redesign of the LP module that, using the OpenACC programming model and the Message Passing Interface (MPI) library, can target both single commercial GPUs and multi-node (pre-)exascale computing facilities. The code has been benchmarked on up to 28672 CPU cores and 1024 GPUs, demonstrating $\sim(80-90)\%$ weak scaling parallel efficiency and good strong scaling capabilities. Our results demonstrate a speedup of $6$ times when solving the same benchmark test with 128 full GPU nodes (4 GPUs per node) compared with the same number of full high-end CPU nodes (112 cores per node). Furthermore, we verified the code by comparing its predictions with the corresponding analytical solutions for two test cases. We note that this work is part of a broader project to develop gPLUTO, the new, revised GPU-ready implementation of the legacy PLUTO code.
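The scaling figures quoted in the abstract can be made concrete with a small sketch. The resource counts below (112 cores per CPU node, 4 GPUs per GPU node, 128-node comparison) come from the abstract; the reading of the largest runs as 256 full nodes is an inference from those numbers, and the function names and the usual definitions of strong-scaling speedup and weak-scaling efficiency are assumptions for illustration, not taken from the PLUTO source.

```cpp
#include <cassert>

// Resource counts quoted in the abstract.
constexpr int kCoresPerCpuNode = 112;  // cores per full CPU node
constexpr int kGpusPerGpuNode  = 4;    // GPUs per full GPU node

// Total number of compute units (cores or GPUs) on a set of full nodes.
constexpr int total_units(int nodes, int units_per_node) {
    return nodes * units_per_node;
}

// The largest benchmarks are consistent with 256 full nodes (inferred).
static_assert(total_units(256, kCoresPerCpuNode) == 28672, "CPU core count");
static_assert(total_units(256, kGpusPerGpuNode) == 1024, "GPU count");

// The 128-node comparison pits 512 GPUs against 14336 CPU cores.
static_assert(total_units(128, kGpusPerGpuNode) == 512, "GPUs on 128 nodes");
static_assert(total_units(128, kCoresPerCpuNode) == 14336, "cores on 128 nodes");

// Strong scaling (assumed definition): fixed total problem size,
// speedup = reference time / time on the larger resource count.
inline double strong_speedup(double t_ref, double t_n) {
    assert(t_ref > 0.0 && t_n > 0.0);
    return t_ref / t_n;
}

// Weak scaling (assumed definition): fixed work per unit,
// efficiency = reference time / time on the larger resource count.
inline double weak_efficiency(double t_ref, double t_n) {
    assert(t_ref > 0.0 && t_n > 0.0);
    return t_ref / t_n;
}

// Node-to-node speedup: same number of nodes, CPU time over GPU time.
inline double node_speedup(double t_cpu_nodes, double t_gpu_nodes) {
    assert(t_cpu_nodes > 0.0 && t_gpu_nodes > 0.0);
    return t_cpu_nodes / t_gpu_nodes;
}
```

With hypothetical timings, `node_speedup(60.0, 10.0)` corresponds to the factor of 6 reported for the 128-node comparison, and a weak-scaling run whose time grows from 8 s to 10 s would have `weak_efficiency(8.0, 10.0)` of 0.8, the lower end of the quoted range.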

Paper Structure (23 sections, 14 equations, 8 figures, 7 tables)

This paper contains 23 sections, 14 equations, 8 figures, and 7 tables.

Introduction
Numerical methods
Exploiting Graphic Processing Units
Programming model
Performance optimization strategies
Reduce host-device data transfer
Minimize memory allocation/deallocation
Ensure coalesced access to memory
Multi-core communication
Code sections overview
Numerical Benchmarks and Performance Assessment
Test Cases
Simple Advection
Stationary Planar Parallel Shock
Scalability
...and 8 more sections

Figures (8)

Figure 1: Errors for the particle advection problem at $t=10$, computed as the absolute relative difference between the LP position $x_{\rm LP}^\prime$ and the expected value $x_{\rm th}^\prime$ obtained from Eq. (\ref{eq:advection_exact_sol}). Red and blue solid lines correspond, respectively, to the $2^{\rm nd}$- and $3^{\rm rd}$-order Runge-Kutta algorithms, while dashed lines give the expected accuracy.
Figure 2: Initial (blue) and post-shock (orange) spectra of an LP that crosses the discontinuity. The green dashed line represents a power law with index $q=2.14$, corresponding to the slope predicted by Eq. (\ref{eq:slope}) for an MHD shock with the compression ratio imposed in the present initial configuration. Note that the high-energy tail of the orange curve deviates from a straight line because of the synchrotron cooling induced by the magnetic field.
Figure 3: CPU speedup on the Marenostrum 5 GPP partition. The red and orange solid lines represent the measured speedup for the advection and shock test problems, respectively. The grey dotted line is the theoretical speedup, corresponding to a time to solution inversely proportional to the number of cores employed.
Figure 4: GPU strong scaling speedup on Marenostrum 5 (left panel) and Leonardo Booster (right panel). Each pair of curves with the same marker and the same colour refers to a given problem size in Table \ref{tab:gpustrong}. The solid lines refer to problem $\mathcal{A}$, whereas the dashed lines refer to problem $\mathcal{S}$. The grey dotted line is the theoretical speedup.
Figure 5: Weak scaling parallel efficiency up to 256 nodes for test $\mathcal{A}$, the advection problem (red line), and test $\mathcal{S}$, the shock problem (orange line), measured on the GPP partition of Marenostrum 5.
...and 3 more figures
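
The error measure named in the caption of Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, and the convergence-order estimate assumes that errors are available at two step sizes and follow a power law in the step size, which is the standard way to compare a measured slope against the expected 2nd- or 3rd-order accuracy.

```cpp
#include <cassert>
#include <cmath>

// Absolute relative difference between the computed particle position
// x_lp and the expected (analytical) position x_th.
inline double relative_error(double x_lp, double x_th) {
    assert(x_th != 0.0);
    return std::fabs(x_lp - x_th) / std::fabs(x_th);
}

// Observed order of convergence from errors measured at two step sizes,
// assuming err ~ C * h^p (hypothetical helper, standard estimate):
//   p = log(err_coarse / err_fine) / log(h_coarse / h_fine)
inline double observed_order(double err_coarse, double err_fine,
                             double h_coarse, double h_fine) {
    assert(err_coarse > 0.0 && err_fine > 0.0);
    assert(h_fine > 0.0 && h_coarse > h_fine);
    return std::log(err_coarse / err_fine) / std::log(h_coarse / h_fine);
}
```

For a second-order scheme, halving the step size should reduce the error by about a factor of 4, so `observed_order` returns a value close to 2; a factor of 8 corresponds to third order.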
