Efficient GPU Implementation of Particle Interactions with Cutoff Radius and Few Particles per Cell
David Algis, Berenger Bramas, Emmanuelle Darles, Lilian Aveneau
TL;DR
The paper addresses how to compute pairwise particle interactions with a cutoff radius efficiently on GPUs when each grid cell holds only a few particles. It introduces two shared-memory strategies, All-in-SM (which loads a full sub-box) and X-pencil (which loads pencils oriented along the X-axis), together with a lightweight shared-memory prefix sum, and evaluates them on three GPU architectures. The findings show that All-in-SM is often impractical because of occupancy constraints, while the X-pencil approach yields notable speedups in memory-bound scenarios (up to about 2.5x on some GPUs), with results highly sensitive to workload and device characteristics. Overall, the work demonstrates that data reuse through shared memory can help in low-arithmetic-intensity regimes, although choosing the best strategy requires profiling and the choice may vary by architecture. Future work includes porting to other platforms such as Intel GPUs.
Abstract
This paper presents novel approaches to parallelizing particle interactions on a GPU when there are few particles per cell and the interactions are limited by a cutoff distance. The paper surveys classical algorithms and then introduces two alternatives that aim to utilize shared memory. The first approach copies the particles of a sub-box into shared memory, while the second approach loads particles in a pencil along the X-axis. The different implementations are compared on three GPU models using CUDA and HIP. The results show that the X-pencil approach can provide a significant speedup, but only in very specific cases.
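
For orientation, the classical cell-list scheme that the abstract refers to can be sketched on the CPU as follows. This is a minimal Python illustration under stated assumptions, not the paper's CUDA/HIP implementation. The function names, the cubic non-periodic box, and the choice to count pairs instead of evaluating an interaction kernel are all hypothetical and chosen only for this example. The sketch also shows where a prefix sum enters: it turns per-cell particle counts into start offsets.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def build_cell_list(positions, n, box):
    """Group particle indices by cell in an n x n x n grid over [0, box]^3.

    Returns (cell_start, order): the particles of cell c are
    order[cell_start[c]:cell_start[c + 1]].
    """
    counts = [0] * (n * n * n)
    cell_ids = []
    for p in positions:
        # Clamp so that a coordinate exactly equal to `box` stays in range.
        ix, iy, iz = (min(n - 1, int(c / box * n)) for c in p)
        cid = (iz * n + iy) * n + ix  # X is the fastest-varying index
        cell_ids.append(cid)
        counts[cid] += 1
    # Exclusive prefix sum: per-cell counts become start offsets.
    cell_start = [0] * (len(counts) + 1)
    for c, k in enumerate(counts):
        cell_start[c + 1] = cell_start[c] + k
    # Scatter particle indices into their cell's slot range.
    cursor = cell_start[:-1]
    order = [0] * len(positions)
    for i, cid in enumerate(cell_ids):
        order[cursor[cid]] = i
        cursor[cid] += 1
    return cell_start, order


def count_pairs_cell_list(positions, box, cutoff):
    """Count unordered pairs closer than `cutoff` using a cell list."""
    # Cell width box / n is at least `cutoff` (when box >= cutoff), so every
    # partner of a particle lies in one of the 27 surrounding cells.
    n = max(1, int(box / cutoff))
    cell_start, order = build_cell_list(positions, n, box)
    r2 = cutoff * cutoff
    offsets = (-1, 0, 1)
    pairs = 0
    for iz in range(n):
        for iy in range(n):
            for ix in range(n):
                cid = (iz * n + iy) * n + ix
                for a in range(cell_start[cid], cell_start[cid + 1]):
                    i = order[a]
                    for dz in offsets:
                        for dy in offsets:
                            for dx in offsets:
                                jx, jy, jz = ix + dx, iy + dy, iz + dz
                                if not (0 <= jx < n and 0 <= jy < n and 0 <= jz < n):
                                    continue  # non-periodic box: skip outside cells
                                nid = (jz * n + jy) * n + jx
                                for b in range(cell_start[nid], cell_start[nid + 1]):
                                    j = order[b]
                                    # i < j counts each unordered pair once.
                                    if i < j and dist2(positions[i], positions[j]) <= r2:
                                        pairs += 1
    return pairs


def count_pairs_brute_force(positions, cutoff):
    """Reference O(N^2) count used to check the cell-list version."""
    r2 = cutoff * cutoff
    m = len(positions)
    return sum(
        1
        for i in range(m)
        for j in range(i + 1, m)
        if dist2(positions[i], positions[j]) <= r2
    )
```

In this layout the cells that share the same Y and Z indices are contiguous in `order`, because X varies fastest. That contiguity is the kind of property a strategy that loads particles along the X-axis can exploit, although how the paper's kernels actually stage data in shared memory is not described in this summary.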
