Efficient GPU Implementation of Particle Interactions with Cutoff Radius and Few Particles per Cell

David Algis, Berenger Bramas, Emmanuelle Darles, Lilian Aveneau

TL;DR

The paper tackles the challenge of efficiently computing pairwise interactions with a cutoff radius on GPUs when there are few particles per grid cell. It introduces two shared-memory strategies, All-in-SM (full sub-box loading) and X-pencil (X-oriented pencils), together with a lightweight shared-memory prefix sum, and evaluates their performance across three GPU architectures. The findings show that All-in-SM is often impractical due to occupancy constraints, while the X-pencil approach yields notable speedups in memory-bound scenarios (up to about 2.5x on some GPUs), with results highly sensitive to workload and device characteristics. Overall, the work demonstrates that data reuse via shared memory can help in low-arithmetic-intensity regimes, though the best strategy varies by architecture and has to be identified by profiling; future work includes porting to other platforms such as Intel GPUs.

Abstract

This paper presents novel approaches to parallelizing particle interactions on a GPU when there are few particles per cell and the interactions are limited by a cutoff distance. The paper surveys classical algorithms and then introduces two alternatives that aim to utilize shared memory. The first approach copies the particles of a sub-box, while the second approach loads particles in a pencil along the X-axis. The different implementations are compared on three GPU models using CUDA and HIP. The results show that the X-pencil approach can provide a significant speedup, but only in very specific cases.

Paper Structure (18 sections, 1 equation, 8 figures, 1 table, 6 algorithms)

Figures (8)

  • Figure 1: Example of a 2x3 cell grid with 9 particles. We use an array that contains the number of particles per cell and use it to compute the prefix sum.
  • Figure 2: 2D example of sub-box configuration. In this case, consider that the maximum number of cells that can be loaded is $5\times4=20$. As this count includes the ghost cells, the final target cells form a rectangle of $3\times2$ cells (Figure \ref{subfig:cell-config}). Each rectangle (sub-box in 3D) is assigned to a thread-block and is loaded with its neighbors in shared memory (Figure \ref{subfig:sub-box-config}). We have the guarantee that any cell contains at most $M_C$ particles, so we allocate enough memory to store $20 \times M_C$ particles.
  • Figure 3: 2D example of a local offset computation.
  • Figure 4: 2D example of the X-pencil. First, the X-pencil that covers the target particles is loaded in shared memory and used for computation. Then, a loop in Y/Z loads the other pencils, one at a time, and uses them for computation.
  • Figure 5: 2D example of the X-pencil-reg. First, the target particles are loaded in registers. Then, the source particles are loaded in shared memory (one pencil after the other). When possible, the source particles are copied from the registers. After each load of a pencil, the interactions are computed. It is possible that some target cells are not involved in the computation for some iterations, and the threads are idle, depending on the pencil's position.
  • ...and 3 more figures
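
The data layout described in the Figure 1 caption (an array of per-cell particle counts turned into a prefix sum) can be illustrated with a minimal host-side sketch. This is not the paper's GPU implementation: the function name `cellOffsets` and the per-cell counts used in the note below are hypothetical, chosen only so that a 2x3 grid holds 9 particles. It assumes the particles are stored sorted by cell, so that the exclusive prefix sum gives the index of the first particle of each cell.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum over per-cell particle counts (hypothetical helper).
// offsets[c] is the index of the first particle of cell c in a cell-sorted
// particle array, and offsets.back() is the total number of particles.
std::vector<std::size_t> cellOffsets(const std::vector<std::size_t>& countsPerCell) {
    std::vector<std::size_t> offsets(countsPerCell.size() + 1, 0);
    for (std::size_t c = 0; c < countsPerCell.size(); ++c) {
        offsets[c + 1] = offsets[c] + countsPerCell[c];
    }
    return offsets;
}
```

With illustrative counts `{2, 1, 0, 3, 1, 2}` for the six cells, the particles of cell `c` occupy the half-open index range `[offsets[c], offsets[c + 1])`, so an empty cell simply yields an empty range.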