Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks

Johansell Villalobos, Josef Ruzicka, Silvio Rizzi

TL;DR

The paper addresses the challenge of performance portability for exascale HPC by porting two representative miniapps—the N-body Lennard-Jones system and a 2D structured grid Navier–Stokes-like model—onto four frameworks: Kokkos, RAJA, OCCA, and OpenMP. Using MPI on a single Polaris node with four NVIDIA A100 GPUs, the authors provide preliminary time-to-solution results, revealing substantial performance variability across frameworks and problem sizes, with OCCA excelling for small problems due to JIT compilation but lagging in reductions, and OpenMP underperforming in the structured grid case due to communication overhead. The work offers concrete insights into framework trade-offs, emphasizing the need for optimized reductions, data communication, and memory management to maximize portability without sacrificing performance. These findings inform framework selection and optimization priorities for developers targeting heterogeneous architectures and provide a foundation for future scalability studies and rigorous statistical benchmarking.

Abstract

Scientific computing in the exascale era demands increased computational power to solve complex problems across various domains. With the rise of heterogeneous computing architectures, the need for vendor-agnostic performance portability frameworks has been highlighted. Libraries like Kokkos have become essential for enabling high-performance computing applications to execute efficiently across different hardware platforms with minimal code changes. In this direction, this paper presents preliminary time-to-solution results for two representative scientific computing applications: an N-body simulation and a structured grid simulation. Both applications used a distributed memory approach and hardware acceleration through four performance portability frameworks: Kokkos, OpenMP, RAJA, and OCCA. Experiments conducted on a single node of the Polaris supercomputer using four NVIDIA A100 GPUs revealed significant performance variability among frameworks. OCCA demonstrated faster execution times for small-scale validation problems, likely due to JIT compilation; however, its lack of optimized reduction algorithms may limit scalability for larger simulations when using its out-of-the-box API. OpenMP performed poorly in the structured grid simulation, most likely due to inefficiencies in inter-node data synchronization and communication. These findings highlight the need for further optimization to maximize each framework's capabilities. Future work will focus on enhancing reduction algorithms, data communication, and memory management, as well as performing scalability studies and a comprehensive statistical analysis to evaluate and compare framework performance.
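
To make the role of reductions in the N-body miniapp concrete, the following is a minimal, framework-neutral sketch in plain Python. It is not the authors' implementation. It assumes a Lennard-Jones pair potential in reduced units (`epsilon = sigma = 1`), emulates MPI ranks with Python lists, and uses hypothetical function names (`lj_pair_energy`, `total_energy_ring`, `total_energy_direct`). Each emulated rank keeps its own block of particles while a travelling copy of every block is passed around a ring; the per-rank partial energies are then summed, which stands in for the global reduction that a real code would perform on the GPU and across ranks.

```python
import math


def lj_pair_energy(r2, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy of one pair, given the squared distance r2."""
    s6 = (sigma * sigma / r2) ** 3
    return 4.0 * epsilon * (s6 * s6 - s6)


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def total_energy_direct(positions):
    """Serial reference: sum over all unique pairs."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += lj_pair_energy(squared_distance(positions[i], positions[j]))
    return energy


def total_energy_ring(positions, num_ranks=4):
    """Emulated ring exchange (assumption: round-robin particle distribution).

    Every ordered pair is visited exactly once, so each contribution is
    weighted by 0.5 to recover the sum over unique pairs.
    """
    blocks = [positions[r::num_ranks] for r in range(num_ranks)]
    travelling = list(blocks)          # buffer each rank currently holds
    partial = [0.0] * num_ranks        # per-rank partial energies

    for step in range(num_ranks):
        for rank in range(num_ranks):
            for i, p in enumerate(blocks[rank]):
                for j, q in enumerate(travelling[rank]):
                    if step == 0 and i == j:
                        continue       # skip self-interaction in the own block
                    partial[rank] += 0.5 * lj_pair_energy(squared_distance(p, q))
        # Ring shift: rank r receives the buffer held by rank r - 1.
        travelling = [travelling[(r - 1) % num_ranks] for r in range(num_ranks)]

    # Stand-in for a global sum reduction across ranks.
    return sum(partial)


if __name__ == "__main__":
    cube = [(1.5 * x, 1.5 * y, 1.5 * z)
            for x in range(2) for y in range(2) for z in range(2)]
    print(total_energy_direct(cube), total_energy_ring(cube, num_ranks=4))
```

The two totals should agree up to floating-point rounding, since the ring version only changes the order in which pair contributions are accumulated. In a GPU port, the inner accumulation into `partial[rank]` is the part that would map onto each framework's reduction facility.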

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: MPI Ring exchange communication pattern implemented for the N-body simulation.
  • Figure 2: MPI halo exchange communication pattern implemented for the structured grid simulation.
  • Figure 3: Execution time results for the N-body simulation across frameworks including reductions for energy calculations. Verlet integration, boundary checking, and I/O are included as Other Ops.
  • Figure 4: Execution time results for the N-body simulation across frameworks without reductions for energy calculations. Verlet integration, boundary checking, and I/O are included as Other Ops.
  • Figure 5: Execution time results for the vorticity simulation across frameworks. Euler integration, boundary checking, and I/O times are negligible with respect to the halo exchange and Jacobi kernel times during the solution of the Poisson equation.
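
The halo exchange and Jacobi kernel named in the captions for Figures 2 and 5 can be illustrated with a small, framework-neutral sketch in plain Python. It is not the authors' code. It assumes a generic Poisson problem `laplacian(u) = f` on a uniform grid with fixed boundary values, a one-dimensional decomposition by rows, and emulated ranks stored as Python lists; all function names (`jacobi_sweep`, `split_rows`, `halo_exchange`, `jacobi_decomposed`) are hypothetical. Each emulated rank owns a band of interior rows plus one ghost row above and below, and the ghost rows are refreshed from the neighbouring ranks after every sweep.

```python
def jacobi_sweep(u, f, h):
    """One Jacobi sweep on interior points; first/last rows and columns are kept."""
    new = [row[:] for row in u]
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]
                                - h * h * f[i][j])
    return new


def jacobi_serial(u, f, h, sweeps):
    """Single-domain reference solution."""
    for _ in range(sweeps):
        u = jacobi_sweep(u, f, h)
    return u


def split_rows(grid, num_ranks):
    """Give each rank its owned interior rows plus one row above and below.

    Assumes the number of interior rows is at least num_ranks.
    """
    base, extra = divmod(len(grid) - 2, num_ranks)
    pieces, start = [], 1
    for rank in range(num_ranks):
        count = base + (1 if rank < extra else 0)
        pieces.append([row[:] for row in grid[start - 1:start + count + 1]])
        start += count
    return pieces


def halo_exchange(local_grids):
    """Refresh ghost rows from the neighbouring rank's owned edge rows."""
    for rank in range(len(local_grids) - 1):
        upper, lower = local_grids[rank], local_grids[rank + 1]
        upper[-1] = lower[1][:]    # bottom ghost <- neighbour's first owned row
        lower[0] = upper[-2][:]    # top ghost    <- neighbour's last owned row


def jacobi_decomposed(u, f, h, num_ranks, sweeps):
    grids = split_rows(u, num_ranks)
    rhs = split_rows(f, num_ranks)
    for _ in range(sweeps):
        grids = [jacobi_sweep(g, r, h) for g, r in zip(grids, rhs)]
        halo_exchange(grids)
    # Gather owned rows back into one global grid.
    result = [u[0][:]]
    for g in grids:
        result.extend(row[:] for row in g[1:-1])
    result.append(u[-1][:])
    return result
```

Because each sweep only reads values from the previous iterate, the decomposed run reproduces the single-domain result as long as the ghost rows are exchanged after every sweep. That per-sweep exchange is the communication step whose cost the Figure 5 caption contrasts with the Jacobi kernel time.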