Preparing for HPC on RISC-V: Examining Vectorization and Distributed Performance of an Astrophysics Application with HPX and Kokkos

Patrick Diehl; Panagiotis Syskakis; Gregor Daiß; Steven R. Brandt; Alireza Kheirkhahan; Srinivas Yadav Singanaboina; Dominic Marcello; Chris Taylor; John Leidel; Hartmut Kaiser

TL;DR

This work assesses the viability of HPC on desktop-grade RISC-V hardware by porting the astrophysics code Octo-Tiger to a RISC-V+HPX+Kokkos stack and introducing a RISC-V RVV backend for std::experimental::simd. It demonstrates how RVV vectorization and HPX/Kokkos integration enable scalable performance on a two-node MILK-V Pioneer cluster and a Banana Pi board, with cross-comparisons to the A64FX-based Fugaku system. Key contributions include a practical RVV library implementation, targeted HPX optimizations for RISC-V atomics, and detailed node- and distributed-scale performance and power measurements across multiple real-world astrophysical scenarios (DWD, v1309). The results indicate that RISC-V hardware can approach or exceed certain performance metrics of contemporary ARM-based HPC nodes while offering lower power consumption, supporting cautious optimism for RISC-V as a viable HPC platform and guiding future heterogeneous and vector-enabled developments.

Abstract

In recent years, interest in RISC-V computing architectures has moved from academia to the mainstream, especially in High Performance Computing (HPC), where energy limitations are increasingly a concern. As of this year, the first single-board RISC-V CPUs implementing the ratified vector specification are being released. The RISC-V vector specification follows in the tradition of vector processors found in the CDC STAR-100, the Cray-1, the Convex C-Series, and the NEC SX machines and accelerators. This family of vector processors supports variable-length array processing, as opposed to the fixed-length processing offered by SIMD. Vector processors also allow vector chaining, in which temporary results are used without the need to resolve memory references. In this work, we use Octo-Tiger, a multi-physics, multi-scale, 3D adaptive mesh refinement astrophysics application, to study these early RISC-V chips with vector machine support. We report on our experience in porting this modern C++ code (which is built upon several open-source libraries such as HPX and Kokkos) to RISC-V. In addition, we show the impact of the RISC-V Vector extension on a RISC-V single-board computer by implementing the std::experimental::simd interface and integrating it with our code. We also compare the application's performance, scalability, and power consumption on a desktop-grade RISC-V computer to those on an A64FX system.
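The contrast between variable-length vector processing and fixed-length SIMD can be illustrated with a strip-mined loop. The following is a minimal sketch in plain C++, not the paper's RVV backend and not real RVV intrinsics: `set_vl`, `axpy_vla`, and the `vlmax` parameter are hypothetical names introduced here to emulate how a vector-length-agnostic loop asks the hardware how many elements to process per iteration, so that no separate scalar remainder loop is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a "set vector length" instruction: given the
// number of remaining elements and the hardware maximum (vlmax), return
// how many elements are processed in this iteration.
std::size_t set_vl(std::size_t remaining, std::size_t vlmax) {
    return std::min(remaining, vlmax);
}

// Vector-length-agnostic AXPY: y[i] += a * x[i].
// Returns the number of strip-mined iterations that were executed.
std::size_t axpy_vla(double a, const std::vector<double>& x,
                     std::vector<double>& y, std::size_t vlmax) {
    if (vlmax == 0) return 0;  // guard against a non-terminating loop
    const std::size_t n = std::min(x.size(), y.size());
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < n;) {
        const std::size_t vl = set_vl(n - i, vlmax);
        // The inner loop emulates one vector instruction acting on vl elements.
        for (std::size_t j = 0; j < vl; ++j) {
            y[i + j] += a * x[i + j];
        }
        i += vl;
        ++iterations;
    }
    return iterations;
}
```

With 10 elements and an assumed `vlmax` of 4, the loop runs three times (4, 4, and 2 elements); with a `vlmax` of 16 it runs once. The same source handles both cases, whereas a fixed-width SIMD loop would typically need a tail loop for the leftover elements.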

Paper Structure (25 sections, 8 figures, 5 tables)

Introduction
Related Work
Software stack
HPX
Kokkos and HPX-Kokkos
RISC-V Vector (RVV) Library
Octo-Tiger
Scientific applications
Double White Dwarf Systems
v1309
In-House RISC-V Cluster
MILK-V Pioneer
Banana Pi BPI-F3
Performance results
Effect of scalable vector extensions (Banana Pi BPI-F3)
...and 10 more sections

Figures (8)

Figure 1: Image of one of the MILK-V Pioneer nodes of the in-house cluster. Each node has a 64-core SOPHON SG2042 RISC-V CPU and 128 GB DDR4 System Memory.
Figure 2: Single node scaling for a rotating star on Banana Pi BPI-F3 using scalar values and RISC-V vector extensions.
Figure 3: Single node scaling for a rotating star on ARM A64FX and RISC-V.
Figure 4: Single node scaling for the DWD Separated and DWD Merging scenarios. For DWD Merging with 11 levels, we only used the optimized code. The run on 8 cores took around 34 hours and we skipped the runs on 4 cores, 2 cores, and a single core since these runs were not feasible.
Figure 5: Distributed scaling for a single node and two nodes using MPI for communication on RISC-V and Supercomputer Fugaku. Unfortunately, our in-house cluster only had two nodes. Note that we used all cores of the nodes. Recall that each A64FX node has 48 cores and each RISC-V node has 64 cores.
...and 3 more figures
