Is RISC-V Ready for Machine Learning? Portable Gaussian Processes Using Asynchronous Tasks

Alexander Strack, Patrick Diehl, Dirk Pflüger

Abstract

Gaussian processes are widely used in machine learning but remain computationally demanding, which limits their scalability across diverse hardware platforms. The GPRat library targets these challenges using the asynchronous many-task runtime system HPX. In this work, we extend GPRat to enable portability across multiple hardware architectures and evaluate its performance on representative x86-64, ARM, and RISC-V chips. We conduct node-level strong-scaling and problem-size-scaling benchmarks for Gaussian process prediction and hyperparameter optimization to assess single-core performance, parallel scalability, and architectural efficiency. Our results show that while the x86-64 Zen 2 chip achieves a 58% single-core performance advantage over the ARM-based Fujitsu A64FX, superior parallel scaling allows the 48-core ARM chip to outperform the 64-core Zen 2 by 9% at full node utilization. The evaluated SOPHON SG2042 RISC-V chip exhibits substantially lower performance and weaker scalability, with single-core performance lagging by up to a factor of 14 and large-scale parallel workloads showing slowdowns of up to a factor of 25. For problem-size scaling, ARM and x86-64 systems demonstrate comparable performance within 25%. These findings highlight the growing competitiveness of ARM-based processors and emphasize the importance of wide-register vectorization support and memory subsystem improvements for upcoming RISC-V platforms.
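The two ARM versus x86-64 figures in the abstract can be combined into a rough statement about relative strong-scaling behavior. The sketch below is illustrative arithmetic only, not a result reported by the authors. It assumes that the "58% single-core performance advantage" means a runtime ratio T_arm(1) / T_x86(1) = 1.58 and that "outperform by 9%" means T_x86(64) / T_arm(48) = 1.09. Both interpretations, and all variable names, are assumptions.

```python
# Illustrative arithmetic under assumed interpretations of the abstract's numbers.
# Assumption: "58% single-core advantage" means T_arm(1) / T_x86(1) = 1.58.
# Assumption: "outperforms by 9%" means T_x86(64) / T_arm(48) = 1.09.

X86_CORES = 64
ARM_CORES = 48


def relative_speedup(single_core_ratio: float, full_node_ratio: float) -> float:
    """Return S_arm / S_x86, where S = T(1) / T(all cores) is the strong-scaling speedup.

    S_arm / S_x86 = [T_arm(1) / T_x86(1)] * [T_x86(64) / T_arm(48)]
    """
    return single_core_ratio * full_node_ratio


speedup_ratio = relative_speedup(1.58, 1.09)

# Parallel efficiency is speedup divided by core count, so the efficiency
# ratio rescales the speedup ratio by the core counts of the two nodes.
efficiency_ratio = speedup_ratio * X86_CORES / ARM_CORES

print(f"speedup ratio (ARM / x86-64):    {speedup_ratio:.2f}")
print(f"efficiency ratio (ARM / x86-64): {efficiency_ratio:.2f}")
```

Under these assumptions, the ARM node would need a full-node speedup roughly 1.7 times that of the x86-64 node to turn its single-core deficit into the reported lead, which is consistent with the abstract's claim of superior parallel scaling.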

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Strong scaling runtimes for hyperparameter optimization on up to $64$ x86-64, ARM, and RISC-V cores. The problem size was set to $N = 2^{13}$ training samples. GPRat uses 16 tiles per dimension.
  • Figure 2: Strong scaling runtimes for prediction with full covariance matrix on up to $64$ x86-64, ARM, and RISC-V cores. The problem size was set to $N = M = 2^{13}$ training and test samples. GPRat uses 16 tiles per dimension.
  • Figure 3: Problem size scaling runtimes for hyperparameter optimization on $64$ x86-64, $48$ ARM, and $32$ RISC-V cores. The number of tiles per dimension was set dynamically to one, four, and $16$.
  • Figure 4: Problem size scaling runtimes for prediction with full covariance matrix on $64$ x86-64, $48$ ARM, and $32$ RISC-V cores. The number of tiles per dimension was set dynamically to one, four, and $16$.
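
To make the tiling parameters in the captions concrete, the following sketch computes the tile edge length and tile counts for a given problem size and number of tiles per dimension. It is a minimal illustration, not GPRat code: the function name `tile_layout` is hypothetical, the problem size is assumed to divide evenly into tiles, and counting only the lower-triangular tiles assumes that the symmetry of the covariance matrix is exploited.

```python
def tile_layout(n: int, tiles_per_dim: int) -> tuple[int, int, int]:
    """Return (tile edge length, total tiles, lower-triangular tiles) for an n x n matrix.

    Hypothetical helper for illustration; assumes n is divisible by tiles_per_dim.
    """
    if n % tiles_per_dim != 0:
        raise ValueError("n must be divisible by tiles_per_dim in this sketch")
    tile_size = n // tiles_per_dim
    total_tiles = tiles_per_dim**2
    # A symmetric matrix is fully described by its diagonal and lower tiles.
    lower_tiles = tiles_per_dim * (tiles_per_dim + 1) // 2
    return tile_size, total_tiles, lower_tiles


# Strong-scaling setup from Figures 1 and 2: N = 2^13 with 16 tiles per dimension.
n_train = 2**13
for tiles in (1, 4, 16):
    size, total, lower = tile_layout(n_train, tiles)
    print(f"{tiles:>2} tiles/dim: {size} x {size} per tile, {total} tiles, {lower} in lower triangle")
```

For the strong-scaling configuration, each of the 16 x 16 tiles covers 512 x 512 entries. More tiles per dimension expose more independent tasks to the runtime at the cost of smaller per-task work, which is the trade-off behind varying the tile count in the problem-size-scaling runs.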