Work-In-Progress: Accelerating Numpy With OpenBLAS For Open-Source RISC-V Chips

Cyril Koenig, Enrico Zelioli, Frank K. Gürkaynak, Luca Benini

TL;DR

This work tackles the challenge of accelerating high-level linear algebra workloads on open-source RISC-V platforms by offloading selected OpenBLAS kernels to a programmable manycore accelerator (PMCA) via OpenMP, enabling Numpy-driven Python applications to run efficiently on heterogeneous RISC-V systems. The authors extend OpenBLAS with heterogeneous rv64/rv32 kernels using the HeroSDK, integrate it with Numpy, and run tests on a rv64g host paired with a rv32imafd PMCA on an FPGA-emulated platform. The key finding is a $2.71\times$ speedup for $128$-sized matrix multiplications. Data movement accounts for a substantial portion of the runtime, and zero-copy offloading and IO-page optimizations are projected to raise the speedup to as much as $4.7\times$ in future work. This demonstrates a practical pathway to accelerate high-level applications on open-source heterogeneous RISC-V SoCs, potentially enabling more efficient ML workflows on embedded platforms.
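To put the two speedup figures in relation, the short calculation below estimates how much of the current offloaded runtime would have to disappear to move from $2.71\times$ to $4.7\times$. It is only a back-of-the-envelope sketch: it assumes both figures refer to the same workload and the same host baseline, which the summary does not state explicitly, and the function name is hypothetical.

```python
def implied_overhead_fraction(current_speedup: float, projected_speedup: float) -> float:
    """Fraction of the current offloaded runtime that must be removed
    to reach the projected speedup, assuming an unchanged host baseline.

    With host time T, the offloaded times are T / current_speedup and
    T / projected_speedup; the fraction saved is 1 - current / projected.
    """
    if not 0 < current_speedup <= projected_speedup:
        raise ValueError("expected 0 < current_speedup <= projected_speedup")
    return 1.0 - current_speedup / projected_speedup


# Figures quoted in the TL;DR (assumed to share one baseline).
fraction = implied_overhead_fraction(2.71, 4.7)
print(f"{fraction:.1%} of the offloaded runtime would need to be eliminated")
```

Under that assumption, roughly 42% of the present offloaded runtime would have to be removed, which is consistent with the statement that data movement is a substantial part of the runtime.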

Abstract

RISC-V allows for building general-purpose computing platforms with programmable accelerators around a single open-source ISA. However, leveraging heterogeneous SoCs within high-level applications is a tedious task. In this preliminary work, we modify the OpenBLAS library to offload selected linear kernels to a programmable manycore accelerator (PMCA) using OpenMP. By linking the Python package Numpy against this library, we enable acceleration of high-level applications. We target an open-source heterogeneous System-on-Chip with a rv64g Linux-capable host and a rv32imafd PMCA. Using this platform emulated on FPGA, and the presented software stack, we can accelerate Python applications with linear algebra operators like matrix multiplication.

Paper Structure

This paper contains 8 sections, 3 figures.

Figures (3)

  • Figure 1: Open-source heterogeneous platform with CVA6 and Snitch. The L1 SPM contains the device local data, the dual-port L2 SPM contains constants and device instructions, and the device DRAM contains physically contiguous buffers for shared data structures.
  • Figure 2: The proposed software architecture. The Hero library ① contains device management functions. The OpenMP target library ② contains the callbacks for the OpenMP API. The OpenBLAS library ③ contains computing kernels for host and/or device. The Numpy ④ package is linked against OpenBLAS. Finally, the user application ⑤ imports Numpy.
  • Figure 3: Execution time (measured from Python) for a float64 matrix multiplication with and without offloading.
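A measurement like the one in Figure 3 can be approximated from user code with a few lines of Python. The sketch below is not the authors' benchmark: the function name, the repeat count, and the default size of 128 are illustrative assumptions, and whether the multiplication is actually offloaded depends entirely on the OpenBLAS build that Numpy is linked against.

```python
import time

import numpy as np


def time_matmul(n: int = 128, repeats: int = 10):
    """Return the best wall-clock time (seconds) of an n x n float64
    matrix multiplication, together with the last result."""
    rng = np.random.default_rng(0)      # fixed seed for repeatable inputs
    a = rng.random((n, n))              # float64 by default
    b = rng.random((n, n))

    c = a @ b                           # warm-up call, excluded from timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        c = a @ b                       # dispatched to the linked BLAS GEMM
        best = min(best, time.perf_counter() - start)
    return best, c


best, result = time_matmul()
print(f"best of 10 runs: {best * 1e3:.3f} ms, result shape {result.shape}")
```

Calling `np.show_config()` in the same session reports which BLAS library Numpy was built against, which is a quick way to check that the intended OpenBLAS build is in use before comparing timings.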