Work-In-Progress: Accelerating Numpy With OpenBLAS For Open-Source RISC-V Chips
Cyril Koenig, Enrico Zelioli, Frank K. Gürkaynak, Luca Benini
TL;DR
This work accelerates high-level linear algebra workloads on open-source RISC-V platforms by offloading selected OpenBLAS kernels to a programmable manycore accelerator (PMCA) via OpenMP, so that Numpy-driven Python applications can run efficiently on heterogeneous RISC-V systems. The authors extend OpenBLAS with heterogeneous rv64/rv32 kernels using the HeroSDK, integrate it with Numpy, and run tests on an rv64g host paired with an rv32imafd PMCA on an FPGA-emulated platform. The key result is a $2.71\times$ speedup for matrix multiplications of size $128$. Data movement accounts for a substantial portion of the runtime, and zero-copy offloading and IO-page optimizations are projected to raise the speedup to as much as $4.7\times$ in future work. This demonstrates a practical pathway to accelerating high-level applications on open-source heterogeneous RISC-V SoCs, potentially enabling more efficient ML workflows on embedded platforms.
Abstract
RISC-V allows for building general-purpose computing platforms with programmable accelerators around a single open-source ISA. However, leveraging heterogeneous SoCs within high-level applications is a tedious task. In this preliminary work, we modify the OpenBLAS library to offload selected linear algebra kernels to a programmable manycore accelerator (PMCA) using OpenMP. By linking the Python package Numpy against this library, we enable acceleration of high-level applications. We target an open-source heterogeneous System-on-Chip with an rv64g Linux-capable host and an rv32imafd PMCA. Using this platform emulated on FPGA, together with the presented software stack, we can accelerate Python applications that use linear algebra operators such as matrix multiplication.
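Because the offloading happens inside the BLAS library, the Python application itself should not need to change. The following is a minimal sketch of what such an application might look like, not the authors' benchmark: the helper name `time_matmul`, the double-precision data type, the random inputs, and the repeat count are all assumptions made for illustration. Only the matrix size of 128 comes from the text above. On a standard Numpy build the product runs on the host BLAS; with Numpy linked against an offload-enabled OpenBLAS, the same line would be a candidate for PMCA execution.

```python
import time

import numpy as np


def time_matmul(n=128, repeats=10, dtype=np.float64):
    """Return (best wall-clock seconds, product) for an n x n matrix multiply.

    Hypothetical helper for illustration; dtype and repeats are assumptions.
    """
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)).astype(dtype)
    b = rng.standard_normal((n, n)).astype(dtype)

    best = float("inf")
    c = None
    for _ in range(repeats):
        start = time.perf_counter()
        # When Numpy is built with BLAS support, this product is handed to
        # the gemm routine of whichever BLAS library Numpy was linked against.
        c = a @ b
        best = min(best, time.perf_counter() - start)
    return best, c


if __name__ == "__main__":
    # Print the build configuration to check which BLAS library is in use.
    np.show_config()
    elapsed, _ = time_matmul()
    print(f"best 128x128 matmul time: {elapsed * 1e6:.1f} us")
```

Comparing the reported time between a stock build and an offload-enabled build is one way to observe a speedup of the kind described above; the measured value will depend on the platform and on how much of the runtime is spent moving data.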
