Decoupled Control Flow and Data Access in RISC-V GPGPUs
Giuseppe M. Sarda, Nimish Shah, Abubakr Nada, Debjyoti Bhattacharjee, Marian Verhelst
TL;DR
The paper tackles performance bottlenecks in the open-source Vortex GPGPU by decoupling control flow from data access. It introduces a hardware Control Flow Manager (CFM) with hardware loops and a Loop Predication Stack (LPS) plus Decoupled Memory Streaming Lanes (DMSLs) and an enhanced memory subsystem, all configurable via CSRs. Empirical results show up to $8\times$ speedups and $10\times$ fewer dynamic instructions, with modest area penalties, making a single enhanced core competitive with multiple baseline cores. The contributions position Vortex as a practical, extensible platform for GPGPU and ML research.
Abstract
Vortex, a newly proposed open-source GPGPU platform based on the RISC-V ISA, offers a viable alternative for GPGPU research to the widely used modeling platforms based on commercial GPUs. Much as the RISC-V movement did for CPUs, Vortex can enable many fresh research directions for GPUs. However, as a young hardware platform, it currently lacks the performance competitiveness of commercial GPUs, which is crucial for widespread adoption. State-of-the-art GPUs rely on complex architectural features, still unavailable in Vortex, to hide the micro-code overheads linked to control flow (CF) management and memory orchestration for data access. These overheads account for the majority of the dynamic instruction count in regular, memory-intensive kernels, such as linear algebra routines, which form the basis of many applications, including Machine Learning. To address these challenges with simple yet powerful micro-architecture modifications, this paper introduces decoupled CF and data access through (1) a hardware CF manager that accelerates branching and predication in regular loop execution and (2) decoupled memory streaming lanes that further hide memory latency behind useful computation. The evaluation results for different kernels show up to $8\times$ faster execution, a $10\times$ reduction in dynamic instruction count, and an improvement in area-normalized performance from 0.35 to 1.63 $\mathrm{GFLOP/s/mm^2}$. Thanks to these enhancements, Vortex can become an ideal playground to enable GPGPU research for the next generation of Machine Learning.
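To make the claim about dynamic instruction count concrete, the following is a minimal toy model, not the paper's methodology. It compares a software loop that pays for loop control and address updates on every iteration with a hypothetical decoupled core that configures the loop and memory streams once. The per-iteration instruction counts, the `setup` cost, and both function names are illustrative assumptions; the resulting ratio should not be read as a reproduction of the reported results. The last function only restates the area-normalized figures given in the abstract.

```python
# Toy instruction-count model for a regular, memory-intensive loop
# (e.g. y[i] = a * x[i] + y[i]). All per-iteration counts below are
# illustrative assumptions, not measurements from the paper.

def baseline_dynamic_instructions(n_iters: int) -> int:
    """Software loop: each iteration pays for CF and address bookkeeping."""
    loads = 2         # load x[i] and y[i]
    compute = 1       # one fused multiply-add
    stores = 1        # store y[i]
    addr_updates = 3  # advance the three pointers (assumed)
    loop_control = 2  # counter increment plus compare-and-branch (assumed)
    per_iter = loads + compute + stores + addr_updates + loop_control
    return n_iters * per_iter


def decoupled_dynamic_instructions(n_iters: int, setup: int = 6) -> int:
    """Hypothetical decoupled core: loop bounds and memory streams are
    configured once (assumed `setup` instructions, e.g. CSR writes);
    only the compute instruction is issued per iteration."""
    compute = 1
    return setup + n_iters * compute


def area_efficiency_gain(before: float = 0.35, after: float = 1.63) -> float:
    """Ratio of the GFLOP/s/mm^2 figures quoted in the abstract."""
    return after / before


if __name__ == "__main__":
    n = 1024
    base = baseline_dynamic_instructions(n)
    dec = decoupled_dynamic_instructions(n)
    print(f"baseline: {base}, decoupled: {dec}, ratio: {base / dec:.2f}")
    print(f"area-normalized gain: {area_efficiency_gain():.2f}x")
```

Under these assumed counts, most of the baseline's dynamic instructions are bookkeeping rather than computation, which is the overhead that decoupling targets. The quoted move from 0.35 to 1.63 $\mathrm{GFLOP/s/mm^2}$ corresponds to roughly a $4.7\times$ gain in area-normalized performance.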
