Decoupled Control Flow and Data Access in RISC-V GPGPUs

Giuseppe M. Sarda, Nimish Shah, Abubakr Nada, Debjyoti Bhattacharjee, Marian Verhelst

TL;DR

The paper tackles performance bottlenecks in the open-source Vortex GPGPU by decoupling control flow from data access. It introduces a hardware Control Flow Manager (CFM) with hardware loops and a Loop Predication Stack (LPS) plus Decoupled Memory Streaming Lanes (DMSLs) and an enhanced memory subsystem, all configurable via CSRs. Empirical results show up to $8\times$ speedups and $10\times$ fewer dynamic instructions, with modest area penalties, making a single enhanced core competitive with multiple baseline cores. The contributions position Vortex as a practical, extensible platform for GPGPU and ML research.

Abstract

Vortex, a newly proposed open-source GPGPU platform based on the RISC-V ISA, offers a valid alternative for GPGPU research over the broadly used modeling platforms based on commercial GPUs. Similarly to the push originating from the RISC-V movement for CPUs, Vortex can enable a myriad of fresh research directions for GPUs. However, as a young hardware platform, it currently lacks the performance competitiveness of commercial GPUs, which is crucial for widespread adoption. State-of-the-art GPUs, in fact, rely on complex architectural features, still unavailable in Vortex, to hide the micro-code overheads linked to control flow (CF) management and memory orchestration for data access. In particular, these components account for the majority of the dynamic instruction count in regular, memory-intensive kernels, such as linear algebra routines, which form the basis of many applications, including Machine Learning. To address these challenges with simple yet powerful micro-architecture modifications, this paper introduces decoupled CF and data access through (1) a hardware CF manager to accelerate branching and predication in regular loop execution and (2) decoupled memory streaming lanes to further hide memory latency with useful computation. The evaluation results for different kernels show $8\times$ faster execution, $10\times$ reduction in dynamic instruction count, and overall performance improvement from 0.35 to 1.63 $\mathrm{GFLOP/s/mm^2}$. Thanks to these enhancements, Vortex can become an ideal playground to enable GPGPU research for the next generation of Machine Learning.
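
To make the overhead described in the abstract concrete, the following is a minimal, illustrative sketch of a scalar vector addition in plain C. It is not the paper's OpenCL kernel and not Vortex compiler output; the function name `vecadd` and the statement-by-statement split are assumptions made here for illustration. The comments mark which statements belong to control flow, which to data access, and which perform actual computation.

```c
#include <stddef.h>

/* Illustrative only: scalar vector addition, written so that each kind of
 * per-iteration work is visible. A real compiler will schedule and fuse
 * these steps differently. */
void vecadd(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;                  /* control flow: loop counter setup */
    while (i < n) {                /* control flow: compare and branch */
        const float *pa = a + i;   /* data access: address computation */
        const float *pb = b + i;   /* data access: address computation */
        float *pc = c + i;         /* data access: address computation */
        float x = *pa;             /* data access: load */
        float y = *pb;             /* data access: load */
        float sum = x + y;         /* compute: the only arithmetic on data */
        *pc = sum;                 /* data access: store */
        i++;                       /* control flow: counter increment */
    }
}
```

In this sketch only one statement per iteration does arithmetic on the data; the rest manages the loop or moves data. That is the category of work the abstract attributes to control flow management and memory orchestration, and which the proposed hardware CF manager and decoupled memory streaming lanes aim to take out of the instruction stream.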

Paper Structure

This paper contains 20 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Vortex performance on various kernels, normalized to peak performance (top); roots for this performance degradation coupled to the proposed innovations targeted in this paper (bottom).
  • Figure 2: The OpenCL vecadd code compiled in RISC-V assembly.
  • Figure 3: System level integration and extension units interaction with the GPGPU pipeline. The CF Manager (CFM) is placed in the fetch stage, while Decoupled Memory Streaming Lanes (DMSLs) sit in the issue stage.
  • Figure 4: Hardware loops unit microarchitecture. The unit removes loop CF overhead by accelerating loop branch instructions and incrementing the loop iteration counter at the fetch stage.
  • Figure 5: Loop predication stack (LPS) microarchitecture. The stack removes predication overhead by applying fine-grain control over active threads in nested loops.
  • ...and 4 more figures