Late Breaking Results: A RISC-V ISA Extension for Chaining in Scalar Processors

Luca Colagrande, Jayanth Jonnalagadda, Luca Benini

TL;DR

The paper addresses pipeline stalls caused by data hazards in the in-order scalar processors used in energy-constrained accelerators. It introduces scalar chaining, a hardware-software technique that holds intermediate results in pipeline registers and enables chaining per register through an enable mask (set through a control register at address 0x7C3). Chained registers behave like FIFOs, so several results can be in flight without increasing architectural register pressure. The approach is implemented on a Snitch RISC-V in-order core in a 12LP+ FinFET process, with negligible area overhead and only a minor frequency impact. Experiments on stencil workloads report FPU utilization above 93%, a speedup of about 4%, and energy-efficiency gains of about 10% on average over highly optimized baselines, demonstrating practical viability for energy-efficient programmable accelerators.

Abstract

Modern general-purpose accelerators integrate a large number of programmable area- and energy-efficient processing elements (PEs) to deliver high performance while meeting stringent power delivery and thermal dissipation constraints. In this context, PEs are often implemented by scalar in-order cores, which are highly sensitive to pipeline stalls. Traditional software techniques, such as loop unrolling, mitigate the issue at the cost of increased register pressure, limiting flexibility. We propose scalar chaining, a novel hardware-software solution, to address this issue without incurring the drawbacks of traditional software-only techniques. We demonstrate our solution on register-limited stencil codes, achieving >93% FPU utilization, a 4% speedup, and 10% higher energy efficiency, on average, over highly optimized baselines. Our implementation is fully open source and performance experiments are reproducible using free software.
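
The FIFO-like behavior summarized in the TL;DR can be illustrated with a small functional model. This is only a sketch under stated assumptions, not the authors' implementation: the class `ToyRegFile`, its methods, and the choice of register 3 are hypothetical names introduced here, and the model ignores timing, the bounded depth of the pipeline registers, and stalls on an empty or full chain.

```python
from collections import deque


class ToyRegFile:
    """Toy register file with a per-register chaining mask (hypothetical model)."""

    def __init__(self, num_regs=32):
        self.values = [0.0] * num_regs                   # ordinary register state
        self.fifos = [deque() for _ in range(num_regs)]  # stand-in for pipeline registers
        self.chain_mask = 0                              # bit i set => register i is chained

    def set_chain_mask(self, mask):
        # Assumption: models writing the enable mask to the control register.
        self.chain_mask = mask

    def _chained(self, reg):
        return bool((self.chain_mask >> reg) & 1)

    def write(self, reg, value):
        if self._chained(reg):
            self.fifos[reg].append(value)  # push: earlier results are not overwritten
        else:
            self.values[reg] = value       # normal register: last write wins

    def read(self, reg):
        if self._chained(reg):
            return self.fifos[reg].popleft()  # pop: oldest result is consumed first
        return self.values[reg]


def demo():
    rf = ToyRegFile()
    rf.set_chain_mask(1 << 3)  # enable chaining on register 3 only
    for x in (1.0, 2.0, 3.0, 4.0):
        rf.write(3, x * 10.0)  # four producers share one register name
    chained = [rf.read(3) + 1.0 for _ in range(4)]  # four consumers, in order

    rf.write(4, 5.0)
    rf.write(4, 6.0)           # register 4 is not chained, so 5.0 is lost
    plain = rf.read(4)
    return chained, plain
```

In this model the four producers reuse a single register name where an unrolled loop would need four distinct temporaries, which is the register-pressure saving the abstract refers to.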

Paper Structure

This paper contains 4 sections, 4 figures.

Figures (4)

  • Figure 1:
  • Figure 3:
  • Figure 5: Block diagram of Snitch's subsystem, illustrating the dataflow associated with the trace in the chaining figure (fig:chaining). Elements in different colors represent snapshots at different moments in time, specifically at the correspondingly colored issue slots in that figure. Numbered tokens represent the outputs of the instructions at the correspondingly numbered issue slots in that figure.
  • Figure 6: FPU utilization (left) and power consumption [mW] (right) for all code variants.