Instruction Scheduling in the Saturn Vector Unit

Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic

TL;DR

Saturn addresses the inefficiency of long-vector designs in mobile and edge contexts by delivering a fully RVV 1.0-compliant short-vector unit with fine-grained vector chaining and decoupled memory paths. The core methodology combines explicit per-element-group hazard tracking, a compact backend with limited out-of-order (OoO) sequencing, and a decoupled load-store path to enable run-ahead memory accesses and high datapath utilization for short vectors. Key contributions include the Saturn RTL implementation (in Chisel), a comprehensive area/power/performance evaluation, and a detailed analysis of design parameters such as chime length, issue queue depth, and memory latency. The results show Saturn achieving competitive power and area while delivering high utilization across diverse workloads, illustrating the practicality of compact, scalable vector units for mobile and embedded applications.

Abstract

While the challenges and solutions for efficient execution of scalable vector ISAs on long-vector-length microarchitectures have been well established, not all of these solutions are suitable for short-vector-length implementations. This work proposes a novel microarchitecture for instruction sequencing in vector units with short architectural vector lengths. The proposed microarchitecture supports fine-granularity chaining, multi-issue out-of-order execution, zero dead-time, and run-ahead memory accesses with low area or complexity costs. We present the Saturn Vector Unit, an RTL implementation of an RVV vector unit. With our instruction scheduling mechanism, Saturn exhibits comparable or superior power, performance, and area characteristics compared to state-of-the-art long-vector and short-vector implementations.
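
To make the benefit of fine-granularity chaining concrete, the following is a minimal toy model in Python, not Saturn's actual scheduler. It assumes that a vector instruction is processed as a sequence of element groups, one per cycle, and that the chime length equals (VLEN / DLEN) × LMUL. The function names, the fixed producer latency, and the one-group-per-cycle issue rule are all illustrative assumptions.

```python
def chime_length(vlen_bits, dlen_bits, lmul=1):
    """Number of element groups per instruction (assumed: VLEN/DLEN * LMUL)."""
    return (vlen_bits // dlen_bits) * lmul


def last_issue_cycle(chime, latency, chained):
    """Cycle at which a dependent consumer issues its final element group.

    Toy assumptions: the producer issues element group g at cycle g and its
    result becomes readable at cycle g + latency; the consumer issues at most
    one element group per cycle, in order.
    """
    ready = [g + latency for g in range(chime)]
    t = -1
    for g in range(chime):
        # With chaining, group g only waits for the producer's group g.
        # Without chaining, every group waits for the producer's last group.
        earliest = ready[g] if chained else max(ready)
        t = max(t + 1, earliest)
    return t


chime = chime_length(vlen_bits=512, dlen_bits=128)   # 4 element groups
print(last_issue_cycle(chime, latency=2, chained=True))   # 5
print(last_issue_cycle(chime, latency=2, chained=False))  # 8
```

In this model the unchained consumer pays roughly one extra chime of delay per dependency, a cost that is proportionally larger when the chime is short, which is the regime the abstract targets.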

Paper Structure

This paper contains 34 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of the Saturn short-vector microarchitecture (gray) and its intended integration into an in-order core. The gradated regions indicate components relevant to the scheduling mechanism.
  • Figure 2: Pipeline stages of Saturn when integrated into a host in-order RISC-V core. Hatched lines indicate queues between stages.
  • Figure 3: The vector load and store paths handle variable-chime and long-latency memory operations with minimal storage requirements. The load path depicts a long-latency load in the Agen unit running ahead of a long-chime load in the Merge and SegBuf units. The store path depicts a sequence of short-chime stores in the SegBuf, Merge, and Agen units.
  • Figure 4: The backend organization for a configuration with two arithmetic sequencers, separate load/store sequencers, 4-entry instruction queues, and a 4x3R1W vector register file. Gradated regions indicate the out-of-order execution window.
  • Figure 5: A comparison of instruction cracking vs sequencing for a block of vector instructions A/B/C/D. The cracking approach will stall dispatch without deep issue queues.
  • ...and 8 more figures