DR-CGRA: Supporting Loop-Carried Dependencies in CGRAs Without Spilling Intermediate Values

Elad Hadar, Yoav Etsion

TL;DR

This paper addresses the problem of accelerating tight loops that have loop-carried data dependencies on coarse-grain reconfigurable architectures (CGRAs). It introduces DR-CGRA, a massively-multithreaded CGRA in which each loop iteration runs as a separate thread. Instead of spilling dependent values out of the array and feeding them back, DR-CGRA passes them between threads inside the CGRA fabric, using backward edges and a novel Inter-loop Dependency Resolution Unit (ILDR) that updates thread IDs so that a dependent value moves from one iteration to the next in one cycle. Evaluated on SPEC CPU 2017 benchmarks, DR-CGRA achieves average loop-acceleration speedups ranging from 2.1 to 4.5 (3.1 overall) over a state-of-the-art CGRA; performance improves as the thread-group size grows and when memory accesses do not dominate the dependency path. These results indicate that DR-CGRA can substantially boost performance-per-watt for loop-centric workloads without spilling intermediate values, enabling more aggressive loop parallelism in CGRAs.

Abstract

Coarse-grain reconfigurable architectures (CGRAs) are gaining traction thanks to their performance and power efficiency. Using CGRAs to accelerate tight loops holds great potential for significant overall performance gains, because a substantial portion of program execution time is spent in tight loops. However, loop parallelization on CGRAs is challenging because of loop-carried data dependencies. Traditionally, loop-carried dependencies are handled by spilling dependent values out of the reconfigurable array to a memory medium and then feeding them back into the grid. This round trip imposes additional latency and logic that impede performance and limit parallelism. In this paper, we present the Dependency Resolved CGRA (DR-CGRA) architecture, which is designed to accelerate the execution of tight loops. DR-CGRA, which is based on a massively-multithreaded CGRA, runs each iteration as a separate CGRA thread and maps loop-carried data dependencies to inter-thread communication inside the grid. This design passes data-dependent values across loop iterations without spilling them out of the grid. The proposed DR-CGRA architecture was evaluated on various SPEC CPU 2017 benchmarks. The results demonstrate significant performance improvements, with average speedups ranging from 2.1 to 4.5 and an overall average of 3.1 compared to a state-of-the-art CGRA architecture.
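
To make the iteration-as-thread idea concrete, the following is a minimal software sketch, not the paper's hardware design. It assumes a running-sum loop (`acc = acc + a[i]`) and models the backward edge as a queue of `(thread_id, value)` tokens. The function name `run_loop_as_threads` and the retagging step (`tid + 1`) are hypothetical illustrations of how a thread-ID update could hand a dependent value to the next iteration without leaving the grid.

```python
from collections import deque


def run_loop_as_threads(a, init=0):
    """Toy model: iteration t is thread t; the carried value travels as a tagged token."""
    # Backward edge modeled as a queue of (thread_id, value) tokens.
    # Thread 0 starts with the initial value of the carried variable.
    backward_edge = deque([(0, init)])
    results = []
    for tid, operand in enumerate(a):
        token_tid, carried = backward_edge.popleft()
        # A token can only be consumed by the thread it is tagged for.
        assert token_tid == tid
        # Loop body: acc = acc + a[i]
        new_value = carried + operand
        results.append(new_value)
        # Hypothetical retagging step: re-label the value for the next
        # iteration's thread instead of spilling it to memory.
        backward_edge.append((tid + 1, new_value))
    return results


if __name__ == "__main__":
    print(run_loop_as_threads([1, 2, 3, 4]))
```

The sketch is sequential and says nothing about timing; it only shows that retagging a value with the next thread's ID is enough to carry the dependency from one iteration to the next.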

Paper Structure (14 sections, 13 figures, 1 table)

Figures (13)

  • Figure 1: Simple, single-path dependency. A loop-carried data dependency in which only units on the dependency path use the data-dependent variable, with no memory access
  • Figure 2: Dependency with diverging paths. A loop-carried data dependency in which other units in the grid consume the data-dependent variable after it is updated
  • Figure 3: Dependency with diverging paths. A loop-carried data dependency in which other units in the grid consume the data-dependent variable before it is updated
  • Figure 4: Dependency with background memory access. A loop-carried data dependency with a memory access that is not on the dependent value
  • Figure 5: Consecutive dependency. A loop-carried data dependency with consecutive loop-carried data dependencies between units. The variable X1 is updated several times in each loop iteration
  • ...and 8 more figures
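
The five dependency shapes captioned above can be pictured as small loops. The functions below are hypothetical illustrations written for this summary, not code from the paper or its benchmarks; the arithmetic (`3 * x + 1`, `x + 5`, and so on) is arbitrary and only the position of the reads and writes of the carried variable `x` matters.

```python
def single_path(n, x=1):
    """Figure 1 shape: only the dependency path touches x; no memory access."""
    for _ in range(n):
        x = 3 * x + 1
    return x


def consume_after_update(n, x=1):
    """Figure 2 shape: another consumer reads x after it is updated."""
    out = []
    for _ in range(n):
        x = 3 * x + 1
        out.append(x + 5)  # side consumer sees the new value
    return out


def consume_before_update(n, x=1):
    """Figure 3 shape: another consumer reads x before it is updated."""
    out = []
    for _ in range(n):
        out.append(x + 5)  # side consumer sees the old value
        x = 3 * x + 1
    return out


def background_memory(b, x=1):
    """Figure 4 shape: a memory access exists but does not feed x."""
    out = []
    for v in b:
        out.append(v * 2)  # load from b is independent of x
        x = 3 * x + 1
    return x, out


def consecutive(n, x=2):
    """Figure 5 shape: x is updated several times per iteration."""
    for _ in range(n):
        x = x + 1
        x = x * 2
        x = x - 3
    return x
```

In every function the value of `x` at the end of one iteration is the input to the next, which is the loop-carried dependency; the variants differ only in who else reads `x` and whether memory is involved.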