Evaluation of CGRA Toolchains
Dominik Walter, Marita Halm, Daniel Seidel, Indrayudh Ghosh, Christian Heidorn, Frank Hannig, Jürgen Teich
TL;DR
The paper evaluates four publicly available CGRA toolchains (CGRA-Flow, Morpher, Pillars, CGRA-ME) by mapping loop-based data flows onto CGRAs and comparing the resulting latency across five PolyBench benchmarks. It analyzes how data-flow graphs (DFGs) are generated, mapped, and scheduled, highlighting how the initiation interval ($II$), its recurrence- and resource-constrained lower bounds ($RecMII$, $ResMII$), and hardware routing limitations lead to underutilization of the processing elements (PEs). HyCUBE’s multi-hop interconnect generally improves performance over classic CGRAs, and Morpher delivers robust mappings in most cases, whereas Pillars often fails due to DFG and mapping limitations. The study reveals significant headroom in current mappings and identifies routing and data-flow handling as key bottlenecks, suggesting directions toward broader design-space exploration and processor-array alternatives.
Abstract
Increasing demands for computing power propel the need for energy-efficient SoC accelerator architectures. One class of such accelerators is the so-called processor array, which typically integrates a two-dimensional mesh of interconnected processing elements (PEs). Such arrays are specifically designed to accelerate the execution of multidimensional nested loops by exploiting the intrinsic parallelism of such loops. Coarse-grained reconfigurable arrays (CGRAs) belong to this class of accelerator architectures. In this work, we analyze four toolchains for mapping loop programs onto CGRAs and compare the resulting mappings with respect to performance, i.e., latency. While most toolchains succeed on simpler kernels such as general matrix multiplication, some struggle to find valid mappings for more complex loops such as a triangular solver. Furthermore, we observe that the considered CGRA mappers generally tend to underutilize the available PEs.
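
To make the link between the initiation interval and PE underutilization concrete, the following minimal sketch computes the two standard modulo-scheduling lower bounds on $II$ and the PE utilization they imply. It assumes the textbook definitions ($ResMII$ as the operation count divided by the PE count, rounded up, and $RecMII$ as the maximum over dependence cycles of latency divided by iteration distance, rounded up). The function names and the example numbers are hypothetical and are not taken from the evaluated toolchains or from the paper's results.

```python
from math import ceil

def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained lower bound: every PE issues at most one op per cycle."""
    return ceil(num_ops / num_pes)

def rec_mii(cycles) -> int:
    """Recurrence-constrained lower bound.

    `cycles` holds one (latency, distance) pair per loop-carried dependence
    cycle in the DFG; a loop without recurrences yields the trivial bound 1.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)

def pe_utilization(num_ops: int, num_pes: int, ii: int) -> float:
    """Fraction of the PE issue slots in one II window that hold an operation."""
    return num_ops / (num_pes * ii)

# Hypothetical example: a 12-operation DFG on a 4x4 array (16 PEs) with one
# recurrence of total latency 3 that spans a single iteration.
num_ops, num_pes = 12, 16
cycles = [(3, 1)]

mii = max(res_mii(num_ops, num_pes), rec_mii(cycles))
util = pe_utilization(num_ops, num_pes, mii)
print(f"MII = {mii}, PE utilization at MII = {util:.2f}")
```

In this made-up case the recurrence, not the PE count, dictates the bound ($II \geq 3$), so at most a quarter of the issue slots can be filled even by a perfect mapper. Routing limitations can push the achieved $II$ above this bound and lower utilization further.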
