
NEURA: A Unified and Retargetable Compilation Framework for Coarse-Grained Reconfigurable Architectures

Shangkun Li, Jinming Ge, Diyuan Tao, Zeyu Li, Jiawei Liang, Linfeng Du, Jiang Xu, Wei Zhang, Cheng Tan

Abstract

Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising and versatile accelerator platform, offering a balance between the performance and efficiency of specialized accelerators and the programmability of software. However, their full potential is severely hindered by control flow in accelerated kernels, as control flow (e.g., loops, branches) is fundamentally incompatible with the parallel, data-driven CGRA fabric. Prior strategies to resolve this mismatch in CGRA kernel acceleration are either inefficient, sacrificing performance for generality, or lack generality because they are difficult to adapt across different execution models. Thus, a general and unified solution for efficient CGRA kernel acceleration remains elusive. This paper introduces NEURA, a unified and retargetable compilation framework that systematically resolves the control-dataflow mismatch in CGRAs. NEURA's core innovation is a novel, pure dataflow intermediate representation (IR) built on a predicated type system. In this IR, the control context is embedded as a predicate within each data value, making control an intrinsic property of data. This mechanism enables NEURA to systematically flatten complex control flow into a single unified dataflow graph. This unified representation decouples the kernel representation from the hardware, empowering NEURA to retarget diverse CGRAs with different execution models and microarchitectural features. When targeted to a high-performance spatio-temporal CGRA, NEURA delivers a 2.20x speedup on kernel benchmarks and up to 2.71x geometric mean speedup on real-world applications over state-of-the-art (SOTA) high-performance baselines. It also provides a competitive solution against the SOTA low-power CGRA when retargeted to a spatial-only CGRA. NEURA is open-source and available at https://github.com/coredac/neura.
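The predicated-data idea can be illustrated with a toy sketch. This is not NEURA's actual IR or API; the `PredValue`, `pred_add`, and `merge` names are hypothetical, chosen only to show how attaching a predicate to each value lets a branch become plain dataflow ops whose results reconverge by predicate, with no control edges.

```python
# Toy model of predicated data: every value carries its control context as a
# predicate, so both sides of a branch execute as ordinary dataflow ops and a
# merge selects the value whose predicate is true.
from dataclasses import dataclass

@dataclass
class PredValue:
    data: int
    pred: bool  # control context carried alongside the data

def pred_add(a: PredValue, b: PredValue) -> PredValue:
    # Ops fire unconditionally; the predicate tags whether the result is live.
    return PredValue(a.data + b.data, a.pred and b.pred)

def merge(t: PredValue, f: PredValue) -> PredValue:
    # Branch reconvergence: keep whichever value has a true predicate.
    return t if t.pred else f

def kernel(x: int, cond: bool) -> int:
    # Flattened form of: y = x + 1 if cond else x + 2
    then_val = pred_add(PredValue(x, cond), PredValue(1, cond))
    else_val = pred_add(PredValue(x, not cond), PredValue(2, not cond))
    return merge(then_val, else_val).data
```

Because both arms are ordinary dataflow nodes, they can be placed and executed in parallel on the fabric, and the branch disappears from the graph entirely.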

Paper Structure

This paper contains 33 sections, 1 theorem, 19 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 1

Given an input function $F$ represented in NEURA CDFG IR containing no semantically irrelevant (i.e., dead) code, the canonicalize-live-in pass transforms $F$ into an equivalent representation $F'$. This transformation guarantees that for every control flow edge $e=(B_{pred}, B_{succ})$ from a predecessor …

Figures (19)

  • Figure 1: CGRA Architecture --- (a) The CGRA is invoked via the accelerator command interface by the host processor. Control signals generated by the host processor and data required for computation are loaded through the Direct Memory Access (DMA) Unit from shared memory into the CGRA's SRAM and each tile's control memory unit. (b) CGRA tile components.
  • Figure 2: Kernel Representation and Execution Models --- (a) A synthetic kernel and the CFG of this kernel. We show the source code instead of the assembly code in the CFG for simplicity. (b) The DFG of the loop body. (c) The 6-node DFG requires a larger $3\times3$ array to execute in a spatial-only CGRA. The spatio-temporal CGRA can execute the same DFG on a smaller $2\times2$ array by time-multiplexing the tiles.
  • Figure 3: Motivating Example --- (a) A synthetic kernel with imperfect nested loop and branch divergence and its CDFG. (b) The CDFG strategy serializes the execution of each BB via an external controller. (c) The steering control strategy flattens the control flow in the CFG for spatial-only execution. (d) Limited predication transforms the branch divergence into a single BB but fails to resolve the nested control flow with loops. (e) NEURA represents the kernel as a unified DFG, exploiting both intra- and inter-BB parallelism.
  • Figure 4: Overview of the NEURA Compilation Flow --- The frontend accepts C/C++ and high-level IR kernels and lowers them into the NEURA CDFG IR. The IR builder then converts the kernel into the NEURA Dataflow IR through preprocessing, data predication, and flattening. The optimizer refines the dataflow IR using HW-Agnostic (e.g., constant folding) and HW-Specific optimizations (e.g., loop streaming). The optimized IR can be validated by the interpreter and processed by the mapper to get mapping results for configuration bitstream generation or performance simulation.
  • Figure 5: A Kernel Example in NEURA --- (a) The input simple accumulation kernel represented in llvm and arith dialects. (b) The corresponding NEURA CDFG IR for further transformations. (c) The NEURA Dataflow IR leverages our predicated type system to represent the kernel in a pure dataflow manner. The blue blocks show the corresponding code between the input IR and the NEURA CDFG IR, while the yellow blocks show the corresponding control-flow logic in CDFG IR and its dataflow representation.
  • ...and 14 more figures
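The tile arithmetic behind Figure 2(c) can be checked with a small sketch. This is an illustrative back-of-the-envelope model, not NEURA's mapper: it assumes one op per tile in spatial-only mode and a simple ops-per-tiles lower bound for the time-multiplexed case.

```python
import math

def tiles_needed_spatial(num_ops: int) -> int:
    # Spatial-only: every op needs its own tile, so the (square) array must
    # be large enough to hold all ops at once.
    side = math.ceil(math.sqrt(num_ops))
    return side * side

def cycles_per_iter_temporal(num_ops: int, rows: int, cols: int) -> int:
    # Spatio-temporal: tiles are time-multiplexed across cycles, so a lower
    # bound on cycles per iteration is ops divided by available tiles.
    return math.ceil(num_ops / (rows * cols))

# The 6-node DFG from Figure 2 needs a 3x3 array in spatial-only mode,
# but fits on a 2x2 array at the cost of at least 2 cycles per iteration.
print(tiles_needed_spatial(6))          # 9
print(cycles_per_iter_temporal(6, 2, 2))  # 2
```

This captures the trade-off the figure illustrates: spatial-only execution buys throughput with area, while time-multiplexing trades cycles for a smaller array.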

Theorems & Definitions (1)

  • Theorem 1