Mapping and Execution of Nested Loops on Processor Arrays: CGRAs vs. TCPAs

Dominik Walter, Marita Halm, Daniel Seidel, Indrayudh Ghosh, Christian Heidorn, Frank Hannig, Jürgen Teich

TL;DR

The paper compares two processor-array paradigms for accelerating multidimensional nested loops: operation-centric CGRAs and iteration-centric TCPAs. It analyzes architectures, mapping approaches, and toolchains (CGRA-Flow, Morpher, Pillars, and CGRA-ME for CGRAs; TURTLE for TCPAs), and evaluates both qualitatively and quantitatively on power, performance, and area (PPA) metrics. The findings show that TCPAs generally deliver substantial latency reductions and higher data locality through tile-based iteration mapping, albeit at higher hardware complexity and with specific data-massage requirements; CGRAs offer simpler, more intuitive programming models but face scalability and mapping-complexity limits. The work highlights the trade-offs between programming ease, hardware cost, and performance, and suggests that future designs may blend iteration- and operation-centric ideas to capitalize on both data locality and flexible mapping.

Abstract

Increasing demands for computing power also propel the need for energy-efficient SoC accelerator architectures. One class of such accelerators is the so-called processor array, which typically integrates a two-dimensional mesh of interconnected processing elements (PEs). Such arrays are specifically designed to accelerate the execution of multidimensional nested loops by exploiting the intrinsic parallelism of loops. For mapping a given loop nest application, two opposed mapping methods have emerged: operation-centric and iteration-centric. The two differ in the granularity of the mapping. The operation-centric approach maps individual operations to the PEs of the array, while the iteration-centric approach maps entire tiles of iterations to each PE. The operation-centric approach is applied predominantly to processor arrays often referred to as Coarse-Grained Reconfigurable Arrays (CGRAs), while processor arrays supporting an iteration-centric approach are referred to as Tightly-Coupled Processor Arrays (TCPAs) in the following. This work provides a comprehensive comparison of both approaches and related architectures by evaluating their respective benefits and trade-offs. ...
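
The difference in mapping granularity described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper or from any of the toolchains: the operation names in `LOOP_BODY_OPS`, the naive row-major placement, and the helper names `map_operation_centric` and `map_iteration_centric` are all hypothetical, and real mappers additionally handle scheduling and routing.

```python
from itertools import product

# Hypothetical loop body of a matrix multiplication, split into operations.
LOOP_BODY_OPS = ["load_a", "load_b", "mul", "add", "store_c"]


def map_operation_centric(ops, rows, cols):
    """Place each operation on its own PE (naive row-major placement).

    Only the granularity is shown: the unit that gets a PE is one operation.
    """
    pes = list(product(range(rows), range(cols)))
    if len(ops) > len(pes):
        raise ValueError("more operations than PEs in this naive placement")
    return dict(zip(ops, pes))


def map_iteration_centric(n, rows, cols):
    """Assign every iteration (i0, i1, i2) to the PE that owns its tile.

    Assumes n is divisible by rows and cols, and that i2 is not partitioned.
    """
    t0, t1 = n // rows, n // cols  # tile extents along i0 and i1
    return {
        (i0, i1, i2): (i0 // t0, i1 // t1)
        for i0, i1, i2 in product(range(n), repeat=3)
    }


# Operation-centric: five operations spread over a 4x4 array.
op_map = map_operation_centric(LOOP_BODY_OPS, 4, 4)

# Iteration-centric: a 4x4x4 iteration space spread over a 2x2 array.
it_map = map_iteration_centric(4, 2, 2)
```

In the first mapping each PE executes one operation for all iterations; in the second each PE executes the whole loop body for the iterations of its tile.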

Paper Structure

This paper contains 40 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Operation-centric mapping approach of Coarse-Grained Reconfigurable Arrays. On the left, a simplified data flow graph (DFG) of a matrix multiplication is shown. The nodes representing operations are grouped into index computation (blue), address computation (brown), memory access (purple), and multiply and accumulate operations (red). Edges denote the data dependencies. This DFG is mapped onto the $4 \times 4$ CGRA architecture shown on the right. Each PE contains a functional unit (FU), e.g., an ALU, a local register file, switches, and a configuration memory; adapted from [2_HyCUBE].
  • Figure 2: Architecture of an $8 \times 8$ TCPA (left) and the OIP-based [BrandTeich2017IEEEMCSOC] PE architecture (right) from [alpaca]. The array is surrounded by 4 I/O buffers with address generators and has peripheral controllers, shown to the left of the array. Each PE has a data register file and a control register file and may have multiple functional units.
  • Figure 3: A matrix multiplication $\mathbf{C} = \mathbf{A} \cdot \mathbf{B}$ with $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{Z}^{N \times N}$ expressed as PRAs with iteration space $\mathcal{I} = \{ (i_0, i_1, i_2)^\intercal \in \mathbb{Z}^3 \mid 0 \leq i_0, i_1, i_2 < N \}$. The notations $x[\bm{i}]$ and $x[i_0,i_1,i_2]$ are equivalent.
  • Figure 4: Simplified $4 \times 4 \times 4$ iteration space of a matrix multiplication that is tiled into $2 \times 2 \times 1$ tiles and mapped onto a $2 \times 2$ PE array shown behind. Each gray circle denotes one iteration consisting of 4 operations as specified by the loop body. The edges denote data dependencies, and their color denotes the dependency type, i.e., input (red), intra-iteration (white), inter-iteration intra-tile (yellow), or inter-iteration inter-tile (green). Although each iteration contains the same operations, the types of the data dependencies of these operations differ, which is reflected by both the position and color of the operation.
  • Figure 5: Overview of the TURTLE toolchain for TCPAs.
  • ...and 3 more figures
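
The tiling and the dependency types named in the Figure 4 caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's formulation: it assumes that $2 \times 2 \times 1$ counts tiles per dimension (so each tile holds $2 \times 2 \times 4$ iterations and there is one tile per PE), and it assumes the dependence vectors in `DEPS`, which correspond to a commonly used localized form of matrix multiplication. The names `tile_of` and `classify` are hypothetical. Intra-iteration dependencies are not modeled.

```python
from collections import Counter
from itertools import product

N = 4                                # iteration space is N x N x N
GRID = (2, 2, 1)                     # assumed number of tiles per dimension
TILE = tuple(N // g for g in GRID)   # tile extents, here (2, 2, 4)

# Assumed dependence vectors: a is propagated along i1, b along i0,
# and c is accumulated along i2.
DEPS = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def tile_of(i):
    """Return the tile (and thus the PE) that owns iteration i."""
    return tuple(x // t for x, t in zip(i, TILE))


def classify(i, d):
    """Classify the dependency of iteration i on iteration i - d."""
    src = tuple(x - dx for x, dx in zip(i, d))
    if any(x < 0 for x in src):
        return "input"               # source lies outside the iteration space
    if tile_of(src) == tile_of(i):
        return "intra-tile"          # stays inside one PE
    return "inter-tile"              # crosses a tile (PE) boundary


# Count dependency types per variable over the whole iteration space.
counts = Counter(
    (var, classify(i, d))
    for i in product(range(N), repeat=3)
    for var, d in DEPS.items()
)

for key in sorted(counts):
    print(key, counts[key])
```

Because the third dimension is not partitioned under this assumption, the accumulation of `c` never crosses a tile boundary, which illustrates the data locality that tile-based iteration mapping aims for.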