Loop Control Management in Tightly Coupled Processor Arrays (TCPAs)

Dominik Walter, Frank Hannig, Jürgen Teich

Abstract

Multidimensional loop kernels often suffer from control overhead that can dominate execution time on parallel loop accelerators. Tightly Coupled Processor Arrays (TCPAs) offload loop control to a global controller (GC), but existing approaches still require hundreds of control signals. We propose a method to derive these control conditions from a polyhedral representation of the iteration space and to reduce them aggressively, achieving reductions of 15x to 45x in the number of control signals across several benchmarks. We introduce a lightweight GC architecture that evaluates conditions as unions of polyhedra using bounded evaluation units and requires hardware comparable to a single processing element. Control signals are distributed throughout the array with a minimal number of delay elements, resulting in zero-overhead loop control. Our evaluation on PolyBench kernels shows that the entire control flow requires less than 10% of the total array resources.

Paper Structure

This paper contains 16 sections, 13 equations, 8 figures, 1 table, 2 algorithms.

Figures (8)

  • Figure 1: Architecture of an $8 \times 8$ TCPA. The array of PEs is surrounded by 4 I/O buffers accessed by dedicated Address Generators (AGs) and a Loop I/O Controller (LION). All loop control is handled by the Global Controller (GC).
  • Figure 2: Example of 3 control graphs. Each node represents a program block (sequence of instructions) and each edge denotes a branch between such program blocks. The annotated conditions indicate when the corresponding branch must be taken.
  • Figure 3: A two-dimensional iteration space with different control conditions (red, blue, green) is shown for 6 different cases. The solid rectangles denote the one domain, and the crosshatched rectangle denotes the corresponding zero domain. In (a)-(d), both control conditions can be expressed by a single control signal, while in (e), the two control conditions conflict. Although the three control conditions in (f) are pairwise compatible, they cannot be unified into a single control condition.
  • Figure 4: Architecture of the Global Controller (GC). It consists of an Iteration Space Scanner (left), which feeds into a number of evaluators (center), whose outputs are combined into conjunctions and disjunctions (right).
  • Figure 5: Control network in a $4 \times 4$ TCPA. Control signals are generated in the GC and propagated to all PEs. Delay elements within each PE shift the incoming signals in time so that they arrive according to the loop schedule.
  • ...and 3 more figures
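
The GC structure described in Figure 4 (a scanner feeding evaluators whose outputs are combined into conjunctions and disjunctions) can be illustrated in software. The following is a minimal sketch, not the authors' implementation: it assumes each evaluator checks one affine inequality of the form `coeffs · i >= bound`, that a polyhedron is the conjunction of such inequalities, and that a control condition is the disjunction over several polyhedra. All function names and the example condition are hypothetical.

```python
from itertools import product

def in_halfspace(point, coeffs, bound):
    """One 'evaluator' (assumed form): checks coeffs . point >= bound."""
    return sum(c * x for c, x in zip(coeffs, point)) >= bound

def in_polyhedron(point, halfspaces):
    """Conjunction of evaluator outputs: the point lies in every half-space."""
    return all(in_halfspace(point, c, b) for c, b in halfspaces)

def in_union(point, polyhedra):
    """Disjunction over polyhedra: the condition holds in at least one of them."""
    return any(in_polyhedron(point, p) for p in polyhedra)

def scan(extents):
    """Software stand-in for an iteration space scanner over a rectangular
    space, enumerating points in lexicographic order."""
    return product(*(range(n) for n in extents))

# Hypothetical condition on a 4 x 3 iteration space (i0, i1):
# "first row (i0 <= 0) or last column (i1 >= 2)".
first_row = [((-1, 0), 0)]   # -i0 >= 0, i.e. i0 <= 0
last_col = [((0, 1), 2)]     #  i1 >= 2
condition = [first_row, last_col]

# One control-signal value per iteration, in scan order.
signal = [in_union(p, condition) for p in scan((4, 3))]
```

In this sketch the signal is recomputed from the iteration vector at every point; the hardware described in the paper bounds the number of evaluation units, which the sketch does not attempt to model.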