Nexus Machine: An Active Message Inspired Reconfigurable Architecture for Irregular Workloads

Rohan Juneja, Pranav Dangi, Thilini Kaushalya Bandara, Tulika Mitra, Li-shiuan Peh

TL;DR

Nexus Machine targets irregular workloads on resource-constrained edge devices with an Active Message (AM)–inspired reconfigurable architecture that performs data-driven execution and en-route computation. It combines coarse-grained tensor partitioning, data-local execution, and in-network computing with a flexible AM format and dynamic routing, supported by a compiler and runtime stack. Reported results show up to 90% higher performance and 70% higher fabric utilization than state-of-the-art baselines; a 22 nm implementation achieves a 1.9x improvement over a generic CGRA and a 1.7x gain in fabric utilization, along with an average 1.35x performance boost over prior art. The work shows strong potential for energy-efficient, scalable execution of irregular workloads on edge CGRAs, with robust performance across sparse, dense, and graph workloads.

Abstract

Modern reconfigurable architectures are increasingly favored for resource-constrained edge devices as they balance high performance, energy efficiency, and programmability well. However, their proficiency in handling regular compute patterns constrains their effectiveness in executing irregular workloads, such as sparse linear algebra and graph analytics with unpredictable access patterns and control flow. To address this limitation, we introduce the Nexus Machine, a novel reconfigurable architecture consisting of a PE array designed to efficiently handle irregularity by distributing sparse tensors across the fabric and employing active messages that morph instructions based on dynamic control flow. As the inherent irregularity in workloads can lead to high load imbalance among different Processing Elements (PEs), Nexus Machine deploys and executes instructions en-route on idle PEs at run-time. Thus, unlike traditional reconfigurable architectures with only static instructions within each PE, Nexus Machine brings dynamic control to the idle compute units, mitigating load imbalance and enhancing overall performance. Our experiments demonstrate that Nexus Machine achieves 90% better performance than state-of-the-art (SOTA) reconfigurable architectures within the same power budget and area. Nexus Machine also achieves 70% higher fabric utilization than SOTA architectures.
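
The en-route execution idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's design: the `ActiveMessage` fields, the 1D chain of PEs, the `busy` flags, and the `route` function are all hypothetical names chosen for illustration, and it assumes the message already carries both operands. A message travels hop by hop toward its destination PE, and the first idle PE on the way executes its operation; if no intermediate PE is idle, the destination executes it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical operation table; the real AM instruction set is not given here.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


@dataclass
class ActiveMessage:
    # All fields are illustrative assumptions, not the paper's AM format.
    op: str                         # operation name, a key of OPS
    a: float                        # first operand carried by the message
    b: float                        # second operand carried by the message
    dest: int                       # index of the destination PE
    result: Optional[float] = None  # set once some PE executes the op


def route(msg, src, busy):
    """Forward msg from PE `src` to PE `msg.dest` along a 1D chain of PEs.

    busy[i] is True when PE i's ALU is occupied. The first idle
    intermediate PE executes the operation en route; otherwise the
    destination PE executes it. Returns the index of the executing PE.
    """
    executed_at = None
    step = 1 if msg.dest >= src else -1
    pe = src
    while pe != msg.dest:
        pe += step
        if msg.result is None and pe != msg.dest and not busy[pe]:
            msg.result = OPS[msg.op](msg.a, msg.b)
            executed_at = pe
    if msg.result is None:
        # No idle PE was found on the path: execute at the destination.
        msg.result = OPS[msg.op](msg.a, msg.b)
        executed_at = msg.dest
    return executed_at


# PE 2 is idle, so it executes the multiply before the message reaches PE 3.
msg = ActiveMessage(op="mul", a=3.0, b=4.0, dest=3)
executed_at = route(msg, src=0, busy=[True, True, False, True])
```

In this toy run `executed_at` is 2 and `msg.result` is 12.0; with every PE busy, the same message would execute at PE 3 instead. The model ignores stalls, memory accesses, and 2D routing, which the actual architecture has to handle.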

Paper Structure

This paper contains 31 sections, 17 figures, 2 tables, 1 algorithm.

Figures (17)

  • Figure 1: Spatial vs Spatio-temporal
  • Figure 2: Active Message (AM) communication mechanism: An AM originating from PE A is launched at destination PE B, where it is executed on the ALU, and interacts with the data and configuration memory. Additional AMs may be generated in response, if necessary.
  • Figure 3: Program execution comparison for the SpMV kernel, illustrating two consecutive iterations with a bank conflict. (a) Generic CGRA: data flows through statically placed instructions (top); bank conflicts across banks for a real workload with n=2048 on a 4x4 PE array (bottom). (b) Triggered Instructions: data-local execution with messages, invoking tasks at the location of the data and reducing data movement (top); load imbalance across the PE array (bottom). (c) Nexus Machine: improves performance and PE utilization through opportunistic execution, using idle ALUs for en-route instruction execution (top); uniform load balance across the PE array (bottom). Red arrows represent data movement; blue arrows depict message transfers.
  • Figure 4: Sparse Matrix-Vector Multiplication (SpMV).
  • Figure 5: Execution of SpMV using the data in Fig. \ref{fig:motivation} on a fabric with 2 PEs. It illustrates the placement of matrix, vector, and output partitions, along with AM generation. [] denotes element address, and red arrows represent control signals.
  • ...and 12 more figures
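
Several of the figures above use Sparse Matrix-Vector Multiplication (SpMV) as the running kernel. For readers unfamiliar with it, here is a minimal reference sketch in plain Python, assuming the common CSR (compressed sparse row) storage format. The storage format, the function name `spmv_csr`, and the 3x3 example values are assumptions for illustration and are not taken from the paper.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[r]..row_ptr[r+1] delimits the nonzeros of row r;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect, data-dependent read of x: the access pattern
            # depends on where the nonzeros happen to fall.
            y[row] += values[k] * x[col_idx[k]]
    return y


# Illustrative 3x3 matrix (made-up values):
# [[2, 0, 1],
#  [0, 0, 3],
#  [4, 5, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)

# Nonzeros per row: uneven counts are one source of load imbalance
# when rows are distributed across PEs.
nnz_per_row = [row_ptr[r + 1] - row_ptr[r] for r in range(len(row_ptr) - 1)]
```

Here `y` evaluates to `[5.0, 9.0, 14.0]` and `nnz_per_row` to `[2, 1, 2]`. The inner loop's reads of `x[col_idx[k]]` are the kind of unpredictable accesses the abstract refers to.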