Enhancing CGRA Efficiency Through Aligned Compute and Communication Provisioning

Zhaoying Li; Pranav Dangi; Chenyang Yin; Thilini Kaushalya Bandara; Rohan Juneja; Cheng Tan; Zhenyu Bai; Tulika Mitra

TL;DR

This work tackles the overprovisioning of communication resources in edge-scale CGRAs by aligning compute and communication through motif-based hierarchical execution. It introduces Plaid, a co-designed CGRA architecture and compiler that identifies recurrent three-node dataflow motifs, executes them within Plaid Collective Units (PCUs), and routes data over a hierarchical NoC. Key contributions include the identification of a minimal, reusable three-node motif, a PCU design with local collect routing, a motif-aware compiler with hierarchical mapping, and domain-specialized optimization pathways. Empirical results show that Plaid achieves up to 43% power reduction and 46% area reduction versus a baseline spatio-temporal CGRA, and delivers 1.4× performance and 48% area savings versus a baseline spatial CGRA, making it a practical option for energy-constrained edge accelerators.

Abstract

Coarse-grained Reconfigurable Arrays (CGRAs) are domain-agnostic accelerators that enhance the energy efficiency of resource-constrained edge devices. The CGRA landscape is diverse, exhibiting trade-offs between performance, efficiency, and architectural specialization. However, CGRAs often overprovision communication resources relative to their modest computing capabilities. This occurs because the theoretically provisioned programmability for CGRAs often proves superfluous in practical implementations. In this paper, we propose Plaid, a novel CGRA architecture and compiler that aligns compute and communication capabilities, thereby significantly improving energy and area efficiency while preserving generality and performance. We demonstrate that the dataflow graph representing the target application can be decomposed into smaller, recurring communication patterns called motifs. Our primary contribution is the identification of these structural motifs within dataflow graphs and the development of an efficient collective execution and routing strategy tailored to them. The Plaid architecture employs a novel collective processing unit that can execute multiple operations of a motif and route the related data dependencies together. The Plaid compiler hierarchically maps the dataflow graph and judiciously schedules the motifs. Our design achieves a 43% reduction in power consumption and 46% area savings compared to the baseline high-performance spatio-temporal CGRA, while preserving its generality and performance. Compared to the baseline energy-efficient spatial CGRA, Plaid offers a 1.4× performance improvement and 48% area savings at almost the same power.
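To make the idea of decomposing a dataflow graph into small recurring patterns concrete, here is a minimal Python sketch. It is not the Plaid compiler's algorithm: the greedy grouping, the function names `classify` and `find_motifs`, the shape labels ("fan-in", "fan-out", "chain"), and the example graph are all assumptions made for illustration. It simply groups the nodes of a small dataflow graph into connected three-node subgraphs and labels each one by its shape.

```python
def classify(nodes, edges):
    """Label a connected three-node subgraph by shape (labels are illustrative)."""
    inner = [(u, v) for u, v in edges if u in nodes and v in nodes]
    if len(inner) != 2:
        return "other"          # e.g. three internal edges
    (s0, d0), (s1, d1) = inner
    if d0 == d1:
        return "fan-in"         # two producers feed one consumer
    if s0 == s1:
        return "fan-out"        # one producer feeds two consumers
    return "chain"              # a -> b -> c


def find_motifs(edges):
    """Greedily cover a dataflow graph with connected three-node groups.

    Hypothetical helper: a simple greedy pass, not an optimal or
    paper-specified decomposition.
    """
    # Undirected adjacency, used only to test weak connectivity.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    covered, motifs = set(), []
    for n in sorted(adj):
        if n in covered:
            continue

        def free(x):
            # Uncovered neighbours of x, excluding the seed node n.
            return sorted(m for m in adj[x] if m not in covered and m != n)

        triple = None
        for a in free(n):
            # The third node may neighbour either the seed or the second node.
            for b in sorted(set(free(n)) | set(free(a))):
                if b != a:
                    triple = (n, a, b)
                    break
            if triple:
                break
        if triple:
            covered.update(triple)
            motifs.append((triple, classify(set(triple), edges)))

    leftovers = sorted(set(adj) - covered)
    return motifs, leftovers


# Example graph (assumed): a and b feed c, c feeds d, d feeds e and f.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]
motifs, leftovers = find_motifs(edges)
for triple, shape in motifs:
    print(triple, shape)
print("ungrouped:", leftovers)
```

On this example the sketch groups `a`, `b`, `c` as a fan-in and `d`, `e`, `f` as a fan-out, leaving the edge from `c` to `d` as the only dependency that crosses between groups. That is the intuition behind routing a motif's internal dependencies locally and reserving the wider network for the remaining edges.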
