A Unified Framework for Mapping and Synthesis of Approximate R-Blocks CGRAs

Georgios Alexandris, Panagiotis Chaidos, Alexis Maras, Barry de Bruin, Manil Dev Gomony, Henk Corporaal, Dimitrios Soudris, Sotirios Xydis

TL;DR

The paper tackles the energy-efficiency gap in edge AI by proposing an end-to-end framework that co-optimizes hardware and software for approximate CGRAs. It integrates DRUM-based approximate multipliers and voltage islands into a heterogeneous CGRA, paired with an approximation-aware DNN mapping that uses per-output-channel importance factors under QoS constraints. Key contributions include a PyTorch/Brevitas-enhanced model for DRUM-aware training, a formal importance-factor mapping strategy, and a PASM-to-RTL flow that yields synthesizable designs incorporating memory and voltage-domain considerations. Evaluations on MobileNetV2/ImageNet show up to 440 GOPS/W with relatively small output error and an average power reduction of about 30% over the baseline architectures at roughly 2% area overhead, outperforming several state-of-the-art CGRAs in throughput and energy efficiency. The work demonstrates a practical path to high-throughput, low-power edge AI accelerators through calibrated approximation and voltage-domain optimization.

Abstract

The ever-increasing complexity and operational diversity of modern Neural Networks (NNs) have created a need for edge devices for AI applications that are both low-power and high-performance. Coarse-Grained Reconfigurable Architectures (CGRAs) form a promising design paradigm to address these challenges, delivering close-to-ASIC performance while allowing for hardware programmability. In this paper, we introduce a novel end-to-end exploration and synthesis framework for approximate CGRA processors that enables transparent and optimized integration and mapping of state-of-the-art approximate multiplication components into CGRAs. Our methodology introduces a per-channel exploration strategy that maps specific output features onto approximate components based on accuracy degradation constraints. This enables the optimization of the system's energy consumption while keeping the accuracy above a given threshold. At the circuit level, the approximate components have inherently shorter critical paths, which enables the creation of voltage islands that operate at reduced voltage levels. This key enabler allows us to reduce the overall power consumption by an average of 30% across our analyzed architectures, compared to their baseline counterparts, while incurring only a minimal 2% area overhead. The proposed methodology was evaluated on a widely used NN model, MobileNetV2, on the ImageNet dataset, demonstrating that the generated architectures can deliver up to 440 GOPS/W with relatively small output error during inference, outperforming several state-of-the-art CGRA architectures in terms of throughput and energy efficiency.
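
To make the approximate multiplication component concrete, the following is a minimal behavioral sketch of a DRUM-style (Dynamic Range Unbiased Multiplier) multiply as it is commonly described: each operand is reduced to a k-bit segment starting at its leading one, the segment's lowest bit is forced to 1 to reduce bias, and the small product is shifted back. Unsigned operands, the function names, and the choice of k are assumptions made for illustration; the configuration used in the paper is not stated here.

```python
from typing import Tuple


def drum_truncate(x: int, k: int) -> Tuple[int, int]:
    """Reduce x to a k-bit segment; returns (segment, shift) with x ~ segment << shift."""
    if x < 0 or k < 2:
        raise ValueError("expects an unsigned operand and k >= 2")
    width = x.bit_length()          # position of the leading one, plus one
    if width <= k:
        return x, 0                 # operand already fits: no approximation
    shift = width - k               # number of low-order bits dropped
    segment = (x >> shift) | 1      # keep the top k bits, force LSB to 1 (unbiasing)
    return segment, shift


def drum_multiply(a: int, b: int, k: int = 6) -> int:
    """Approximate a * b with a k x k multiplier plus a final shift."""
    seg_a, shift_a = drum_truncate(a, k)
    seg_b, shift_b = drum_truncate(b, k)
    return (seg_a * seg_b) << (shift_a + shift_b)


if __name__ == "__main__":
    exact = 200 * 100
    approx = drum_multiply(200, 100, k=4)
    print(exact, approx, abs(approx - exact) / exact)
```

Because only a k x k multiplier is needed regardless of operand width, the datapath is shorter than that of an exact multiplier, which is consistent with the shorter critical paths the abstract attributes to the approximate components.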

Paper Structure

This paper contains 17 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the architecture template and its integration with the host system.
  • Figure 2: SW Compilation & HW Generation Flow
  • Figure 3: Overview of the proposed DNN mapping flow. For each conv layer, the process begins by extracting the importance score for each output channel. Based on these scores, each channel is assigned to either accurate or approximate units.
  • Figure 4: Area and power comparison of the examined architectures before (R-Blocks) and after DRUM multiplier integration and voltage scaling.
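
The mapping flow summarized in the Figure 3 caption (score each output channel, then assign it to accurate or approximate units) can be sketched as below. This is an illustration under stated assumptions, not the paper's method: the importance proxy used here (L1 norm of a channel's weights), the `approx_fraction` parameter, and both function names are hypothetical. In the described framework the split is driven by accuracy degradation (QoS) constraints rather than by a fixed fraction.

```python
from typing import List, Sequence


def channel_importance(weights: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical importance proxy: L1 norm of each output channel's flattened weights."""
    return [sum(abs(w) for w in channel) for channel in weights]


def assign_channels(scores: Sequence[float], approx_fraction: float) -> List[str]:
    """Send the least important channels to approximate units, the rest to accurate ones."""
    if not 0.0 <= approx_fraction <= 1.0:
        raise ValueError("approx_fraction must lie in [0, 1]")
    n_approx = int(len(scores) * approx_fraction)
    # Channel indices sorted from least to most important.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    approx = set(order[:n_approx])
    return ["approximate" if i in approx else "accurate" for i in range(len(scores))]


if __name__ == "__main__":
    # Four output channels of a toy conv layer, weights already flattened per channel.
    layer = [[0.9, -0.8], [0.1, 0.0], [0.5, 0.4], [0.05, -0.02]]
    print(assign_channels(channel_importance(layer), approx_fraction=0.5))
```

A QoS-driven variant would wrap `assign_channels` in a search over `approx_fraction`, re-evaluating model accuracy at each step and keeping the largest fraction that stays above the accuracy threshold.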