COAC: Cross-layer Optimization of Accelerator Configurability for Efficient CNN Processing

Steven Colleman; Man Shi; Marian Verhelst

TL;DR

COAC addresses the challenge of deploying CNNs with diverse layer topologies on edge accelerators by introducing a cross-layer design-space exploration framework that explicitly models the overhead of supporting multiple spatial unrollings (SUs). It builds a unified overhead model around three blocks (data assignment, output aggregation, reshuffling buffer) and uses an automated flow to find Pareto-optimal SU combinations that balance energy, latency, and area. Empirical results across six networks show up to 38% EDP savings at about 9.5% area overhead when optimizing SU sets, with MobileNetv2 demonstrating the largest gains; COAC can achieve comparable end-to-end performance with far fewer SUs than Evolver while using less area. The work provides practical guidance for designing flexible NN accelerators and includes pruning to accelerate the search, making cross-layer configurability viable for real-world edge deployments.

Abstract

To achieve high accuracy, convolutional neural networks (CNNs) are growing in complexity and in the diversity of their layer types and topologies. This makes it very challenging to deploy such networks efficiently on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks search for the optimal execution schedule or hardware mapping of individual network layers by optimizing each layer's spatial unrolling (dataflow parallelization) and temporal unrolling (execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is not ideal either, as it misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across the target neural networks, while taking into account the overhead of supporting multiple unrollings. This paper therefore presents COAC, a cross-layer design space exploration and mapping framework that optimizes the flexibility of neural processing architectures by balancing configurability overhead against the resulting energy and latency savings for end-to-end inference. COAC not only provides a systematic analysis of the architectural overhead as a function of the supported spatial unrollings, but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% EDP (energy-delay product) savings for a set of six neural networks, at the expense of a relative area increase of 9.5%.
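The trade-off described in the abstract can be illustrated with a minimal sketch. It is not COAC's actual flow or overhead model. The layer names, SU names, cost numbers, `BASE_AREA` and the linear `AREA_PER_EXTRA_SU` overhead are all hypothetical placeholders, chosen only to show how candidate SU combinations might be scored on end-to-end EDP and area and then filtered to a Pareto front.

```python
from itertools import combinations

# Hypothetical per-layer (energy, latency) cost of each SU, in arbitrary units.
LAYER_COSTS = {
    "conv3x3":   {"SU_A": (4.0, 2.0), "SU_B": (6.0, 3.0), "SU_C": (5.0, 4.0)},
    "depthwise": {"SU_A": (9.0, 8.0), "SU_B": (3.0, 2.0), "SU_C": (4.0, 3.0)},
    "pointwise": {"SU_A": (5.0, 3.0), "SU_B": (6.0, 5.0), "SU_C": (2.0, 2.0)},
}
BASE_AREA = 100.0        # assumed area of a single-SU design
AREA_PER_EXTRA_SU = 5.0  # assumed linear overhead per additional supported SU


def evaluate(su_set):
    """Return (end-to-end EDP, area) for an accelerator supporting su_set."""
    energy = latency = 0.0
    for costs in LAYER_COSTS.values():
        # Simplification: each layer greedily uses the supported SU with the
        # lowest per-layer energy * latency.
        e, d = min((costs[su] for su in su_set), key=lambda c: c[0] * c[1])
        energy += e
        latency += d
    area = BASE_AREA + AREA_PER_EXTRA_SU * (len(su_set) - 1)
    return energy * latency, area


def pareto_front(points):
    """Keep the SU sets that no other set beats on both EDP and area."""
    front = []
    for name, (edp, area) in points.items():
        dominated = any(
            e2 <= edp and a2 <= area and (e2 < edp or a2 < area)
            for other, (e2, a2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


all_sus = sorted({su for costs in LAYER_COSTS.values() for su in costs})
candidates = {
    combo: evaluate(combo)
    for k in range(1, len(all_sus) + 1)
    for combo in combinations(all_sus, k)
}

for combo in pareto_front(candidates):
    edp, area = candidates[combo]
    print(combo, "EDP:", edp, "area:", area)
```

With these placeholder numbers, supporting more SUs lowers the EDP but raises the area, so the front contains one design point per area level. An exhaustive enumeration like this grows exponentially with the number of candidate SUs, which is why a practical search needs pruning.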

Paper Structure (31 sections, 32 equations, 14 figures, 8 tables)

Introduction: literature survey and contribution
Background and motivation
Spatial Unrolling on PE array
SU combining benefits for varying workloads
Impact of temporal unrolling on hardware utilization
Estimating SU reconfigurability overhead
Introduction
Data assignment block cost analysis
Output aggregation network cost analysis
Reshuffling buffer overhead modeling
Impact of SU similarity
Data assignment block
Output aggregation network
Reshuffling buffer
General insight
...and 16 more sections

Figures (14)

Figure 1: Problem statement and our contribution: systematically derive a combination of SUs that exploits variability while minimizing resource overhead.
Figure 2: Black box behavior of COAC. The internal working mechanism of COAC is explained throughout the paper.
Figure 3: Meaning of Spatial Unrolling versus Temporal Unrolling.
Figure 4: Data distribution is completely different for different SUs.
Figure 5: Flexible spatial unrolling (SU) template architecture, including the three overhead components (red/green/blue blocks): the data assignment block contains registers and MUXes, the output aggregation network contains MUXes and adders, and the reshuffling buffer contains registers and MUXes. A more detailed illustration of how to configure these blocks can be found in Fig. \ref{fig:input3}.
...and 9 more figures
