Performance evaluation of acceleration of convolutional layers on OpenEdgeCGRA

Nicolò Carpentieri, Juan Sapriza, Davide Schiavone, Daniele Jahier Pagliari, David Atienza, Maurizio Martina, Alessio Burrello

TL;DR

This work investigates how to map CNN convolutions onto the OpenEdgeCGRA edge accelerator, evaluating direct convolution and Im2col-based transformations across multiple tensor-parallelism axes. The study finds that direct convolution with weight parallelism provides the best latency and energy efficiency, achieving up to $0.665$ MAC/cycle and outperforming a CPU implementation by $3.4\times$ in energy and $9.9\times$ in latency. The results identify memory subsystem energy as a major contributor to total energy consumption and show that the weight-stationary direct approach is robust to hyperparameter changes. Overall, the paper demonstrates that a small, open-hardware, low-power CGRA can effectively accelerate CNN workloads for edge AI, making it a viable option for heterogeneous edge computing platforms.

Abstract

Efficiently deploying deep learning solutions on the edge has recently received increasing attention, and new platforms are emerging to support the growing demand for flexibility and high performance. In this work, we explore the efficient mapping of convolutional layers on an open-hardware, low-power Coarse-Grain Reconfigurable Array (CGRA), namely OpenEdgeCGRA. We explore both direct implementations of convolution and solutions that transform it into a matrix multiplication through an Im2col transformation, and we experiment with various tensor parallelism axes. We show that, for this hardware target, direct convolution coupled with weight parallelism reaches the best latency and energy efficiency, outperforming a CPU implementation by $3.4\times$ and $9.9\times$ in terms of energy and latency, respectively.
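To make the two strategies named in the abstract concrete, the following is a minimal functional sketch, not the paper's CGRA kernels. It assumes a single input channel, a single filter, stride 1, and no padding, and it uses the CNN convention of convolution without kernel flipping. The function names `direct_conv2d`, `im2col`, and `im2col_conv2d` are illustrative and do not come from the paper.

```python
def direct_conv2d(x, w):
    """Direct convolution: slide the K x K kernel over the input
    and accumulate one output element at a time."""
    H, W, K = len(x), len(x[0]), len(w)
    out_h, out_w = H - K + 1, W - K + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0
            for ki in range(K):
                for kj in range(K):
                    acc += x[i + ki][j + kj] * w[ki][kj]
            out[i][j] = acc
    return out


def im2col(x, K):
    """Im2col: copy every K x K input patch into one flat row.
    The result has (out_h * out_w) rows of K * K elements each."""
    H, W = len(x), len(x[0])
    return [
        [x[i + ki][j + kj] for ki in range(K) for kj in range(K)]
        for i in range(H - K + 1)
        for j in range(W - K + 1)
    ]


def im2col_conv2d(x, w):
    """Convolution as a matrix-vector product: (patch matrix) x (flat kernel)."""
    K = len(w)
    out_w = len(x[0]) - K + 1
    w_flat = [w[ki][kj] for ki in range(K) for kj in range(K)]
    flat = [sum(a * b for a, b in zip(row, w_flat)) for row in im2col(x, K)]
    # Reshape the flat result back into out_h rows of out_w elements.
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]
```

Both functions return the same output for the same input. The difference is in the memory footprint: the Im2col variant materializes a patch matrix in which each input element can appear up to K x K times, whereas the direct variant reads the input in place.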

Paper Structure (9 sections, 5 figures)


Figures (5)

  • Figure 1: (Top) 2D convolution scheme. (Bottom) Direct convolution with weight parallelism. Nine PEs perform dot products between constant weights and sequentially loaded inputs. The other PEs load new inputs or sum partial outputs.
  • Figure 2: Architecture of the HEEPsilon platform used as a test bench for this analysis, where the OpenEdgeCGRA is instantiated along with X-HEEP.
  • Figure 3: Operation distribution of different convolution mapping strategies. Other includes index updates, branch operations, and index manipulation.
  • Figure 4: Energy vs. Latency comparison.
  • Figure 5: Impact on memory and performance of different hyperparameters. Pareto-optimal results are highlighted with a greater color intensity. The experiments of the runtime section are highlighted by black circles.
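The weight-parallel mapping described in the Figure 1 caption can be sketched functionally as follows. This is a hedged software model, not OpenEdgeCGRA code. It assumes a single channel, stride 1, and no padding, and it models each "PE" as a tuple holding one constant kernel weight. It is not cycle-accurate and ignores the PEs that load inputs or sum partial outputs. The name `weight_parallel_conv2d` is hypothetical.

```python
def weight_parallel_conv2d(x, w):
    """Weight-stationary model: one modeled PE per kernel weight.
    Each PE keeps its weight constant while inputs stream past it."""
    K = len(w)
    # (row offset, column offset, constant weight) for each modeled PE.
    pes = [(ki, kj, w[ki][kj]) for ki in range(K) for kj in range(K)]
    out_h, out_w = len(x) - K + 1, len(x[0]) - K + 1
    out, macs = [], 0
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # All PEs multiply "in parallel"; the products are then reduced.
            products = [wt * x[i + ki][j + kj] for ki, kj, wt in pes]
            macs += len(products)
            row.append(sum(products))
        out.append(row)
    return out, macs
```

The returned `macs` count is the numerator of a MAC/cycle figure such as the one reported in the TL;DR. The denominator, the cycle count, has to be measured on the hardware and is not modeled here.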