Performance evaluation of acceleration of convolutional layers on OpenEdgeCGRA
Nicolò Carpentieri, Juan Sapriza, Davide Schiavone, Daniele Jahier Pagliari, David Atienza, Maurizio Martina, Alessio Burrello
TL;DR
This work investigates the mapping of CNN convolutional layers onto the OpenEdgeCGRA edge accelerator, evaluating direct convolution and Im2col-based transformations across multiple tensor-parallelism axes. The study finds that direct convolution with weight parallelism provides the best latency and energy efficiency, achieving up to $0.665$ MAC/cycle and outperforming a CPU implementation by $3.4\times$ in energy and $9.9\times$ in latency. The results highlight memory subsystem energy as a major contributor and show strong robustness of the weight-stationary direct approach to hyperparameter changes. Overall, the paper demonstrates that a small, open-hardware, low-power CGRA can effectively accelerate CNN workloads for edge AI, offering a viable option for heterogeneous edge computing platforms.
Abstract
The efficient deployment of deep learning solutions on the edge has recently received increasing attention. New platforms are emerging to support the growing demand for flexibility and high performance. In this work, we explore the efficient mapping of convolutional layers on an open-hardware, low-power Coarse-Grain Reconfigurable Array (CGRA), namely OpenEdgeCGRA. We explore both direct implementations of convolution and solutions that transform it into a matrix multiplication through an Im2col transformation, and experiment with various tensor parallelism axes. We show that, for this hardware target, direct convolution coupled with weight parallelism achieves the best latency and energy efficiency, outperforming a CPU implementation by 3.4x and 9.9x in terms of energy and latency, respectively.
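To make the two formulations concrete, the following is a minimal reference sketch in plain Python of a direct convolution and of the same convolution computed as a matrix multiplication after an Im2col unrolling. It only illustrates that the two are numerically equivalent; it does not reproduce the CGRA mappings or the parallelism strategies evaluated in this work. The function names, the channel-first layout, and the stride-1, no-padding setting are assumptions made for illustration.

```python
def direct_conv(x, w):
    """Direct convolution (illustrative). x: [C][H][W], w: [K][C][FH][FW].
    Assumes stride 1 and no padding."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, FH, FW = len(w), len(w[0][0]), len(w[0][0][0])
    OH, OW = H - FH + 1, W - FW + 1
    out = [[[0] * OW for _ in range(OH)] for _ in range(K)]
    for k in range(K):                      # output channel
        for oy in range(OH):                # output row
            for ox in range(OW):            # output column
                acc = 0
                for c in range(C):          # input channel
                    for fy in range(FH):    # filter row
                        for fx in range(FW):  # filter column
                            acc += x[c][oy + fy][ox + fx] * w[k][c][fy][fx]
                out[k][oy][ox] = acc
    return out


def im2col(x, FH, FW):
    """Unroll every receptive field of x into one row of length C*FH*FW."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    OH, OW = H - FH + 1, W - FW + 1
    return [
        [x[c][oy + fy][ox + fx]
         for c in range(C) for fy in range(FH) for fx in range(FW)]
        for oy in range(OH) for ox in range(OW)
    ]


def im2col_conv(x, w):
    """Same convolution, expressed as a matrix multiplication between the
    Im2col matrix and the flattened filters."""
    K, C, FH, FW = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    H, W = len(x[0]), len(x[0][0])
    OH, OW = H - FH + 1, W - FW + 1
    cols = im2col(x, FH, FW)                # (OH*OW) x (C*FH*FW)
    flat_w = [
        [w[k][c][fy][fx]
         for c in range(C) for fy in range(FH) for fx in range(FW)]
        for k in range(K)
    ]                                       # K x (C*FH*FW)
    out = [[[0] * OW for _ in range(OH)] for _ in range(K)]
    for k in range(K):
        for i, row in enumerate(cols):
            # One dot product per output pixel; i indexes pixels row-major.
            out[k][i // OW][i % OW] = sum(a * b for a, b in zip(row, flat_w[k]))
    return out
```

The sketch also shows the memory trade-off behind the Im2col approach: the unrolled matrix holds `OH*OW` rows of `C*FH*FW` elements each, so overlapping input pixels are stored several times, whereas the direct loop nest reads them in place.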
