A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications

Panagiotis Mousouliotis; Georgios Keramidas

TL;DR

The paper addresses the deployment of CNNs for embedded deep learning on resource-limited FPGA SoCs by introducing a parameterizable CNN accelerator template implemented with high-level synthesis (HLS). The design has a two-part architecture (CONV-PART and MPOOL-PART), uses an 8-bit dynamic fixed-point quantization scheme, and applies HW/SW co-design optimizations, enabling latency, power, and area trade-offs across a range of CNN workloads. Key contributions include the detailed accelerator architecture, the quantization strategy, HLS-driven parallelism, and an extensive evaluation that reports favorable latency and power compared with related works. The work presents a flexible, scalable approach for embedded DL that can be extended to other DL applications and workload families.

Abstract

Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.

Paper Structure (17 sections, 2 equations, 4 figures, 5 tables)

This paper contains 17 sections, 2 equations, 4 figures, 5 tables.

Introduction
Related Work
Parameterized Accelerator Template
Supported Operations
Convolution
Optional Activation Layer and Max-Pool
Summary
Architecture
Convolution Part
Max-pool Part
Quantization Strategy
Parallelism Exploitation using HLS
Design Parameters
Application-level Design
Experiments
...and 2 more sections

Figures (4)

Figure 1: Operation of the convolution layer. The number of 3D filters is equal to ${C_o}$, the number of output channels.
Figure 2: Accelerator block diagram augmented with HLS optimizations.
Figure 3: The accelerator implementation of the MPOOL operation for the case of a $3 \times 3$ filter. The channel dimension is omitted for simplicity.
Figure 4: The implementation of the CONV operation.
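
As a reading aid for the Figure 1 caption, the relation between the number of 3D filters and the number of output channels ${C_o}$ can be sketched in a few lines of Python. This is a minimal illustration of a plain direct convolution, assuming stride 1, no padding, and nested Python lists; the function name `conv2d_direct` and its argument layout are hypothetical and are not taken from the paper's accelerator implementation.

```python
def conv2d_direct(ifmap, filters):
    """Direct convolution with stride 1 and no padding (both are assumptions).

    ifmap:   [C_i][H][W] input feature map
    filters: [C_o][C_i][K][K], i.e. one 3D filter per output channel
    returns: [C_o][H-K+1][W-K+1] output feature map
    """
    c_i, h, w = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    k = len(filters[0][0])
    h_out, w_out = h - k + 1, w - k + 1
    ofmap = []
    for f in filters:  # each 3D filter produces exactly one output channel
        channel = [[0] * w_out for _ in range(h_out)]
        for y in range(h_out):
            for x in range(w_out):
                acc = 0
                # accumulate over all input channels and the K x K window
                for c in range(c_i):
                    for ky in range(k):
                        for kx in range(k):
                            acc += ifmap[c][y + ky][x + kx] * f[c][ky][kx]
                channel[y][x] = acc
        ofmap.append(channel)
    return ofmap


# Hypothetical toy sizes: C_i = 2, H = W = 3, C_o = 3, K = 2
ifmap = [[[1] * 3 for _ in range(3)] for _ in range(2)]
filters = [[[[1] * 2 for _ in range(2)] for _ in range(2)] for _ in range(3)]
ofmap = conv2d_direct(ifmap, filters)
```

With these toy sizes, the output has as many channels as there are filters, which is the point the caption makes.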
