A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow

Zhiyuan Zhao; Yihao Chen; Pengcheng Feng; Jixing Li; Gang Chen; Rongxuan Shen; Huaxiang Lu

TL;DR

This work tackles memory bottlenecks and low DSP utilization in FPGA accelerators for lightweight CNNs (LWCNNs) by introducing a streaming architecture with hybrid computing engines (FRCE for shallow layers and WRCE for deep layers) and a balanced dataflow strategy. A fine-grained parallel mechanism (FGPM) and a dataflow-oriented line buffer scheme improve resource mapping and mitigate data congestion, while a resource-aware memory and parallelism allocation framework selects the FRCE/WRCE boundary and dynamically tunes per-layer parallelism. On MobileNetV2 and ShuffleNetV2, the approach reduces on-chip memory by up to 68.3%, lowers off-chip feature-map (FM) traffic, and reaches a state-of-the-art MAC efficiency of up to 94.58% and up to 2092.4 FPS at 95% DSP utilization. These results demonstrate significant improvements in throughput, memory efficiency, and scalability for LWCNN acceleration on FPGAs, enabling robust edge deployment of compact networks.

Abstract

FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators focus on a single-Computing-Engine (CE) architecture with local optimization. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency due to their layer-by-layer dataflow and unified resource mapping mechanisms. To tackle these issues, a novel multi-CE-based accelerator with balanced dataflow is proposed to efficiently accelerate LWCNNs through memory-oriented and computing-oriented optimizations. Firstly, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while keeping the on-chip buffer size small. Secondly, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology, based on a performance model, is proposed to achieve better performance and scalability. The proposed accelerator is evaluated on the Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results demonstrate that the proposed accelerator can save up to 68.3% of on-chip memory size with reduced off-chip memory access compared to the reference design. It achieves a performance of up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.
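
The abstract reports MAC efficiency without defining it at this point. A minimal sketch follows, assuming the common definition: the achieved MAC rate divided by the peak MAC rate of the allocated multipliers. The function name, the parameter names, and every numeric input are hypothetical and chosen only for illustration; none of them are figures from the paper.

```python
def mac_efficiency(macs_per_frame, fps, num_multipliers, clock_hz,
                   macs_per_multiplier_per_cycle=1):
    """Return achieved MAC rate over peak MAC rate, as a fraction.

    Assumed definition (not taken from the paper):
        achieved = MACs per frame * frames per second
        peak     = multipliers * MACs per multiplier per cycle * clock rate
    """
    inputs = (macs_per_frame, fps, num_multipliers, clock_hz,
              macs_per_multiplier_per_cycle)
    if min(inputs) <= 0:
        raise ValueError("all inputs must be positive")
    achieved = macs_per_frame * fps
    peak = num_multipliers * macs_per_multiplier_per_cycle * clock_hz
    return achieved / peak


# Hypothetical inputs, for illustration only (not the paper's configuration).
example = mac_efficiency(macs_per_frame=300e6, fps=1000,
                         num_multipliers=2000, clock_hz=200e6)
print(f"MAC efficiency: {example:.2%}")
```

Under this assumed definition, a value close to 1 means the allocated multipliers are rarely idle, which is the condition a balanced dataflow aims for.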

Paper Structure (22 sections, 7 equations, 17 figures, 5 tables, 2 algorithms)

Introduction
Background & Related Work
LWCNNs with DSCs & SCBs
Distribution of FMs and Weights in LWCNNs
Related Works About LWCNN Accelerators
Accelerator with Hybrid Computing-Engines
Architecture Overview
Hybrid Computing Engines
Layer-Specific CE Design
Adaptive bandwidth computing engine
Dataflow order converter
Accelerator with Balanced Dataflow
Fine-grained Parallel Mechanism
Dataflow-oriented Line Buffer Scheme
Resource-Aware Memory and Parallelism Allocation
...and 7 more sections

Figures (17)

Figure 1: Percentage of DSC and SCB structures in major LWCNNs.
Figure 2: Low computational density operations in LWCNNs. (a) Depthwise Separable Convolution (DSC); (b) Skip-Connection Block (SCB).
Figure 3: Memory requirements for FMs and Weights in LWCNNs. The data for each block is the sum of all the layers within it. (a) MobileNetV2; (b) ShuffleNetV2.
Figure 4: Architecture of the proposed accelerator.
Figure 5: Fully reused feature map scheme performed in line buffer of a $3 \times 3$ convolutional layer.
...and 12 more figures
