An FPGA-Based Accelerator Enabling Efficient Support for CNNs with Arbitrary Kernel Sizes

Miaoxin Wang, Xiao Wu, Jun Lin, Zhongfeng Wang

TL;DR

The paper tackles efficient FPGA-based inference for CNNs with arbitrary, and in particular large, kernel sizes, targeting the data-movement and memory overheads that degrade the efficiency of accelerators built around fixed kernel sizes. It introduces a flexible dataflow (Z-flow) that maximizes data reuse and a kernel-segmentation (Kseg) scheme that reduces the on-chip storage needed for overlapped data in convolutions with arbitrary kernel sizes. At the block level, vertical-fused (VF) and horizontal-fused (HF) strategies optimize fused and multi-branch structures, enabling continuous computation without excessive off-chip transfers. On an Intel Arria 10 FPGA, the accelerator achieves up to 3.91× better DSP efficiency than prior art on the same network, and reaches 169.68 GOPS on RepLKNet-31 and 244.55 GOPS on PyConvResNet-50, demonstrating practical viability for large-kernel CNNs on FPGA hardware.

Abstract

Convolutional neural networks (CNNs) with large kernels, drawing inspiration from the key operations of vision transformers (ViTs), have demonstrated impressive performance in various vision-based applications. To address the degradation of computational efficiency in existing designs when supporting large-kernel convolutions, an FPGA-based inference accelerator is proposed for the efficient deployment of CNNs with arbitrary kernel sizes. First, a Z-flow method is presented to optimize the computing dataflow by maximizing data reuse opportunities. In addition, the proposed design incorporates a kernel-segmentation (Kseg) scheme that extends support to large-kernel convolutions while significantly reducing the storage requirements for overlapped data. Moreover, based on an analysis of typical block structures in emerging CNNs, vertical-fused (VF) and horizontal-fused (HF) methods are developed to optimize CNN deployments from both computation and transmission perspectives. The proposed hardware accelerator, evaluated on an Intel Arria 10 FPGA, achieves up to 3.91 times better DSP efficiency than prior art on the same network. In particular, it demonstrates efficient support for large-kernel CNNs, achieving throughputs of 169.68 GOPS and 244.55 GOPS for RepLKNet-31 and PyConvResNet-50, respectively, both of which are implemented on hardware for the first time.
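
The arithmetic idea behind segmenting a kernel can be illustrated with a minimal NumPy sketch. It is not the paper's Kseg hardware scheme, whose details are not given here; it only shows the underlying identity that a large-kernel convolution equals the sum of partial convolutions over smaller kernel tiles, each applied to a shifted window of the input. The function names `conv2d_valid` and `conv2d_segmented`, the tile size `seg=3`, and the single-channel, stride-1, no-padding setting are all assumptions made for illustration.

```python
import numpy as np

def conv2d_valid(x, w):
    """Direct stride-1, no-padding 2-D correlation of x (H x W) with w (Kh x Kw)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # Each kernel weight scales a shifted view of the input.
            out += w[i, j] * x[i:i + oh, j:j + ow]
    return out

def conv2d_segmented(x, w, seg=3):
    """Same result, accumulated one kernel tile (at most seg x seg) at a time."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for r in range(0, kh, seg):
        for c in range(0, kw, seg):
            tile = w[r:r + seg, c:c + seg]  # edge tiles may be smaller than seg
            th, tw = tile.shape
            # Input window this tile needs to cover every output position.
            window = x[r:r + oh + th - 1, c:c + ow + tw - 1]
            out += conv2d_valid(window, tile)  # partial sum for this tile
    return out

# Integer data keeps the comparison exact (no floating-point rounding).
rng = np.random.default_rng(0)
x = rng.integers(-4, 5, size=(16, 16))
w = rng.integers(-4, 5, size=(7, 7))
assert np.array_equal(conv2d_segmented(x, w, seg=3), conv2d_valid(x, w))
```

Because each tile touches only a small window of the kernel at a time, a fixed-size compute array can in principle process kernels of any size by iterating over tiles and accumulating partial sums; how the actual accelerator schedules and buffers these tiles is described in the paper itself.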

Paper Structure (10 sections, 1 equation, 3 figures, 1 table)

Figures (3)

  • Figure 1: The overall architecture and computing dataflow. (a) The overall architecture of the proposed accelerator. (b) Convolution operation and design variables. (c) MAC array architecture. (d) Detailed computing dataflow of the Z-flow method when the stride is 1. (e) Segmentation scheme of kernel map.
  • Figure 2: Fusion methods and corresponding execution scheduling. (a) Structures of typical blocks. (b) Data transactions between on-chip buffers and computing units. (c) The execution scheduling of different fusion strategies.
  • Figure 3: Performance improvement (PI) of different typical blocks. (a) PI of MBConv blocks in MobileNetV3 with the VF method. (b) PI of RepLK blocks with the VF method. (c) PI of PyConv blocks with the HF method.