Data-Rate-Aware High-Speed CNN Inference on FPGAs

Tobias Habermann; Martin Kumm

TL;DR

Experimental results show substantial reductions in arithmetic resources compared to previous designs, enabling efficient implementation of complex CNNs on a single FPGA across a wide range of data rates.

Abstract

Dataflow-based CNN accelerators on FPGAs achieve low latency and high throughput by mapping the computations of each layer directly to corresponding hardware units. However, layers such as pooling and strided convolutions produce less data at their output than they receive at their input, strongly affecting the data rate of the following layers. This leads to underutilization in fully unrolled designs. While prior work introduced data-rate-aware layer-wise adaptation, determining the most efficient implementation remains challenging. This paper presents a data-rate-aware CNN accelerator architecture for multi-pixel processing. Building on existing analytical models, the proposed method performs design-space exploration to identify configurations that improve hardware utilization and resource efficiency while preserving a continuous flow of data that keeps all hardware units busy. Experimental results show substantial reductions in arithmetic resources compared to previous designs, enabling efficient implementation of complex CNNs on a single FPGA across a wide range of data rates.
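The data-rate effect described in the abstract can be illustrated with a small sketch. It is not the paper's analytical model: it only assumes that a layer with spatial stride s keeps one of every s*s positions, and it ignores padding and border effects. The function name `layer_data_rates` and the five-layer stride list are hypothetical and chosen for illustration.

```python
from fractions import Fraction


def layer_data_rates(input_rate, strides):
    """Return the pixel rate (pixels per clock cycle) entering each layer.

    input_rate: pixels per clock cycle fed to the first layer.
    strides: spatial stride of each layer (1 for a regular convolution,
             2 for a 2x2 pooling or a stride-2 convolution).
    Assumption: padding and border effects are ignored.
    """
    rates = []
    rate = Fraction(input_rate)
    for stride in strides:
        rates.append(rate)
        # A stride-s layer keeps one of every s*s spatial positions,
        # so the rate seen by the next layer drops by that factor.
        rate /= stride * stride
    return rates


# Hypothetical network: conv, pool, conv, pool, conv, fed two pixels per cycle.
strides = [1, 2, 1, 2, 1]
rates = layer_data_rates(2, strides)
for index, rate in enumerate(rates):
    print(f"layer {index}: {rate} pixels/cycle")
```

In this toy setting the last layer receives one pixel every eight clock cycles, so a fully unrolled unit there would sit idle most of the time. That is the underutilization the abstract refers to.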

Paper Structure (10 sections, 11 equations, 6 figures, 2 tables)

Introduction
Adapting Continuous-Flow
Continuous-Flow architecture
Improved Continuous-Flow Architecture
Defining the Constraints of $j$ and $h$
Defining the layer implementation parameters of layer $\ell$
Adapting the Architecture for Multi-pixel Processing
Experiments
Future Work
Conclusion

Figures (6)

Figure 1: The KPU base component presented in main_ref.
Figure 2: The FCU base component presented in main_ref.
Figure 3: The structure of convolutional and fully connected layers.
Figure 4: An example of a convolutional layer implementation that can process two pixels per clock cycle.
Figure 5: A non-transposed KPU that can process two pixels per clock cycle.
...and 1 more figure
