A Composable Dynamic Sparse Dataflow Architecture for Efficient Event-based Vision Processing on FPGA

Yizhao Gao, Baoheng Zhang, Yuhao Ding, Hayden Kwok-Hay So

TL;DR

This work tackles the challenge of real-time, energy-efficient inference for event-based vision by exploiting intrinsic spatial sparsity with a modular, on-chip sparse dataflow on FPGA. The authors introduce ESDA, a composable framework of modules that share a uniform token-feature interface and implement submanifold sparse convolution, enabling rapid construction of a customized accelerator for each model. A sparsity-aware co-optimization flow guides hardware mapping and model search, yielding substantial speedups and energy savings across multiple datasets in comparisons against dense DNNs, embedded GPUs, and neuromorphic hardware. The approach advances edge intelligence for event-based vision by delivering low-latency, low-power DNN inference with a flexible design space for real-world deployments.

Abstract

Event-based vision represents a paradigm shift in how vision information is captured and processed. By responding only to dynamic intensity changes in the scene, event-based sensing produces far less data than conventional frame-based cameras, promising to springboard a new generation of high-speed, low-power machines for edge intelligence. However, efficiently processing such dynamically sparse input from event cameras in real time, particularly with complex deep neural networks (DNNs), remains a formidable challenge. Existing solutions that employ GPUs and other frame-based DNN accelerators often struggle to process the dynamically sparse event data efficiently, missing the opportunity to exploit that sparsity. To address this, we propose ESDA, a composable dynamic sparse dataflow architecture that allows customized DNN accelerators to be constructed rapidly on FPGAs for event-based vision tasks. ESDA is a modular system composed of a set of parametrizable modules, one for each network layer type. These modules share a uniform sparse token-feature interface and can be connected easily to compose an all-on-chip dataflow accelerator on FPGA for each network model. To fully exploit the intrinsic sparsity in event data, ESDA incorporates submanifold sparse convolutions, which greatly enhance activation sparsity throughout the layers while simplifying hardware implementation. Finally, we present a framework that co-optimizes the network architecture and the hardware implementation, allowing tradeoffs between accuracy and performance. Experimental results demonstrate that, compared with existing GPU and hardware-accelerated solutions, ESDA achieves substantial speedup and improvement in energy efficiency across different applications, and that it opens up a much wider design space for real-world deployments.

Paper Structure

This paper contains 25 sections, 6 equations, 14 figures, and 1 table.

Figures (14)

  • Figure 1: Working principle of an event camera. The event camera captures only changes in light intensity, emitted as spiking events in AER format ([x, y, ±1, timestamp]). This figure shows a recorded sample from the DvsGesture dataset in which a man rotates his left arm counter-clockwise. Because event cameras respond only to change, only the man's motion is captured. For vision tasks such as object recognition, a certain number of events are usually grouped to form a 2D representation that serves as the DNN input.
  • Figure 2: Overall architecture of an ESDA accelerator.
  • Figure 3: Comparison of standard convolution and submanifold sparse convolution. Gray/green locations denote non-zero pixels. (a) With stride $s = 1$, the output locations of submanifold convolution are restricted to be identical to the input locations. At the location marked "x", standard convolution produces a valid non-zero output while submanifold convolution does not. (b) With stride $s > 1$ (2 in the figure), an output location is non-zero if the corresponding $s\times s$ input grid contains any non-zero pixel.
  • Figure 4: Convolution $1\times 1$ Module.
  • Figure 5: Convolution $3\times 3$ Module.
  • ...and 9 more figures
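
The grouping step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the paper's preprocessing: the two-channel per-polarity count histogram is an assumed representation (the caption only says events are grouped into a 2D representation), and the names `events_to_histogram` and `active_sites` are hypothetical.

```python
def events_to_histogram(events, height, width):
    """Accumulate AER events (x, y, polarity, timestamp) into a 2D count grid.

    Assumed layout: each pixel holds [positive_count, negative_count].
    """
    hist = [[[0, 0] for _ in range(width)] for _ in range(height)]
    for x, y, polarity, _timestamp in events:
        channel = 0 if polarity > 0 else 1  # +1 -> channel 0, -1 -> channel 1
        hist[y][x][channel] += 1
    return hist


def active_sites(hist):
    """Return the sorted (y, x) locations that received at least one event."""
    return sorted(
        (y, x)
        for y, row in enumerate(hist)
        for x, cell in enumerate(row)
        if any(cell)
    )


# Four made-up events on a 4x4 sensor: three at pixel (x=1, y=0), one at (x=3, y=2).
events = [(1, 0, +1, 10), (1, 0, -1, 12), (3, 2, +1, 15), (1, 0, +1, 20)]
hist = events_to_histogram(events, height=4, width=4)
sites = active_sites(hist)
print(sites)  # only 2 of the 16 pixels are non-zero
```

Only the pixels that saw an event become non-zero, which is the spatial sparsity a sparse accelerator can exploit.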
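
The output-location rules in the Figure 3 caption can be illustrated with a short sketch that tracks only which sites are non-zero, not the feature values. It assumes zero padding for the standard convolution and non-overlapping $s\times s$ input grids aligned at the origin for the strided case; both are assumptions of this example, as are the function names.

```python
def standard_conv_sites(active, height, width, k=3):
    """Sites where a k x k, stride-1 standard convolution can be non-zero.

    Any output whose window overlaps a non-zero input may be non-zero,
    so each active input dilates into its k x k neighborhood.
    """
    r = k // 2
    out = set()
    for y, x in active:
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    out.add((ny, nx))
    return out


def submanifold_sites(active, stride=1):
    """Non-zero output sites under the rule in the Figure 3 caption.

    stride == 1: output sites are identical to the input sites.
    stride  > 1: an output site is non-zero if its s x s input grid
                 (assumed aligned at the origin) contains a non-zero.
    """
    if stride == 1:
        return set(active)
    return {(y // stride, x // stride) for y, x in active}


active = {(1, 1), (1, 2)}  # two non-zero pixels on a 4x4 input
dense = standard_conv_sites(active, height=4, width=4)
sparse_s1 = submanifold_sites(active, stride=1)
sparse_s2 = submanifold_sites(active, stride=2)
print(len(dense), len(sparse_s1), len(sparse_s2))
```

In this toy input the standard convolution spreads two active pixels over twelve output sites, while the submanifold rule keeps two, which is why activation sparsity is preserved from layer to layer.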