RN-Net: Reservoir Nodes-Enabled Neuromorphic Vision Sensing Network

Sangmin Yoo, Eric Yeu-Jer Lee, Ziyu Wang, Xinxin Wang, Wei D. Lu

TL;DR

The paper addresses the challenge of efficiently processing asynchronous event streams from event-based cameras without expensive frame-based descriptors or costly recurrent networks. It introduces RN-Net, a hybrid architecture that places two reservoir-node layers (R_in for local temporal encoding and R_f for global temporal encoding) ahead of and within conventional CNN blocks, using STM memristor-based dynamics to encode temporal information in real time with standard backpropagation training. RN-Net achieves state-of-the-art or near state-of-the-art accuracy on CIFAR10-DVS, N-Caltech 101, DVS128 Gesture, N-CARS, and DVS Lip with a compact network, while delivering favorable power estimates (around 10–12 mW per video) due to on-sensor temporal encoding and absence of heavy recurrent units. The work demonstrates a practical, hardware-friendly route for low-cost, real-time neuromorphic vision processing and suggests future extensions with emerging devices such as optical neural transistors. The update rule $G_t = P_c \cdot (G_{max}-G_{t-1}) \cdot \nabla\delta_{spk}(t) + G_{t-1} \cdot e^{-\frac{1}{\tau}}$, with $P_c$ and $\tau$ tuning local/global memory, underpins the reservoir dynamics that enable rich temporal feature representations.
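The reservoir update rule quoted in the TL;DR can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the source: the spike term is treated as a 0/1 indicator of whether an event arrives at step t, time advances in unit steps, and the function names (`reservoir_step`, `run_node`) and the values chosen for `p_c`, `tau`, and `g_max` are purely illustrative. It is not the authors' implementation.

```python
import math


def reservoir_step(g_prev, spike, p_c, tau, g_max=1.0):
    """Apply one step of G_t = P_c*(G_max - G_{t-1})*spike + G_{t-1}*exp(-1/tau).

    `spike` is assumed to be 1 when an event arrives at this step and 0 otherwise.
    """
    potentiation = p_c * (g_max - g_prev) * spike  # spike-driven increase, saturating at g_max
    decay = g_prev * math.exp(-1.0 / tau)          # exponential relaxation of the previous state
    return potentiation + decay


def run_node(spikes, p_c, tau, g_max=1.0, g0=0.0):
    """Return the state trace of a single node driven by a 0/1 spike train."""
    g = g0
    trace = []
    for s in spikes:
        g = reservoir_step(g, s, p_c, tau, g_max)
        trace.append(g)
    return trace


# Hypothetical spike train and parameters, chosen only for illustration.
spikes = [1, 0, 0, 1, 1, 0, 0, 0]
short_memory = run_node(spikes, p_c=0.5, tau=2.0)   # small tau: state fades quickly
long_memory = run_node(spikes, p_c=0.5, tau=20.0)   # large tau: state persists longer

for step, (gs, gl) in enumerate(zip(short_memory, long_memory)):
    print(f"t={step}  tau=2: {gs:.3f}  tau=20: {gl:.3f}")
```

Under these assumptions, a larger `tau` leaves more of the earlier activity in the state at the end of the sequence, which is one way to read the TL;DR's remark that $P_c$ and $\tau$ tune local versus global memory.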

Abstract

Event-based cameras are inspired by the sparse and asynchronous spike representation of the biological visual system. However, processing the event data requires either using expensive feature descriptors to transform spikes into frames, or using spiking neural networks that are expensive to train. In this work, we propose a neural network architecture, Reservoir Nodes-enabled neuromorphic vision sensing Network (RN-Net), based on simple convolution layers integrated with dynamic temporal encoding reservoirs for local and global spatiotemporal feature detection with low hardware and training costs. The RN-Net allows efficient processing of asynchronous temporal features, and achieves the highest accuracy reported to date for DVS128 Gesture (99.2%) and one of the highest accuracies for the DVS Lip dataset (67.5%) at a much smaller network size. By leveraging the internal device and circuit dynamics, asynchronous temporal feature encoding can be implemented at very low hardware cost without preprocessing or dedicated memory and arithmetic units. The use of simple DNN blocks and standard backpropagation-based training rules further reduces implementation costs.

Paper Structure (19 sections, 9 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: RN-Net structure. $\emph{R}_{in}$ and $\emph{R}_{f}$ are reservoir layers for local and global temporal feature encoding, respectively. Bottom left: outputs from $\emph{R}_{in}$ for a representative input in the DVS Lip dataset. Bottom right: outputs from $\emph{R}_{f}$. Outputs from $\emph{R}_{f}$ are reshaped in 2D for better visualization. Deeper color in $\emph{R}_{in}$, $\emph{R}_{f}$ and the output layers of Convolution (Conv) blocks and Fully-Connected ($\emph{FC}_{1-2}$) layers represents a higher analog value. Hidden layers within Conv blocks are not shown. The DNN structure for the DVS Lip dataset is illustrated as a representative example. $\emph{C}_{N}$, $\emph{D}_{N}$, $\emph{MP}$, $\emph{FC}_{N}$ represent the N-th Conv layer, the kernel Depth of the N-th Conv layer, the Max Pooling layer and the N-th FC layer, respectively.
  • Figure 2: Dynamics of a reservoir node (blue) and a time surface node (red) under identical spikes (black) shown below.
  • Figure 3: Temporally consecutive states of the input reservoir nodes, responding to asynchronous events in the (a) DVS128 Gesture and (b) DVS Lip datasets. Each state is retrieved at a constant time interval of 30ms. Deeper red represents a higher amplitude value of an RN state.
  • Figure 4: (a) Visualization of asynchronous input spikes and (b) spikes generated after the spike conversion (SC) layer.
  • Figure 5: Visualization of outputs from $\emph{R}_{f}$ over the whole 1.5s clip, along with spikes generated from the spike conversion layer. For Data2, whose input video length is only 0.9s, the RN states continue to relax and are still used for classification.
  • ...and 1 more figures