RN-Net: Reservoir Nodes-Enabled Neuromorphic Vision Sensing Network
Sangmin Yoo, Eric Yeu-Jer Lee, Ziyu Wang, Xinxin Wang, Wei D. Lu
TL;DR
The paper addresses the challenge of efficiently processing asynchronous event streams from event-based cameras without expensive frame-based descriptors or costly recurrent networks. It introduces RN-Net, a hybrid architecture that places two reservoir-node layers (R_in for local temporal encoding and R_f for global temporal encoding) ahead of and within conventional CNN blocks. The reservoir nodes use the dynamics of short-term memory (STM) memristors to encode temporal information in real time, and the network is trained with standard backpropagation. RN-Net achieves state-of-the-art or near state-of-the-art accuracy on CIFAR10-DVS, N-Caltech 101, DVS128 Gesture, N-CARS, and DVS Lip with a compact network, while delivering favorable power estimates (around 10–12 mW per video) owing to on-sensor temporal encoding and the absence of heavy recurrent units. The work demonstrates a practical, hardware-friendly route to low-cost, real-time neuromorphic vision processing and suggests future extensions with emerging devices such as optical neural transistors. The reservoir dynamics follow $G_t = P_c \cdot (G_{max} - G_{t-1}) \cdot \nabla\delta_{spk}(t) + G_{t-1} \cdot e^{-\frac{1}{\tau}}$, where $P_c$ and $\tau$ tune local versus global memory; this update underpins the rich temporal feature representations.
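The reservoir update in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that $G$ is a scalar node state, that the spike term $\nabla\delta_{spk}(t)$ can be read as a 0/1 indicator of whether an event arrives at time step $t$, and that time advances in unit steps. The function names `reservoir_step` and `encode_spike_train`, and the default `g_max=1.0`, are hypothetical choices made for illustration.

```python
import math


def reservoir_step(g_prev, spike, p_c, tau, g_max=1.0):
    """One unit time step of G_t = P_c*(G_max - G_{t-1})*spike + G_{t-1}*exp(-1/tau).

    `spike` is assumed to be 1 when an event arrives at this step, else 0.
    """
    potentiation = p_c * (g_max - g_prev) * spike  # spike-driven increase
    decay = g_prev * math.exp(-1.0 / tau)          # exponential forgetting
    return potentiation + decay


def encode_spike_train(spikes, p_c, tau, g_max=1.0, g0=0.0):
    """Run the update over a sequence of 0/1 spikes and return the state trace."""
    g = g0
    trace = []
    for s in spikes:
        g = reservoir_step(g, s, p_c, tau, g_max)
        trace.append(g)
    return trace


if __name__ == "__main__":
    events = [1, 0, 0, 1, 0, 0, 0]
    # Shorter tau forgets quickly; longer tau keeps a longer history.
    print(encode_spike_train(events, p_c=0.5, tau=1.0))
    print(encode_spike_train(events, p_c=0.5, tau=10.0))
```

Under these assumptions, a larger $\tau$ lets the state retain the effect of a spike over more steps, while $P_c$ sets how strongly each new spike moves the state toward $G_{max}$. This is one way to picture how the two parameters trade off short-range against long-range temporal memory.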
Abstract
Event-based cameras are inspired by the sparse and asynchronous spike representation of the biological visual system. However, processing the event data requires either expensive feature descriptors that transform spikes into frames, or spiking neural networks that are expensive to train. In this work, we propose a neural network architecture, Reservoir Nodes-enabled neuromorphic vision sensing Network (RN-Net), based on simple convolution layers integrated with dynamic temporal encoding reservoirs for local and global spatiotemporal feature detection with low hardware and training costs. RN-Net allows efficient processing of asynchronous temporal features. It achieves 99.2% accuracy on DVS128 Gesture, the highest reported to date, and 67.5% on the DVS Lip dataset, one of the highest reported accuracies, at a much smaller network size. By leveraging the internal device and circuit dynamics, asynchronous temporal feature encoding can be implemented at very low hardware cost, without preprocessing or dedicated memory and arithmetic units. The use of simple DNN blocks and standard backpropagation-based training rules further reduces implementation costs.
