SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

Zhuoheng Gao, Yihao Li, Jiyao Zhang, Rui Zhao, Tong Wu, Hao Tang, Zhaofei Yu, Hao Dong, Guozhang Chen, Tiejun Huang

TL;DR

SpikeStereoNet estimates stereo depth from asynchronous spike streams using a brain-inspired iterative refinement framework built on a recurrent spiking neural network (RSNN). It fuses multi-scale spike features, builds a correlation pyramid, and uses adaptive leaky integrate-and-fire (ALIF) neurons to iteratively improve disparity estimates, supervised by a composite loss that balances accuracy, firing rate, and membrane dynamics. The authors provide a large synthetic spike dataset and a real spike dataset, report state-of-the-art performance and robust data efficiency, and show effective domain adaptation from synthetic to real spike data. The work advances neuromorphic stereo vision by enabling direct, high-temporal-resolution depth sensing from spike streams, and it offers benchmarks to accelerate future research.
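To make the role of the adaptive neurons concrete, here is a minimal NumPy sketch of one common ALIF formulation: a leaky membrane potential, a threshold that rises after each spike, and a soft reset. The function names (`alif_step`, `run_alif`), the parameter values, and the exact update equations are illustrative assumptions and are not taken from the paper, whose neuron model may differ.

```python
import numpy as np

def alif_step(v, a, current, alpha=0.9, rho=0.95, base_threshold=1.0, beta=0.5):
    """Advance a layer of adaptive LIF neurons by one time step.

    All parameter values are illustrative assumptions, not the paper's settings.
    """
    threshold = base_threshold + beta * a      # threshold rises with recent firing
    v = alpha * v + current                    # leaky integration of the input
    spikes = (v >= threshold).astype(v.dtype)  # spike where v reaches the threshold
    v = v - spikes * threshold                 # soft reset: subtract the threshold
    a = rho * a + spikes                       # adaptation decays, jumps on a spike
    return v, a, spikes

def run_alif(currents, **params):
    """Run the neurons over a (time, neurons) array of input currents."""
    v = np.zeros_like(currents[0])
    a = np.zeros_like(currents[0])
    outputs = []
    for current in currents:
        v, a, spikes = alif_step(v, a, current, **params)
        outputs.append(spikes)
    return np.stack(outputs)

# Constant drive: adaptation (beta > 0) lowers the firing rate over time.
constant_input = np.full((50, 1), 0.5)
plain = run_alif(constant_input, beta=0.0)     # ordinary LIF behaviour
adaptive = run_alif(constant_input, beta=0.5)  # ALIF behaviour
print(int(plain.sum()), int(adaptive.sum()))
```

Under the same constant input, the adaptive run emits fewer spikes than the non-adaptive one, which is the property that a firing-rate term in a loss can trade off against accuracy.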

Abstract

Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.

Paper Structure

This paper contains 34 sections, 19 equations, 13 figures, and 9 tables.

Figures (13)

  • Figure 1: The overall pipeline. The upper path shows training and evaluation on the synthetic dataset, while the lower path shows transfer learning and testing on real spike stream data using the pre-trained model.
  • Figure 2: The pipeline of the proposed solution. (a) and (b) The components of the biological visual system, each corresponding to a specific module in the computational framework. (c) The overview of SpikeStereoNet: Multi-scale spike features are first extracted to construct a correlation pyramid, followed by a biologically inspired RSNN-based update operator that iteratively refines disparity using local cost volumes and contextual cues. The final disparity map is upsampled to produce high-resolution depth estimates.
  • Figure 3: Illustration of the detailed structure of spike feature extraction and the RSNN-based update module. (a) Spike feature extraction: It comprises one context network and two feature networks, which extract multi-scale correlation features, contextual features, and the initial hidden state from the spike streams. A single network structure is illustrated in the diagram. (b) RSNN-based update block: Local correlations and disparity fields are used to generate motion features, which update the RSNN hidden states through recurrent and feedforward connections. The RSNN at the highest resolution is responsible for refining the disparity estimates. (c) Descriptions of key modules.
  • Figure 4: From left to right: synthetic scene images from the left view, ground-truth depth, depth prediction results from existing stereo methods and our method.
  • Figure 5: Visual results of our method and competing approaches on the real dataset. "Scene" refers to the gamma-transformed temporal average of the spike streams. "Kinect" refers to the raw depth captured by the depth camera.
  • ...and 8 more figures
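
The correlation pyramid named in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration that assumes a RAFT-Stereo-style construction: dot-product correlation between left and right features along each image row, followed by repeated average pooling over the right-image axis. The function name `correlation_pyramid`, the tensor layout, and the number of levels are assumptions for illustration; the paper's exact construction may differ.

```python
import numpy as np

def correlation_pyramid(feat_left, feat_right, num_levels=4):
    """Build a row-wise correlation volume and pool it into a pyramid.

    feat_left, feat_right: feature maps of shape (C, H, W) (assumed layout).
    Returns a list of arrays of shape (H, W, W / 2**level).
    """
    channels = feat_left.shape[0]
    # corr[h, w, v] = <left[:, h, w], right[:, h, v]> / sqrt(C)
    corr = np.einsum("chw,chv->hwv", feat_left, feat_right) / np.sqrt(channels)
    pyramid = [corr]
    for _ in range(num_levels - 1):
        prev = pyramid[-1]
        half = prev.shape[-1] // 2
        # Average adjacent pairs along the right-image (disparity search) axis.
        pooled = prev[..., : 2 * half].reshape(*prev.shape[:-1], half, 2).mean(axis=-1)
        pyramid.append(pooled)
    return pyramid

# Random stand-ins for the extracted spike features.
rng = np.random.default_rng(0)
left = rng.standard_normal((8, 4, 16))
right = rng.standard_normal((8, 4, 16))
pyramid = correlation_pyramid(left, right, num_levels=4)
print([level.shape for level in pyramid])
```

Coarser levels cover a wider range of disparities per lookup, so an iterative update operator can sample local cost values around its current estimate at several scales.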