DERD-Net: Learning Depth from Event-based Ray Densities

Diego Hitzges, Suman Ghosh, Guillermo Gallego

TL;DR

DERD-Net estimates depth from event cameras by back-projecting events into a Disparity Space Image (DSI), a $D \times W \times H$ grid of ray-intersection counts. A compact network combining 3D convolutions and a GRU processes fixed-size Sub-DSIs, which enables monocular and stereo depth prediction for selected pixels, with ensemble averaging to reduce variance. An adaptive Gaussian threshold selects reliable pixels. The model has a very small memory footprint (~70k parameters), runs in real time, and yields substantially denser depth maps than prior methods. On the MVSEC and DSEC benchmarks, DERD-Net delivers state-of-the-art accuracy and is robust to pose noise, which makes it a candidate for scalable, SLAM-friendly event-based depth estimation.
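As a rough illustration of the pixel selection and Sub-DSI cropping described above, the following NumPy sketch thresholds a per-pixel confidence map against its Gaussian-weighted local mean and crops a fixed-size neighborhood around each selected pixel. The `(D, H, W)` array layout, the use of the maximum ray count along depth as confidence, the `sigma`, `offset`, and `radius` values, and all function names are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with reflect padding (NumPy only)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img.astype(np.float64), radius, mode="reflect")
    # Filter along rows, then along columns; "valid" undoes the padding.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def select_pixels(dsi, sigma=2.0, offset=1.0):
    """Adaptive Gaussian threshold on a confidence map (assumed form).

    A pixel is kept when its confidence exceeds the Gaussian-weighted
    local mean by more than `offset` (hypothetical parameter).
    """
    confidence = dsi.max(axis=0)  # (H, W): strongest ray count along depth
    return confidence > gaussian_blur(confidence, sigma) + offset


def extract_sub_dsis(dsi, mask, radius=3):
    """Crop a (D, 2r+1, 2r+1) Sub-DSI around every selected pixel."""
    size = 2 * radius + 1
    # Zero-pad spatially so that border pixels also get full-size crops.
    padded = np.pad(dsi, ((0, 0), (radius, radius), (radius, radius)))
    ys, xs = np.nonzero(mask)
    subs = [padded[:, y:y + size, x:x + size] for y, x in zip(ys, xs)]
    if not subs:
        return np.empty((0, dsi.shape[0], size, size), dtype=dsi.dtype), ys, xs
    return np.stack(subs), ys, xs
```

Because every crop has the same shape regardless of sensor resolution, the resulting batch of Sub-DSIs can be fed to a small network in parallel, which is consistent with the constant model size mentioned above.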

Abstract

Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs, combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: https://github.com/tub-rip/DERD-Net
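To make the idea of a ray-density DSI more concrete, here is a minimal NumPy sketch of voting by back-projection: each event pixel defines a ray through its camera center, the ray is intersected with a set of fronto-parallel depth planes of a reference view, and each intersection adds one vote to the nearest voxel. The pinhole model, the pose convention (`X_ref = R @ X_cam + t`), nearest-neighbor voting, the `(D, H, W)` layout, and the name `build_dsi` are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np


def build_dsi(events_uv, poses, K, depths, height, width):
    """Accumulate ray votes on depth planes of a reference view.

    events_uv: list of (N_k, 2) pixel coordinates (u, v), one array per pose
    poses:     list of (R, t) with X_ref = R @ X_cam + t (assumed convention)
    K:         3x3 pinhole intrinsics shared by all views (assumption)
    depths:    positive depths of the planes in the reference frame
    """
    depths = np.asarray(depths, dtype=np.float64)
    dsi = np.zeros((len(depths), height, width), dtype=np.float32)
    K_inv = np.linalg.inv(K)
    plane_idx = np.arange(len(depths))[:, None]

    for uv, (R, t) in zip(events_uv, poses):
        uv = np.asarray(uv, dtype=np.float64).reshape(-1, 2)
        t = np.asarray(t, dtype=np.float64)
        pix = np.column_stack([uv, np.ones(len(uv))])        # (N, 3)
        dirs = (R @ (K_inv @ pix.T)).T                       # ray directions, ref frame

        with np.errstate(divide="ignore", invalid="ignore"):
            # Ray parameter at which each ray reaches each depth plane: (D, N)
            lam = (depths[:, None] - t[2]) / dirs[None, :, 2]
            pts = t + lam[..., None] * dirs[None, :, :]      # (D, N, 3)
            proj = pts @ K.T
            u = proj[..., 0] / depths[:, None]
            v = proj[..., 1] / depths[:, None]
            # Keep intersections in front of the camera and inside the image.
            ok = (np.isfinite(lam) & (lam > 0)
                  & (u > -0.5) & (u < width - 0.5)
                  & (v > -0.5) & (v < height - 0.5))

        d_i = np.broadcast_to(plane_idx, ok.shape)[ok]
        u_i = np.rint(u[ok]).astype(np.int64)
        v_i = np.rint(v[ok]).astype(np.int64)
        np.add.at(dsi, (d_i, v_i, u_i), 1.0)                 # one vote per crossing
    return dsi
```

Rays from different viewpoints that observe the same 3D edge cross near its true location, so voxels with high counts indicate likely scene structure. Stereo setups can contribute rays from both cameras to the same volume under this scheme.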

Paper Structure

This paper contains 21 sections, 4 equations, 3 figures, 15 tables.

Figures (3)

  • Figure 1: Overview. We present a deep-learning-based method to predict depth from event-ray densities (Disparity Space Images, DSIs) obtained by back-projecting events using camera poses. Our deep neural network, DERD-Net, operates in parallel on local volumetric neighborhoods of the DSI data, called Sub-DSIs (in orange).
  • Figure 2: Network Architecture. The parameters of the network's modules are specified in Table \ref{tab:nn_dimensions}.
  • Figure 3: Depth estimation. Qualitative comparison of depth estimated using the MC-EMVS method [Ghosh22aisy], MC-EMVS applied to the newly selected pixels $F_\text{denser}$, and our method DERD-Net, for the MVSEC indoor_flying [Zhu18ral] (top 3 rows) and DSEC Zurich_City_04_a (bottom row) sequences. Ground-truth depth from LiDAR is masked to pixels with a valid depth estimate. Our method estimates depth even at pixels with no GT depth. Depth maps are pseudo-colored, from blue (close) to red (far), in the range 1-6.5 m for MVSEC and 4-50 m for DSEC.