Learning Monocular Depth from Events via Egomotion Compensation

Haitao Meng, Chonghao Zhong, Sheng Tang, Lian JunJia, Wenwei Lin, Zhenshan Bing, Yi Chang, Gang Chen, Alois Knoll

TL;DR

This work tackles monocular depth estimation from event cameras with a physics-informed framework that scores depth hypotheses by how well they compensate for egomotion. Its two key components are Focus Cost Discrimination (FCD), which measures how sharply edges are focused using gradient-based features, and Inter-Hypotheses Cost Aggregation (IHCA), which refines the depth costs via cost trend prediction and multi-scale consistency. By modeling a depth-dependent motion warp $\mathcal{M}(d)$ and forming the Image of Warped Events (IWE) $I(\mathcal{M}(d))$, the method produces metric-scale depth without relying on scale-ambiguous supervision. Experiments on MVSEC and EventCitySim show state-of-the-art or competitive performance and robustness to velocity noise, highlighting the practical value of combining physical motion priors with learned cost aggregation for event-based depth estimation.
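The idea of scoring a depth hypothesis by the sharpness of $I(\mathcal{M}(d))$ can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a pinhole camera with known focal length and principal point, purely translational egomotion with known velocity, a single depth shared by all events, and it swaps the learned FCD module for plain image variance, a classical contrast measure. All function and parameter names (`warp_events`, `image_of_warped_events`, `focus_score`, `sweep_depth`) are hypothetical.

```python
import numpy as np

def warp_events(xy, t, depth, v, f, c, t_ref=0.0):
    """Warp event pixels to time t_ref, assuming every event lies at `depth`.

    Assumption: translational motion field only (rotation omitted).
    xy: (N, 2) pixel coordinates, t: (N,) timestamps in seconds,
    v: (3,) camera velocity in m/s, f: focal length in pixels, c: principal point.
    """
    x_n = (xy - c) / f                           # normalized image coordinates
    flow = f * (-v[:2] + x_n * v[2]) / depth     # pixels per second
    return xy - (t - t_ref)[:, None] * flow

def image_of_warped_events(xy_warped, shape):
    """Accumulate warped events into a count image (nearest-pixel voting)."""
    h, w = shape
    cols = np.rint(xy_warped[:, 0]).astype(int)
    rows = np.rint(xy_warped[:, 1]).astype(int)
    ok = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    iwe = np.zeros(shape)
    np.add.at(iwe, (rows[ok], cols[ok]), 1.0)    # handles repeated indices
    return iwe

def focus_score(iwe):
    """Stand-in focus measure: variance of the IWE (higher means sharper)."""
    return iwe.var()

def sweep_depth(xy, t, hypotheses, v, f, c, shape):
    """Score every depth hypothesis and return the best one with all scores."""
    scores = np.array([
        focus_score(image_of_warped_events(warp_events(xy, t, d, v, f, c), shape))
        for d in hypotheses
    ])
    return hypotheses[int(np.argmax(scores))], scores

# Synthetic scene: points on a fronto-parallel plane, camera moving sideways.
rng = np.random.default_rng(0)
f, c, shape = 200.0, np.array([64.0, 64.0]), (128, 128)
v = np.array([0.5, 0.0, 0.0])                    # m/s, lateral translation
true_depth = 4.0                                 # metres
n_pts, n_ev = 40, 50
p0 = np.repeat(rng.uniform(30, 100, size=(n_pts, 2)), n_ev, axis=0)
t = np.tile(np.linspace(0.0, 0.2, n_ev), n_pts)
true_flow = f * (-v[:2] + (p0 - c) / f * v[2]) / true_depth
xy = p0 + t[:, None] * true_flow                 # event positions over time

hypotheses = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 12.0])
best_depth, scores = sweep_depth(xy, t, hypotheses, v, f, c, shape)
print("best depth hypothesis:", best_depth)
```

Under the correct hypothesis the events of each scene point collapse onto one pixel, so the IWE is sharp and its variance peaks; wrong hypotheses leave residual motion blur. Because the velocity is given in metres per second, the selected depth is metric. A per-pixel version would evaluate the score in local windows rather than over the whole image.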

Abstract

Event cameras are neuromorphically inspired sensors that sparsely and asynchronously report brightness changes. Their high temporal resolution, high dynamic range, and low power consumption make them well suited to challenging conditions in monocular depth estimation (e.g., high-speed motion or low light). However, existing methods primarily process event streams with black-box learning systems that incorporate no prior physical principles; they thus become over-parameterized and fail to fully exploit the rich temporal information inherent in event camera data. To address this limitation, we incorporate physical motion principles into an interpretable monocular depth estimation framework, in which the likelihood of each depth hypothesis is explicitly determined by the effect of motion compensation. To achieve this, we propose a Focus Cost Discrimination (FCD) module that measures the clarity of edges as an essential indicator of focus level and integrates spatial surroundings to facilitate cost estimation. Furthermore, we analyze the noise patterns within our framework and improve it with the newly introduced Inter-Hypotheses Cost Aggregation (IHCA) module, in which the cost volume is refined through cost trend prediction and multi-scale cost consistency constraints. Extensive experiments on real-world and synthetic datasets demonstrate that our proposed framework outperforms cutting-edge methods by up to 10% in terms of the absolute relative error metric, demonstrating superior prediction accuracy.
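For reference, the absolute relative error quoted above is commonly computed as the mean of $|d_{\text{pred}} - d_{\text{gt}}| / d_{\text{gt}}$ over pixels with valid ground truth. The sketch below follows that common definition; the validity mask (finite ground truth above `min_depth`, optionally capped at `max_depth`) and the function name `abs_rel_error` are assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def abs_rel_error(pred, gt, min_depth=1e-3, max_depth=None):
    """Mean absolute relative depth error over valid ground-truth pixels.

    Assumption: a pixel is valid when its ground truth is finite and above
    `min_depth` (and at most `max_depth`, if given).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = np.isfinite(gt) & (gt > min_depth)
    if max_depth is not None:
        valid &= gt <= max_depth
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Two valid pixels (10% error each); the zero and NaN entries are masked out.
gt = np.array([2.0, 4.0, 0.0, np.nan])
pred = np.array([2.2, 3.6, 1.0, 5.0])
example_error = abs_rel_error(pred, gt)
print(f"abs rel: {example_error:.3f}")
```

Because the error is normalized by the ground-truth depth, a fixed metric error counts more for near objects than for distant ones.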

Paper Structure

This paper contains 12 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Examples of the Image of Warped Events (IWE) for different motion models under different depth hypotheses $d$, together with a comparison of the original, ambiguous event image, the event image warped using our depth prediction, the ground truth, and our depth prediction. The red dashed boxes mark the portions of the IWE that are in focus (i.e., warped with the correct depth hypothesis). The green dashed box highlights erroneously focused events caused by the repetitive texture of the tree. Best viewed in color.
  • Figure 2: The overview of our proposed monocular event depth estimation framework.
  • Figure 3: Representative examples of our model compared with other state-of-the-art methods. The first and third rows show the APS image, the event data (rendered as a colored event image), and the ground-truth depth map. The second and fourth rows show the depth estimates of APS+IEBins [shao2023IEBins], E2D [learningdense], EReFormer [tian], and ours, respectively.