SRFNet: Monocular Depth Estimation with Fine-grained Structure via Spatial Reliability-oriented Fusion of Frames and Events

Tianbo Pan, Zidong Cao, Lin Wang

TL;DR

This work tackles monocular depth estimation under challenging lighting by fusing frame and event data with spatial reliability. SRFNet introduces an Attention-based Interactive Fusion (AIF) module to learn consensus regions and guide inter-modal fusion, and a Reliability-oriented Depth Refinement (RDR) module that leverages temporal cues and NLSPN-based refinement to produce dense, sharp depth maps. The approach outperforms frame-only, event-only, and existing frame-event fusion methods, especially in night scenes, and does so without pretraining on synthetic data. The results on MVSEC and DENSE validate the method's robustness and generalization, with potential applicability to broader multi-modal perception tasks.

Abstract

Monocular depth estimation is a crucial task that measures distance relative to a camera, which is important for applications such as robot navigation and self-driving. Traditional frame-based methods suffer from performance drops due to limited dynamic range and motion blur. Therefore, recent works leverage novel event cameras to complement or guide the frame modality via frame-event feature fusion. However, event streams exhibit spatial sparsity, leaving some areas unperceived, especially in regions with marginal light changes. As a result, direct fusion methods, e.g., RAMNet, often ignore the contribution of the most confident regions of each modality. This leads to structural ambiguity in the modality fusion process, thus degrading the depth estimation performance. In this paper, we propose a novel Spatial Reliability-oriented Fusion Network (SRFNet) that can estimate depth with fine-grained structure in both daytime and nighttime scenes. Our method consists of two key technical components. Firstly, we propose an attention-based interactive fusion (AIF) module that applies spatial priors of events and frames as the initial masks and learns the consensus regions to guide the inter-modal feature fusion. The fused features are then fed back to enhance the frame and event feature learning. Meanwhile, the module uses an output head to generate a fused mask, which is iteratively updated for learning consensual spatial priors. Secondly, we propose the Reliability-oriented Depth Refinement (RDR) module to estimate dense depth with fine-grained structure based on the fused features and masks. We evaluate our method on synthetic and real-world datasets; the results show that, even without pretraining, our method outperforms prior methods, e.g., RAMNet, especially in night scenes. Our project homepage: https://vlislab22.github.io/SRFNet.
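
To make the idea of mask-guided fusion concrete, the following is a minimal sketch, not the AIF module itself. The abstract does not give the fusion formula, so everything here is an assumption made for illustration: the function name `fuse_with_masks`, the per-pixel weighted average, the product used as a "consensus" score, and the fallback to a plain average where neither mask is active. The actual module is attention-based and learned; this only shows how per-modality reliability masks could weight two feature maps.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a single-channel H x W map stored as nested lists


def fuse_with_masks(
    frame_feat: Grid,
    event_feat: Grid,
    frame_mask: Grid,
    event_mask: Grid,
    eps: float = 1e-6,
) -> Tuple[Grid, Grid]:
    """Hypothetical reliability-weighted fusion (illustrative, not the paper's AIF).

    Masks hold values in [0, 1]; a higher value means the modality is
    assumed to be more reliable at that pixel.
    """
    fused: Grid = []
    consensus: Grid = []
    rows = zip(frame_feat, event_feat, frame_mask, event_mask)
    for f_row, e_row, mf_row, me_row in rows:
        fused_row: List[float] = []
        cons_row: List[float] = []
        for f, e, mf, me in zip(f_row, e_row, mf_row, me_row):
            total = mf + me
            if total < eps:
                # Neither modality is trusted here: fall back to a plain average.
                fused_row.append(0.5 * (f + e))
            else:
                # Weight each modality by its (assumed) reliability.
                fused_row.append((mf * f + me * e) / total)
            # Assumed consensus score: high only where both masks are high.
            cons_row.append(mf * me)
        fused.append(fused_row)
        consensus.append(cons_row)
    return fused, consensus


# Usage: pixel 0 is trusted only in the frame, pixel 1 in neither modality,
# pixel 2 in both.
fused, consensus = fuse_with_masks(
    frame_feat=[[1.0, 2.0, 2.0]],
    event_feat=[[3.0, 4.0, 6.0]],
    frame_mask=[[1.0, 0.0, 1.0]],
    event_mask=[[0.0, 0.0, 1.0]],
)
```

In this toy example the first pixel takes the frame value, the second falls back to the average of both inputs, and the third averages the two modalities while being the only pixel with a non-zero consensus score.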

Paper Structure (11 sections, 7 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: (a) and (c) are the input intensity frame and events, respectively; (b) is our result while (d) is the result of RAMNet (Gehrig et al., 2021). Within the red rectangle, we zoom in on a challenging scenario where a car and tree in the frame are exposed under poor lighting and show discontinuous edge information in the events. Our SRFNet excels in predicting dense depth with fine-grained structural details compared with RAMNet.
  • Figure 2: Illustration of the initialization of modal-specific masks, $M_{i}$ for the frame and $M_{e}$ for the events.
  • Figure 3: (a) is the overview of our proposed framework, and (b) depicts the details of the AIF module.
  • Figure 4: Qualitative comparisons with different methods on the MVSEC dataset. (c) RAMNet is trained purely on MVSEC; (d) RAMNet [S] denotes RAMNet pre-trained on a synthetic dataset. The yellow bounding box indicates the region of significant contrast.
  • Figure 5: Qualitative results of ablation studies of SRFNet on the MVSEC dataset. (c) denotes the baseline; (d) is our SRFNet without the RDR module; (e) is the complete SRFNet.
  • ...and 1 more figure