Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Jialei Xu, Xianming Liu, Junjun Jiang, Kui Jiang, Rui Li, Kai Cheng, Xiangyang Ji

TL;DR

The paper tackles robust monocular depth estimation in challenging conditions by fusing RGB and thermal (THR) data. It introduces independent coarse-depth estimators for each modality, a confidence predictor that identifies reliable depth cues, and a confidence-guided fusion network with 3D cross-modal alignment that produces the final depth. Key contributions include the first explicit RGB-THR monocular depth fusion, a confidence loss to supervise cue selection, and an end-to-end fusion framework that achieves state-of-the-art results on MS$^2$ and ViViD++. The work advances reliable 3D perception for scenarios with poor lighting or adverse weather, with potential impact on autonomous systems and robotics.

Abstract

Monocular depth estimation from RGB images plays a pivotal role in 3D vision. However, its accuracy can deteriorate in challenging environments such as nighttime or adverse weather conditions. While long-wave infrared cameras offer stable imaging in such conditions, they are inherently low-resolution and lack the rich texture and semantics delivered by RGB images. Current methods focus solely on a single modality because of the difficulty of identifying and integrating faithful depth cues from both sources. To address these issues, this paper presents a novel approach that identifies and integrates dominant cross-modality depth features with a learning-based framework. Concretely, we independently compute coarse depth maps with separate networks, fully utilizing the individual depth cues from each modality. Because the advantageous depth cues are spread across both modalities, we propose a novel confidence loss that steers a confidence predictor network to yield a confidence map specifying latent potential depth areas. With the resulting confidence map, we propose a multi-modal fusion network that fuses the final depth in an end-to-end manner. With the proposed pipeline, our method achieves robust depth estimation in a variety of difficult scenarios. Experimental results on the challenging MS$^2$ and ViViD++ datasets demonstrate the effectiveness and robustness of our method.
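
To make the role of per-pixel confidence concrete, here is a minimal sketch under stated assumptions. It is not the paper's method: the paper trains a fusion network, whereas this toy replaces it with a confidence-weighted average of two coarse depth maps that are assumed to be already aligned to the same view. The function name `fuse_by_confidence`, the use of one confidence map per modality, and the nested-list representation are illustrative assumptions.

```python
from typing import List

DepthMap = List[List[float]]  # hypothetical representation: rows of per-pixel values


def fuse_by_confidence(
    depth_rgb: DepthMap,
    depth_thr: DepthMap,
    conf_rgb: DepthMap,
    conf_thr: DepthMap,
) -> DepthMap:
    """Blend two aligned coarse depth maps with normalized per-pixel confidences.

    Simplified stand-in for a learned fusion network (assumption):
    fused = (c_rgb * d_rgb + c_thr * d_thr) / (c_rgb + c_thr).
    """
    fused: DepthMap = []
    for d_rgb_row, d_thr_row, c_rgb_row, c_thr_row in zip(
        depth_rgb, depth_thr, conf_rgb, conf_thr
    ):
        row: List[float] = []
        for d_rgb, d_thr, c_rgb, c_thr in zip(
            d_rgb_row, d_thr_row, c_rgb_row, c_thr_row
        ):
            total = c_rgb + c_thr
            if total <= 0.0:
                # Neither modality is trusted: fall back to the plain mean.
                row.append(0.5 * (d_rgb + d_thr))
            else:
                row.append((c_rgb * d_rgb + c_thr * d_thr) / total)
        fused.append(row)
    return fused
```

Where the RGB confidence is 1 and the thermal confidence is 0, the output equals the RGB depth, and vice versa; intermediate confidences interpolate between the two estimates.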

Paper Structure

This paper contains 18 sections, 10 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: A comparison of depth prediction results before and after fusing thermal (THR) images. Our method combines depth cues from cross-perspective and cross-modality inputs, yielding high-quality depth in challenging scenes.
  • Figure 2: Overall pipeline of our proposed depth estimation via cross-modal fusion of RGB and thermal (THR) images. We first employ two distinct depth networks to estimate the coarse depth map of each modality. A 3D cross-modal transformation then aligns the information from the two modalities in the same perspective. The confidence maps (i.e., $\mathbf{C}_{RGB}$ and $\mathbf{C}_{THR}$) computed by the confidence predictor network identify which modality's coarse depth more faithfully reflects the 3D scene. Guided by the confidence maps, the fusion network combines the advantages of both modalities to refine the coarse depth into the final depth prediction.
  • Figure 3: Error maps for depth predicted from RGB and thermal images, respectively. (a) RGB (b) THR (c) Error map of the depth prediction for the RGB image (d) Error map of the depth prediction for the thermal image. The error map represents the relative error between the predicted depth and the ground truth. Error values are color-coded and overlaid on the RGB images.
  • Figure 4: Qualitative results of our method on the MS$^2$ dataset (Shin et al., 2023). (a) RGB (b) THR (c) Predicted depth from RGB image only (d) Predicted depth by our method.
  • Figure 5: Visualization of the RGB depth error map and the corresponding confidence map. The error map is computed between the ground-truth and predicted depth maps. The predicted confidence map reflects the region in which each modality is dominant.
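
Two operations mentioned in the captions above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation. It assumes a standard pinhole camera model for the 3D cross-modal transformation (Figure 2) and the common absolute relative error |pred - gt| / gt for the error maps (Figures 3 and 5). The function names, the `(fx, fy, cx, cy)` intrinsics layout, and the choice of source and destination camera are hypothetical; the captions do not state which modality is warped into which view.

```python
from typing import List, Optional, Sequence, Tuple

Intrinsics = Tuple[float, float, float, float]  # assumed layout: (fx, fy, cx, cy)


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    src_k: Intrinsics,
    dst_k: Intrinsics,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Optional[Tuple[float, float, float]]:
    """Move one source pixel with known depth into the destination camera.

    rotation (3x3) and translation (3,) map source-camera coordinates to
    destination-camera coordinates. Returns (u', v', depth') or None when
    the point lies behind the destination camera.
    """
    fx, fy, cx, cy = src_k
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform into the destination camera frame.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0.0:
        return None
    # Project onto the destination image plane.
    fx2, fy2, cx2, cy2 = dst_k
    return (
        fx2 * moved[0] / moved[2] + cx2,
        fy2 * moved[1] / moved[2] + cy2,
        moved[2],
    )


def relative_error_map(
    pred: List[List[float]], gt: List[List[float]]
) -> List[List[Optional[float]]]:
    """Per-pixel |pred - gt| / gt; None where no valid ground truth (gt <= 0)."""
    return [
        [abs(p - g) / g if g > 0.0 else None for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(pred, gt)
    ]
```

Applying `reproject_pixel` to every pixel of one modality's coarse depth map yields a depth map in the other camera's view, which a confidence predictor or fusion stage could then compare pixel by pixel. A full implementation would also need to handle occlusions and the holes left by forward warping, which this sketch ignores.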