Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios
Jialei Xu, Xianming Liu, Junjun Jiang, Kui Jiang, Rui Li, Kai Cheng, Xiangyang Ji
TL;DR
The paper tackles robust monocular depth estimation in challenging conditions by fusing RGB and thermal (THR) data. It introduces independent coarse-depth estimators for each modality, a confidence predictor that identifies reliable depth cues, and a confidence-guided fusion network with 3D cross-modal alignment that produces the final depth. Key contributions include the first explicit RGB–THR fusion for monocular depth estimation, a confidence loss that supervises cue selection, and an end-to-end fusion framework that achieves state-of-the-art results on MS$^2$ and ViViD++. The work advances reliable 3D perception in poor lighting and adverse weather, with potential impact on autonomous systems and robotics.
Abstract
Monocular depth estimation from RGB images plays a pivotal role in 3D vision. However, its accuracy can deteriorate in challenging environments such as nighttime or adverse weather conditions. Long-wave infrared cameras offer stable imaging in such conditions, but they are inherently low-resolution and lack the rich texture and semantics that RGB images provide. Current methods focus on a single modality because it is difficult to identify and integrate reliable depth cues from both sources. To address these issues, this paper presents a learning-based framework that identifies and integrates the dominant depth features across modalities. Concretely, we compute coarse depth maps independently with separate networks, fully exploiting the depth cues of each modality. Because the more reliable depth estimates are distributed across both modalities, we propose a novel confidence loss that steers a confidence predictor network to produce a confidence map indicating which regions carry reliable depth. Guided by this confidence map, a multi-modal fusion network produces the final depth in an end-to-end manner. With this pipeline, our method estimates depth robustly in a variety of difficult scenarios. Experimental results on the challenging MS$^2$ and ViViD++ datasets demonstrate the effectiveness and robustness of our method.
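
As a rough illustration of the idea of confidence-guided fusion, the sketch below blends two coarse depth maps per pixel. It is a simplified stand-in and not the paper's method: the abstract does not specify the confidence loss or the fusion network, so the binary "closer to ground truth" target and the convex blend are assumptions, and the names `confidence_target` and `fuse_depths` are hypothetical.

```python
import numpy as np


def confidence_target(depth_rgb, depth_thr, depth_gt):
    """Assumed supervision target: 1 where the RGB coarse depth is at least
    as close to the ground truth as the thermal one, 0 elsewhere."""
    err_rgb = np.abs(depth_rgb - depth_gt)
    err_thr = np.abs(depth_thr - depth_gt)
    return (err_rgb <= err_thr).astype(np.float32)


def fuse_depths(depth_rgb, depth_thr, confidence):
    """Assumed fusion rule: per-pixel convex blend of the two coarse depths.
    The paper uses a learned fusion network instead of this fixed formula."""
    c = np.clip(confidence, 0.0, 1.0)
    return c * depth_rgb + (1.0 - c) * depth_thr


# Toy 2x2 depth maps in metres (made-up values, for illustration only).
depth_gt = np.array([[2.0, 5.0], [10.0, 20.0]], dtype=np.float32)
depth_rgb = np.array([[2.1, 7.0], [10.5, 28.0]], dtype=np.float32)
depth_thr = np.array([[2.6, 5.2], [12.0, 21.0]], dtype=np.float32)

conf = confidence_target(depth_rgb, depth_thr, depth_gt)
fused = fuse_depths(depth_rgb, depth_thr, conf)

# Mean absolute error of each coarse map and of the fused map.
mae_rgb = float(np.abs(depth_rgb - depth_gt).mean())
mae_thr = float(np.abs(depth_thr - depth_gt).mean())
mae_fused = float(np.abs(fused - depth_gt).mean())
```

In this toy case the confidence is derived from the ground truth, so the fused map can never have a larger error than either input. At inference time no ground truth is available, which is why a confidence predictor has to be learned from the images themselves.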
