Bridging Spectral-wise and Multi-spectral Depth Estimation via Geometry-guided Contrastive Learning

Ukcheol Shin, Kyunghyun Lee, Jean Oh

TL;DR

This work tackles robust monocular depth estimation across multiple spectral modalities by bridging spectral-wise and multi-spectral depth estimation through geometry-guided contrastive learning. It introduces Align-and-Fuse, a two-stage framework. In the align stage, the network learns spectral-shared representations via a global contrastive loss $L_{g}$ and a dense local contrastive loss $L_{l}$, both grounded in geometric cues. In the fuse stage, the base network is frozen and an attachable fusion module, built from an attention mechanism and a Swin Transformer, is trained to combine multi-spectral features robustly. The method generalizes across RGB, NIR, and thermal inputs and improves multi-spectral depth accuracy, especially under challenging conditions such as low light and rain, as demonstrated on the MS$^2$ dataset with MiDaS-v2.1 and NeWCRF backbones. The approach offers a memory-efficient, flexible way to deploy depth estimation systems that maintain performance when new spectral sensors are integrated, which makes it directly applicable to autonomous driving with diverse sensing modalities.

Abstract

Deploying depth estimation networks in the real world requires a high level of robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems that include an RGB camera, NIR camera, thermal camera, LiDAR, or radar. They mainly adopt one of two strategies for using multiple sensors: modality-wise inference or multi-modal fused inference. The former is flexible but memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide high reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named the align-and-fuse strategy, for depth estimation from multi-spectral images. In the align stage, we align the embedding spaces of multiple spectral bands to learn a representation that is shareable across multi-spectral images, by minimizing a contrastive loss over global features and spatially aligned local features guided by geometric cues. In the fuse stage, we then train an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust predictions. With the proposed method, a single depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.
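
The align stage described above can be illustrated with a small sketch of a global contrastive loss between two spectra. This is not the paper's exact $L_{g}$: it assumes the common InfoNCE form, in which the embeddings of the same scene in two spectra (for example RGB and thermal) are pulled together and the embeddings of different scenes are pushed apart. The function name `info_nce`, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed form, not the paper's exact loss).

    anchors[i] and positives[i] are embeddings of the same scene captured
    in two different spectra; every other positives[j] acts as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [dot(ai, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching scene as the target class.
        total += log_denom - logits[i]
    return total / len(a)


# Toy global embeddings for two scenes in two spectra (hypothetical values).
rgb = [[1.0, 0.0], [0.0, 1.0]]
thermal_aligned = [[0.9, 0.1], [0.1, 0.9]]
thermal_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(rgb, thermal_aligned)
loss_swapped = info_nce(rgb, thermal_swapped)
```

In this toy batch, `loss_aligned` is smaller than `loss_swapped`, because the aligned thermal embeddings point in the same direction as their RGB counterparts. A dense local variant could plausibly apply the same function to per-pixel features once pixels have been matched across spectra using geometry, but that matching step is not sketched here.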

Paper Structure

This paper contains 21 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Spectral-wise and multi-spectral fused depth estimation. Our proposed method enables a single network to estimate spectral-wise depth maps from each spectral image. In addition, the proposed attachable fusion module lets the network estimate a reliable and robust depth map under various adverse environments, without degrading its spectral generalization ability and without modifying the original off-the-shelf network architecture.
  • Figure 2: Overall pipeline of our proposed training framework. Our proposed method trains an MDE network with a two-stage learning strategy, named align-and-fuse. In the align stage, the MDE network learns a shared feature representation via geometry-guided contrastive learning by aligning the latent spaces of multi-spectral images. After the first stage, the fuse stage trains an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust predictions.
  • Figure 3: Qualitative comparison of spectral-wise and multi-spectral depth estimation in day, night, and rainy conditions. Our proposed method enables a single network to estimate depth maps from each spectral input. In addition, the fusion module lets the network produce robust and reliable depth estimates despite rain, occlusion, and glare.
  • Figure 4: Qualitative comparison with and without the feature fusion module.
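
The selective aggregation attributed to the feature fusion module in the captions above can be illustrated with a minimal attention-weighted sum over per-spectrum features. This is only a sketch under strong assumptions: the paper's module combines an attention mechanism with a Swin Transformer, whereas the code below uses a single scaled dot-product score per spectrum. The names `fuse_features` and `query`, and the toy feature values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, query):
    """Attention-weighted sum of per-spectrum feature vectors.

    features: dict mapping a spectrum name to its feature vector.
    query: scoring vector (hypothetical; it would be learned in practice).
    Returns the fused vector and the attention weight of each spectrum.
    """
    names = list(features)
    scale = math.sqrt(len(query))
    # One scaled dot-product score per spectrum.
    scores = [
        sum(f * q for f, q in zip(features[name], query)) / scale
        for name in names
    ]
    weights = softmax(scores)
    fused = [
        sum(w * features[name][d] for w, name in zip(weights, names))
        for d in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Toy example: the RGB feature is uninformative (as it might be at night),
# so the thermal feature should receive most of the weight.
features = {"rgb": [0.0, 0.0], "thermal": [2.0, 2.0]}
fused, weights = fuse_features(features, query=[1.0, 1.0])
```

Because only the fusion parameters (here, `query`) would be trained while the per-spectrum features come from a frozen base network, a module of this kind can be attached without changing how the network behaves on single-spectrum input.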