Language-Depth Navigated Thermal and Visible Image Fusion

Jinchang Zhang, Zijun Li, Guoyu Lu

TL;DR

This work proposes a depth-guided, language-aware fusion framework for infrared and visible images that jointly leverages diffusion-based multi-channel feature extraction and depth supervision to enhance 3D reconstruction and robotics perception. It introduces two depth-estimation branches and a depth-driven loss to guide fusion, along with a Depth-Informed Image Captioning Network and CLIP-based semantic guidance to modulate fusion features via language. Empirical results on LLVIP, RoadScene, KAIST, and other datasets demonstrate improved depth-aware fusion quality and robust performance against state-of-the-art methods, particularly in low-light or cluttered environments. The approach shows practical promise for robust scene understanding, navigation, localization, and environmental perception in autonomous and rescue scenarios.

Abstract

Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion mainly focuses on detection tasks, ignoring other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information from fused images not only generates more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operations in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch for extracting multi-channel complementary information through a diffusion model, equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions to guide the diffusion model in extracting multi-channel features and generating fused images. These fused images are then input into the depth estimation branches to calculate depth-driven loss, optimizing the image fusion network. This framework aims to integrate vision-language and depth to directly generate color-fused images from multimodal inputs.
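The depth-driven supervision described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the L1 distance used here, the equal weighting of the two branches, and every name (`depth_driven_loss`, `toy_depth_net`, the reference depth arguments) are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np


def toy_depth_net(image):
    """Hypothetical stand-in for a trained depth estimation branch.

    It simply averages the channels of an (H, W, C) image to get an (H, W) map.
    A real branch would be a learned network.
    """
    return image.mean(axis=-1)


def depth_driven_loss(fused, depth_net_vis, depth_net_ir, ref_depth_vis, ref_depth_ir):
    """Assumed form of a depth-driven loss: sum of per-branch mean absolute errors.

    fused          : (H, W, C) fused image produced by the fusion branch
    depth_net_*    : callables mapping an image to an (H, W) depth map
    ref_depth_*    : (H, W) reference depth maps for each branch
    """
    pred_vis = depth_net_vis(fused)  # depth from the visible-trained branch
    pred_ir = depth_net_ir(fused)    # depth from the infrared-trained branch
    loss_vis = np.mean(np.abs(pred_vis - ref_depth_vis))
    loss_ir = np.mean(np.abs(pred_ir - ref_depth_ir))
    return float(loss_vis + loss_ir)


# Small demonstration on random data (no real images or depth maps involved).
rng = np.random.default_rng(0)
fused_image = rng.random((8, 8, 3))
reference = toy_depth_net(fused_image)
print(depth_driven_loss(fused_image, toy_depth_net, toy_depth_net, reference, reference))
```

In a training loop, a loss of this kind would be backpropagated through the depth branches into the fusion network, so that the fused image is pushed toward content from which depth can be estimated well.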

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the framework. The visible and infrared images are combined into a four-channel input that serves as the ground truth for self-supervised training of the noise prediction network, which extracts multi-channel features. In the forward diffusion process, $I_0$ and $I_t$ denote the multi-channel input and the data at timestep $t$, respectively. $P(\cdot | \cdot)$ and $Q(\cdot | \cdot)$ denote the forward and reverse diffusion processes. The multi-channel fusion loss comprises an intensity loss and a gradient loss. Two depth estimation networks are trained separately on visible and infrared images; the depth maps they produce are fed into the depth information description module, which generates image-text descriptions containing depth information. The CLIP text encoder extracts text features, and an MLP predicts semantic information and parameters that guide the multi-channel feature reconstruction of the fused image. The fused image is then passed through the two depth estimation networks, and the resulting depth maps are compared with the ground truth to compute the depth-driven loss, which optimizes the fusion process.
  • Figure 2: Comparison of depth estimation for visible, infrared, and fused images. For images a and b, the visible, infrared, and fused images are shown in sequence alongside their corresponding depth maps, so that the depth estimation results can be compared visually across modalities.
  • Figure 3: Infrared and visible image fusion experiment on "human" images.
  • Figure 4: Experiments on infrared and visible image fusion and estimated depth on "street" images.
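
The forward diffusion step mentioned in the Figure 1 caption, which maps the four-channel input $I_0$ to a noised sample $I_t$, can be sketched as follows. This is the standard DDPM closed form $I_t = \sqrt{\bar{\alpha}_t}\, I_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The linear noise schedule, the channel layout (three visible RGB channels plus one infrared channel), and all function names are assumptions, since the page does not specify them.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def stack_inputs(visible_rgb, infrared):
    """Concatenate an (H, W, 3) visible image and an (H, W) infrared image.

    The result is an (H, W, 4) array. This channel layout is an assumption.
    """
    return np.concatenate([visible_rgb, infrared[..., None]], axis=-1)


def forward_diffuse(i0, t, alpha_bar, rng):
    """Sample I_t from I_0 in one step and return the noise that was added.

    The returned noise is what a noise prediction network would be trained
    to recover in the usual DDPM setup.
    """
    noise = rng.standard_normal(i0.shape)
    a = alpha_bar[t]
    i_t = np.sqrt(a) * i0 + np.sqrt(1.0 - a) * noise
    return i_t, noise


# Demonstration on random arrays standing in for real image pairs.
rng = np.random.default_rng(0)
visible = rng.random((16, 16, 3))
infrared = rng.random((16, 16))
i0 = stack_inputs(visible, infrared)
alpha_bar = make_alpha_bar(num_steps=1000)
i_t, eps = forward_diffuse(i0, t=500, alpha_bar=alpha_bar, rng=rng)
print(i0.shape, i_t.shape)
```

Because the noise is known at training time, the clean input itself acts as the supervision signal, which is consistent with the caption's description of the four-channel input serving as ground truth for self-supervised training.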