Language-Depth Navigated Thermal and Visible Image Fusion
Jinchang Zhang, Zijun Li, Guoyu Lu
TL;DR
This work proposes a depth-guided, language-aware fusion framework for infrared and visible images. It combines diffusion-based multi-channel feature extraction with depth supervision, with the goal of improving 3D reconstruction and robotic perception. Two auxiliary depth-estimation branches supply a depth-driven loss that guides the fusion network, while a Depth-Informed Image Captioning Network and CLIP-based semantic guidance modulate the fusion features through language. Experiments on LLVIP, RoadScene, KAIST, and other datasets show improved depth-aware fusion quality and robust performance relative to state-of-the-art methods, particularly in low-light or cluttered environments. The approach shows practical promise for scene understanding, navigation, localization, and environmental perception in autonomous-driving and rescue scenarios.
Abstract
Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion methods focus mainly on detection tasks and ignore other critical information such as depth. Because fusion addresses the limitations of single modalities in low-light and complex environments, depth estimated from fused images not only yields more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operation in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch, which extracts multi-channel complementary information through a diffusion model and is equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions; these guide the diffusion model in extracting multi-channel features and generating fused images. The fused images are then fed into the depth estimation branches to compute a depth-driven loss, which is used to optimize the image fusion network. The framework aims to integrate vision-language cues and depth to directly generate color fused images from multimodal inputs.
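The training signal described in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the abstract does not give the form of the text guidance or of the depth-driven loss, so the scale-and-shift modulation, the L1 distance, the reference depth maps, and all names (`modulate`, `depth_driven_loss`, `identity_depth`, `inverted_depth`) are assumptions chosen for illustration. Plain lists stand in for CLIP-derived parameters, and simple callables stand in for the two depth estimation branches.

```python
from typing import Callable, List

Image = List[List[float]]            # toy 2D map with values in [0, 1]
DepthNet = Callable[[Image], Image]  # hypothetical stand-in for a depth branch


def modulate(features: List[float], scale: List[float], shift: List[float]) -> List[float]:
    """Scale-and-shift modulation of a feature vector.

    Assumed form of the text guidance: in the paper the parameters are
    derived from CLIP text features; here the caller supplies them directly.
    """
    return [(1.0 + s) * f + b for f, s, b in zip(features, scale, shift)]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Mean absolute (L1) difference between two equally sized 2D maps."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def depth_driven_loss(
    fused: Image,
    depth_net_a: DepthNet,
    depth_net_b: DepthNet,
    ref_a: Image,
    ref_b: Image,
    weight_a: float = 1.0,
    weight_b: float = 1.0,
) -> float:
    """Weighted sum of the two branches' depth errors on the fused image.

    ref_a and ref_b are hypothetical reference depth maps; where the real
    method obtains its depth supervision is not stated in the abstract.
    """
    loss_a = mean_abs_diff(depth_net_a(fused), ref_a)
    loss_b = mean_abs_diff(depth_net_b(fused), ref_b)
    return weight_a * loss_a + weight_b * loss_b


# Toy usage: two trivial "depth branches" applied to a 2x2 fused image.
def identity_depth(img: Image) -> Image:
    return [row[:] for row in img]


def inverted_depth(img: Image) -> Image:
    return [[1.0 - v for v in row] for row in img]


fused = [[0.2, 0.4], [0.6, 0.8]]
ones = [[1.0, 1.0], [1.0, 1.0]]
loss = depth_driven_loss(fused, identity_depth, inverted_depth, fused, ones)
modulated = modulate([1.0, 2.0], scale=[0.5, -0.5], shift=[0.1, 0.0])
```

In a real system the loss would be backpropagated through the depth branches into the fusion network, so that fused images are rewarded for preserving the structure that depth estimation relies on. The sketch only shows how the scalar is assembled.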
