Infrared and Visible Image Fusion with Language-Driven Loss in CLIP Embedding Space

Yuhao Wang; Lingjuan Miao; Zhiqiang Zhou; Lei Zhang; Yajun Qiao

TL;DR

This work tackles infrared-visible image fusion (IVIF) without ground-truth fused images by expressing the fusion objective in natural language and encoding it in CLIP embedding space. A language-driven fusion model defines the desired fusion direction, and a dedicated loss steers the actual fusion toward that direction, supplemented by patch-based artifact regularization. The authors report state-of-the-art fusion quality across multiple datasets and improved performance on high-level tasks such as object detection on fused images. The approach suggests that language grounding and vision-language models can simplify and improve multimodal image fusion with robust generalization.

Abstract

Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly complementary properties of the two image modalities. Because ground-truth fused images are unavailable, the output of current deep-learning-based methods depends heavily on mathematically defined loss functions. As it is hard to define the fused image well mathematically without ground truth, the performance of existing fusion methods is limited. In this paper, we propose to use natural language to express the objective of IVIF, which avoids the explicit mathematical modeling of the fusion output in current losses and makes full use of the expressiveness of language to improve fusion performance. For this purpose, we present a comprehensive language-expressed fusion objective and encode the relevant texts into the multi-modal embedding space using CLIP. A language-driven fusion model is then constructed in the embedding space by establishing the relationship among the embedded vectors representing the fusion objective and the input image modalities. Finally, a language-driven loss is derived to align the actual IVIF with the embedded language-driven fusion model via supervised training. Experiments show that our method obtains much better fusion results than existing techniques. The code is available at https://github.com/wyhlaowang/LDFusion.
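To make the idea of a loss defined in an embedding space more concrete, the following is a minimal sketch of a direction-alignment loss. It is not the authors' formulation: the abstract does not give the equations, so the choice of cosine similarity, the sum over the two modalities, and all function and variable names here are assumptions for illustration. Plain NumPy vectors stand in for CLIP text and image embeddings.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors (eps avoids division by zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def directional_loss(text_src, text_fused, img_src, img_fused):
    """1 - cos between the text-space shift and the image-space shift.

    Assumption: the language-defined direction is (fused prompt - source prompt)
    and the actual direction is (fused image - source image), both as embeddings.
    """
    delta_text = text_fused - text_src   # direction prescribed by language
    delta_img = img_fused - img_src      # direction the fusion actually took
    return 1.0 - cosine_similarity(delta_text, delta_img)


def language_driven_loss(t_ir, t_vs, t_fused, e_ir, e_vs, e_fused):
    """Hypothetical total: sum of the infrared and visible directional terms."""
    return (directional_loss(t_ir, t_fused, e_ir, e_fused)
            + directional_loss(t_vs, t_fused, e_vs, e_fused))


# Toy usage with random stand-ins for 512-dimensional embeddings.
rng = np.random.default_rng(0)
t_ir, t_vs, t_fused = rng.normal(size=(3, 512))
offset = rng.normal(size=512)
# Image embeddings that move in exactly the prescribed directions.
e_ir, e_vs, e_fused = t_ir + offset, t_vs + offset, t_fused + offset
aligned = language_driven_loss(t_ir, t_vs, t_fused, e_ir, e_vs, e_fused)
```

In this toy setup the image-space shifts equal the text-space shifts, so `aligned` is close to zero; a fused embedding that moves in the opposite direction would push each term toward 2. In an actual training loop the embeddings would come from a frozen CLIP encoder and the loss would be backpropagated through the fusion network.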

Paper Structure (23 sections, 6 equations, 10 figures, 5 tables)

Introduction
Related Work
Proposed Method
The Framework of Our Method
Language-Expressed Fusion Objective
Language Prompts for Source Images
Language-driven Fusion Model and Loss
Embedded Language-Driven Fusion Model
Language-Driven Fusion Loss
Patch Filter-Based Training for Artifact Removal
Fusion Network
Experiments
Setup
Ablation Study
Ablation Study on Language-driven Fusion Loss
...and 8 more sections

Figures (10)

Figure 1: CLIP can perceive both visible and infrared images.
Figure 2: The framework of the proposed method. The dashed line represents the language-driven training process, while the solid box denotes the inference process.
Figure 3: The schematic of the fusion process in CLIP embedding space. $\Delta V_{vs}$ and $\Delta V_{ir}$ jointly define the transition relationship established by the language-driven fusion model, while $\Delta\tilde{V}_{vs}$ and $\Delta\tilde{V}_{ir}$ denote the actual image fusion relationship.
Figure 4: Some examples of infrared (a) and visible (b) image fusion results without (c) and with (d) the patch filter based on information entropy.
Figure 5: The structure of the fusion network $G$.
...and 5 more figures
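The Figure 4 caption mentions a patch filter based on information entropy. As a rough illustration of what such a filter could compute, the sketch below measures the Shannon entropy of the gray-level histogram of each patch and selects patches against a threshold. The patch size, the threshold, the non-overlapping tiling, and the decision to keep high-entropy patches are all assumptions; the page does not specify how the authors' filter works.

```python
import numpy as np


def patch_entropy(patch, bins=256):
    """Shannon entropy (in bits) of the gray-level histogram of an 8-bit patch."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())


def filter_patches(image, patch_size=32, min_entropy=4.0):
    """Return top-left corners of non-overlapping patches with enough entropy.

    Hypothetical rule: keep a patch when its entropy is >= min_entropy.
    """
    h, w = image.shape
    kept = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size]
            if patch_entropy(patch) >= min_entropy:
                kept.append((y, x))
    return kept


# Toy image: flat left half (zero entropy), noisy right half (high entropy).
rng = np.random.default_rng(0)
image = np.full((64, 64), 128, dtype=np.uint8)
image[:, 32:] = rng.integers(0, 256, size=(64, 32), dtype=np.uint8)
kept = filter_patches(image)
```

With this toy image only the two patches in the noisy right half pass the threshold, while the flat patches on the left are dropped.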
