Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

Xunpeng Yi; Han Xu; Hao Zhang; Linfeng Tang; Jiayi Ma

TL;DR

The paper addresses infrared-visible image fusion when the source images are degraded, along with user-driven customization of the result, by introducing Text-IF, a text-guided fusion framework. Text-IF couples an image fusion pipeline with a text interaction module: a frozen CLIP text encoder turns a semantic prompt into features, and a Semantic Interaction Guidance Module (SIGM) uses those features to adapt the fusion to the degradation the prompt describes. Key contributions include a Transformer-based image fusion pipeline with cross-attention, SIGM-based text guidance, and semantically conditioned loss functions, validated on multiple datasets and on downstream tasks. The reported results show improved fusion quality and robustness to degradations, with interactive text prompts enabling flexible outputs that support practical applications such as object detection on fused imagery.
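To make the idea of text features guiding fusion features concrete, here is a minimal sketch in plain Python. It assumes the guidance acts as a per-channel scale and shift predicted from the text embedding. That specific form is an assumption for illustration, since this page only states that SIGM uses text semantic features to guide fusion. The names `modulate`, `linear`, and `init_weights`, the random weights, and the toy sizes are all hypothetical, and the text embedding is a random stand-in for the output of a frozen text encoder.

```python
import random


def linear(vec, weights, bias):
    """Compute W @ vec + b with plain lists (one row of W per output value)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def modulate(features, text_embedding, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift each feature channel with parameters predicted from text.

    features: list of C channels, each a flat list of pixel values.
    text_embedding: list of D floats (stand-in for a frozen text-encoder output).
    NOTE: scale-and-shift guidance is an assumed form, not confirmed by the source.
    """
    scale = linear(text_embedding, w_scale, b_scale)  # one value per channel
    shift = linear(text_embedding, w_shift, b_shift)  # one value per channel
    return [[(1.0 + s) * x + t for x in channel]
            for channel, s, t in zip(features, scale, shift)]


def init_weights(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, D, N = 3, 4, 6  # channels, text-embedding size, pixels per channel (toy sizes)
features = [[rng.random() for _ in range(N)] for _ in range(C)]
text_embedding = [rng.random() for _ in range(D)]
w_scale, w_shift = init_weights(C, D, rng), init_weights(C, D, rng)
zeros = [0.0] * C

# Features guided by a (stand-in) prompt embedding.
guided = modulate(features, text_embedding, w_scale, zeros, w_shift, zeros)
# An all-zero embedding predicts zero scale and shift, so features pass through.
unguided = modulate(features, [0.0] * D, w_scale, zeros, w_shift, zeros)
```

In this sketch, the `1.0 + s` form means that an embedding carrying no signal leaves the fusion features unchanged, so the text only perturbs the image pathway rather than replacing it.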

Abstract

Image fusion aims to combine information from different source images to create a comprehensively representative image. Existing fusion methods typically cannot handle degradations in low-quality source images and are not interactive with respect to multiple subjective and objective needs. To address these limitations, we introduce a novel approach, termed Text-IF, that leverages a semantic text-guided image fusion model for the degradation-aware and interactive image fusion task. It extends classical image fusion to text-guided image fusion and addresses the degradation and interaction issues jointly during fusion. Through the text semantic encoder and the semantic interaction fusion decoder, Text-IF supports all-in-one degradation-aware processing of infrared and visible images as well as interactive, flexible fusion outcomes. In this way, Text-IF achieves not only multi-modal image fusion, but also multi-modal information fusion. Extensive experiments show that our proposed text-guided image fusion strategy has clear advantages over SOTA methods in image fusion performance and degradation treatment. The code is available at https://github.com/XunpengYi/Text-IF.

Paper Structure (14 sections, 14 equations, 6 figures, 4 tables)

Introduction
Related Work
The Proposed Method
Problem Formulation
Image Fusion Pipeline
Text Interaction Guidance Architecture
Loss Functions
Experiments
Implementation Details and Datasets
Comparison without Text Guidance
Comparison with Text Guidance
Performance on High-level Task
Ablation Experiment
Conclusion

Figures (6)

Figure 1: Fusion approaches for complex scenes with degradations. (a) simple fusion approach: treating image fusion with predefined fusion loss and not applicable to complex scenes with degradations. (b) separated approach: requiring frequent restoration methods switching according to the type of degradations, which is troublesome and not well-done. (c) proposed text guided image fusion approach: achieving interactive and high-quality fusion image without tedious replacement of models.
Figure 2: The workflow of Text-IF. It contains two important parts, which are the image fusion pipeline and the text semantic feature encoder. Text semantic features are used to guide image fusion through the Semantic Interaction Guidance Module (SIGM).
Figure 3: Qualitative comparison of our Text-IF without text guidance (without additional semantic information) and existing image fusion methods. From top to bottom: data from MSRS, two groups of data from LLVIP, and data from RoadScene datasets, respectively.
Figure 4: Comparison of our Text-IF with semantic text guidance and the combination of existing image restoration and fusion methods on degraded source images. The semantic text is reported above each group of images. Degradations from top to bottom: low-light visible (MSRS), low-light visible (LLVIP), low-contrast infrared (MFNet), noised infrared (DN-MSRS), over-exposed visible (RoadScene).
Figure 5: Qualitative comparison of object detection performance on LLVIP (without introducing additional semantic information).
...and 1 more figure
