TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion

Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Hui Li, Xi Li, Zhangyong Tang, Josef Kittler

TL;DR

TextFusion is a text-guided image fusion framework that uses text semantics to controllably fuse infrared and visible images. It introduces an affine fusion unit within a transformer-based architecture and a coarse-to-fine text–vision association to inject semantic guidance, along with a textual-attention-based fusion assessment. A new IVT dataset with IR–Vis pairs and textual descriptions is released to enable multimodal training and evaluation. Empirical results show improved fusion quality, controllability via prompts, and enhanced performance on downstream tasks such as pedestrian detection, with ablations supporting the design choices.

Abstract

Advanced image fusion methods generate fused results by aggregating the complementary information conveyed by the source images. However, because each source modality depicts the scene content differently, it is difficult to design a robust and controllable fusion process. We argue that this issue can be alleviated with the help of higher-level semantics conveyed by the text modality, which should enable us to generate fused images for different purposes, such as visualisation and downstream tasks, in a controllable way. This is achieved by exploiting a vision-and-language model to build a coarse-to-fine association mechanism between the text and image signals. Guided by the association maps, an affine fusion unit embedded in the transformer network fuses the text and vision modalities at the feature level. As another ingredient of this work, we propose the use of textual attention to adapt image quality assessment to the fusion task. To facilitate the implementation of the proposed text-guided fusion paradigm, and its adoption by the wider research community, we release a text-annotated image fusion dataset, IVT. Extensive experiments demonstrate that our approach (TextFusion) consistently outperforms traditional appearance-based fusion methods. Our code and dataset will be publicly available at https://github.com/AWCXV/TextFusion.

Paper Structure (26 sections, 17 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: The input modalities and two fusion results obtained by our TextFusion with different descriptions. As reflected in the metrics, each application scenario requires a different fusion scheme to achieve the best performance. (EN is the information entropy image quality metric and mAP denotes the mean average precision for detection.)
  • Figure 2: Existing learning-based image fusion methods and the proposed controllable image fusion paradigm. To generate appropriate fusion results for a specific scenario (different tasks or objects of concern), existing methods either cannot do so or require expensive retraining. In our paradigm, the same goal can be achieved by simply adjusting the focus of the textual description. (DET-all: general detection; OB-all: observation of the whole scene; OB-part: observation of the regions of interest)
  • Figure 3: An illustration of the different annotation procedures in the training and testing phases. Because the two phases follow different annotation principles, our research group and independent volunteers, respectively, act as the observers in these phases of the text-guided image fusion paradigm, to avoid potential bias.
  • Figure 4: The demographic information of the annotation volunteers. (a) and (c) show the age and research-interest information of the specialists. (b) shows the age information of the non-specialists. As shown in the charts, the observers span a range of ages and research areas, and can therefore represent a general human view in understanding the RGBT image pairs.
  • Figure 5: An illustration of the TextFusion model and the affine fusion unit design. (a): Our fusion model takes the image pairs and the textual description as input. The text encoder from the CLIP model and two vision encoders based on Swin Transformer blocks are used to extract the hidden representations of the source inputs. We further propose an affine fusion unit to align and aggregate these features. Subsequently, we use a decoder consisting of convolutional layers to reconstruct the fused image. (b): The affine fusion unit fuses the vision signals with the help of the text modality. In our design, the infrared modality is used to generate the weight term $\mu$, while the bias term $\lambda$ is calculated from the visible images. We expand the spatial dimension of the text features to match the weight term prior to performing the element-wise multiplication operation.
  • ...and 11 more figures
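
The affine fusion unit described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. This is one plausible reading of the caption, not the authors' implementation: it assumes the fused feature takes the form $\mu \odot \mathrm{expand}(t) + \lambda$, that the text feature $t$ has already been projected to the same channel count as the vision features, and that $\mu$ and $\lambda$ come from simple 1x1 channel-mixing projections. The function name `affine_fusion` and the weights `w_mu` and `w_lambda` are hypothetical stand-ins introduced for illustration only.

```python
import numpy as np


def affine_fusion(ir_feat, vis_feat, text_feat, w_mu, w_lambda):
    """Toy affine fusion: fused = mu * expand(text) + lambda.

    ir_feat, vis_feat : (C, H, W) infrared / visible feature maps.
    text_feat         : (C,) text embedding, assumed already projected to C channels.
    w_mu, w_lambda    : (C, C) hypothetical 1x1-projection weights (assumptions).
    """
    # Weight term mu, generated from the infrared features.
    mu = np.einsum("oc,chw->ohw", w_mu, ir_feat)
    # Bias term lambda, generated from the visible features.
    lam = np.einsum("oc,chw->ohw", w_lambda, vis_feat)
    # Expand the text vector over the spatial dimensions so it matches mu.
    text_map = np.broadcast_to(text_feat[:, None, None], mu.shape)
    # Element-wise multiplication with the weight term, then add the bias term.
    return mu * text_map + lam


# Usage with random stand-in features (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
fused = affine_fusion(
    rng.standard_normal((C, H, W)),   # infrared features
    rng.standard_normal((C, H, W)),   # visible features
    rng.standard_normal(C),           # text features
    rng.standard_normal((C, C)),      # hypothetical projection for mu
    rng.standard_normal((C, C)),      # hypothetical projection for lambda
)
print(fused.shape)  # (4, 8, 8)
```

In this reading, the text feature scales the infrared-derived weight term channel by channel, so a different description changes how strongly each channel of the infrared response contributes, while the visible-derived bias term is carried through unchanged.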