From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion

Xingyuan Li, Yang Zou, Jinyuan Liu, Zhiying Jiang, Long Ma, Xin Fan, Risheng Liu

TL;DR

This work tackles the semantic gap in infrared-visible image fusion (IVIF) by introducing a text-guided fusion framework that uses CLIP-based text semantics to align the two modalities. It integrates a multi-level feature extractor, a text-guided attention fusion module, and a codebook with a bilevel optimization scheme that jointly optimizes fusion output and object detection, guided by the losses $\mathcal{L}^{str}$ and $\mathcal{L}^{cc}$. A new paired IVIF-text dataset is introduced to benchmark detection performance under text guidance. Across three IVIF datasets, the method achieves state-of-the-art fusion quality and higher detection mAP, demonstrating robustness and practical value for night-vision and surveillance tasks.

Abstract

With the rapid progression of deep learning technologies, multi-modality image fusion has become increasingly prevalent in object detection tasks. Despite its popularity, the inherent disparities in how different sources depict scene content make fusion a challenging problem. Current fusion methodologies identify shared characteristics between the two modalities and integrate them within this shared domain using either iterative optimization or deep learning architectures, which often neglect the intricate semantic relationships between modalities, resulting in a superficial understanding of inter-modal connections and, consequently, suboptimal fusion outcomes. To address this, we introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. This method capitalizes on the complementary characteristics of diverse modalities, bolstering both the accuracy and robustness of object detection. The codebook is utilized to enhance a streamlined and concise depiction of the fused intra- and inter-domain dynamics, fine-tuned for optimal performance in detection tasks. We present a bilevel optimization strategy that establishes a nexus between the joint problem of fusion and detection, optimizing both processes concurrently. Furthermore, we introduce the first dataset of paired infrared and visible images accompanied by text prompts, paving the way for future research. Extensive experiments on several datasets demonstrate that our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
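
The abstract does not say how the codebook is built or queried. As a rough illustration only, the sketch below assumes a generic vector-quantization-style lookup, in which each fused feature vector is replaced by its nearest codebook entry to give a compact discrete representation. The function names (`squared_distance`, `quantize`), the toy 2-D vectors, and the nearest-neighbor rule are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: List[Vector], codebook: List[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each feature vector to its nearest codebook entry.

    Hypothetical helper: a generic nearest-neighbor lookup, not the
    paper's actual codebook mechanism.
    """
    indices: List[int] = []
    quantized: List[Vector] = []
    for feature in features:
        # Index of the codebook entry closest to this feature.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(best)
        quantized.append(codebook[best])
    return indices, quantized


# Toy data (assumed for illustration): three code vectors, three fused features.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
fused_features = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]

indices, quantized = quantize(fused_features, codebook)
print(indices)    # [0, 1, 2]
print(quantized)  # [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
```

In a learned setting the codebook entries would be trained jointly with the rest of the network; the lookup above only shows the discretization step under the stated assumptions.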

Paper Structure

This paper contains 17 sections, 11 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Schematic representation of semantic integration from textual descriptions into infrared and visible images to enhance object detection efficacy.
  • Figure 2: The overall architecture of our proposed text-guided fusion method for multi-modal image fusion and object detection.
  • Figure 3: The procedure of our text-guided attention mechanism.
  • Figure 4: The framework of the bilevel optimization process.
  • Figure 5: Comparative visual fusion of our proposed method versus state-of-the-art methods on three typical image pairs in $\text{M}^{3}\text{FD}$, TNO, and RoadScene datasets.
  • ...and 7 more figures