IVGF: The Fusion-Guided Infrared and Visible General Framework

Fangcen Liu, Chenqiang Gao, Fang Chen, Pengcheng Li, Junjie Guo, Deyu Meng

TL;DR

IVGF is a general fusion-guided framework for infrared and visible imagery that generalizes to high-level vision tasks without requiring extensive paired data. It combines state-of-the-art infrared and visible foundation models with a feature enhancement module, a token enhancement module, an attention-guided fusion module, and a cutout&mix augmentation strategy to exploit cross-modal complementarity. Ablation and cross-task experiments indicate that each module contributes to improved semantic segmentation and object detection performance, and the method is robust to a missing modality. The approach offers a scalable path to all-weather, dual-modality perception, with practical relevance to tasks such as autonomous driving and surveillance.

Abstract

Infrared and visible dual-modality tasks such as semantic segmentation and object detection can achieve robust performance even in extreme scenes by fusing complementary information. Most current methods design task-specific frameworks, which generalize poorly across tasks. In this paper, we propose a fusion-guided infrared and visible general framework, IVGF, which can be easily extended to many high-level vision tasks. First, we adopt state-of-the-art (SOTA) infrared and visible foundation models to extract general representations. Then, to enrich the semantic information of these general representations for high-level vision tasks, we design a feature enhancement module for feature maps and a token enhancement module for tokens. In addition, we propose an attention-guided fusion module that fuses the two modalities effectively by exploring their complementary information. We also adopt the cutout&mix strategy for data augmentation, which further improves the model's ability to mine the regional complementarity between the two modalities. Extensive experiments show that IVGF outperforms state-of-the-art dual-modality methods on semantic segmentation and object detection. Detailed ablation studies demonstrate the effectiveness of each module, and a further experiment explores the robustness of the proposed method to a missing modality in the dual-modality semantic segmentation task.
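The abstract names the cutout&mix augmentation but does not describe how it operates. The sketch below shows only one plausible reading, offered as an illustration rather than the authors' procedure: a random rectangle is erased in exactly one modality of an aligned pair, so that the erased region can only be recovered from the other modality. The function name `cutout_mix`, the `box_frac` parameter, the zero fill value, and the 50/50 choice of modality are all assumptions.

```python
import random

def cutout_mix(ir, vis, box_frac=0.5, rng=None):
    """Erase one random box in one modality of an aligned image pair.

    ir, vis: 2-D lists (H x W) with identical shape.
    Returns (new_ir, new_vis, box, which), where box is (top, left, height, width)
    and which is "ir" or "vis" (the modality that was erased).
    All names and defaults here are hypothetical, not taken from the paper.
    """
    rng = rng or random.Random()
    h, w = len(ir), len(ir[0])
    bh, bw = max(1, int(h * box_frac)), max(1, int(w * box_frac))
    top = rng.randint(0, h - bh)    # randint is inclusive at both ends
    left = rng.randint(0, w - bw)
    cut_ir = rng.random() < 0.5     # assumed: pick the erased modality uniformly

    # Copy so the caller's images are left untouched.
    new_ir = [row[:] for row in ir]
    new_vis = [row[:] for row in vis]
    target = new_ir if cut_ir else new_vis
    for y in range(top, top + bh):
        for x in range(left, left + bw):
            target[y][x] = 0        # assumed fill value
    return new_ir, new_vis, (top, left, bh, bw), ("ir" if cut_ir else "vis")
```

Erasing only one modality per region is what would make the regional complementarity useful during training: the label for that region stays valid, but the evidence for it survives in a single modality.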

Paper Structure (32 sections, 20 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Different structures for infrared and visible dual-modality high-level vision tasks. (a) Image fusion-based methods first obtain fused images and then take them as inputs for downstream tasks; their main purpose is to produce high-quality fused images. (b) Task-specific methods propose novel frameworks to address specific problems in a single task, which limits their generalization. (c) The proposed general framework for infrared and visible modalities generalizes well and is easily extended to both detection and semantic segmentation by attaching task heads.
  • Figure 2: The framework of the proposed method. It contains five main components: two modality-specific backbones, the feature enhancement module, the token enhancement module, the attention-guided fusion module, and the task-specific head. Each backbone extracts the general representation of its modality. The feature enhancement and token enhancement modules are designed for feature maps and tokens, respectively. The attention-guided fusion module integrates the complementary information from the features of the two modalities.
  • Figure 3: The proposed feature enhancement module. It contains the cross-modality spatial integration and the intra-modality channel attention.
  • Figure 4: The proposed token enhancement module.
  • Figure 5: The proposed attention-guided fusion module.
  • ...and 3 more figures
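The captions above name an attention-guided fusion module but give no formulas. As a rough illustration of the general idea only, the sketch below fuses two per-channel feature vectors with softmax weights computed from per-channel scores. It is a generic attention-weighted sum, not the paper's module; the function name `attention_fuse` and the score inputs (which a real model would learn) are hypothetical.

```python
import math

def attention_fuse(ir_feat, vis_feat, ir_score, vis_score):
    """Fuse two feature vectors channel by channel.

    For each channel, a two-way softmax over (ir_score, vis_score) gives
    weights that sum to 1; the fused value is the weighted sum of the two
    modality features. Scores are assumed inputs here, not learned.
    """
    fused = []
    for fi, fv, si, sv in zip(ir_feat, vis_feat, ir_score, vis_score):
        m = max(si, sv)                       # subtract max for numerical stability
        ei, ev = math.exp(si - m), math.exp(sv - m)
        wi = ei / (ei + ev)                   # infrared weight; visible gets 1 - wi
        fused.append(wi * fi + (1.0 - wi) * fv)
    return fused

# Equal scores give a plain average of the two modalities.
print(attention_fuse([2.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
```

When one score dominates, the fused channel follows that modality almost entirely, which is the behaviour one would want at night (infrared dominant) or in good daylight (visible dominant).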