A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion
Xiaoli Zhang, Liying Wang, Libo Zhao, Xiongfei Li, Siwei Ma
TL;DR
This work tackles infrared-visible image fusion by explicitly modeling cross-modal relations and preserving high-frequency details while enhancing downstream tasks. It introduces SMFNet, a semantic-aware, multi-guided network with a three-branch encoder (CAI, BFE, GR), a three-stream fusion strategy, and a two-stage training regime guided by reconstruction, semantic, and correlation-based losses. The Graph Reasoning module captures high-level modality interactions; CAI preserves fine details; and the BFE captures long-range dependencies, with Gram-based semantic loss reinforcing meaningful textures. Across fusion benchmarks and downstream tasks such as object detection and semantic segmentation, SMFNet achieves state-of-the-art or competitive results with efficient runtime, and extends to medical image fusion, validating its versatility and practical impact.
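To make the idea of a Gram-based semantic loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes each feature map is given as a list of channels, each flattened to a list of activations, and that the loss is the mean squared difference between two Gram matrices. The names `gram_matrix` and `gram_loss` and the normalization by the number of positions are illustrative assumptions; the summary above does not specify them.

```python
def gram_matrix(features):
    """Return the C x C Gram matrix of a feature map.

    `features` is a list of C channels, each a flat list of N activations.
    Dividing by N is an assumed normalization, not taken from the paper.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def gram_loss(feat_a, feat_b):
    """Mean squared difference between the Gram matrices of two feature maps."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    c = len(ga)
    return sum(
        (ga[i][j] - gb[i][j]) ** 2 for i in range(c) for j in range(c)
    ) / (c * c)


# Toy feature maps with 2 channels and 3 spatial positions (made-up values).
fused = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
source = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
other = [[1.0, 2.0, 3.0], [1.0, 0.0, 0.0]]

loss_same = gram_loss(fused, source)  # identical channel co-activations
loss_diff = gram_loss(fused, other)   # second channel fires elsewhere
```

Because the Gram matrix records which channels co-activate rather than where, a loss of this kind penalizes differences in texture statistics instead of pixel-wise differences.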
Abstract
Multi-modality image fusion aims to fuse modality-specific (complementary) and modality-shared (correlated) information from multiple source images. Existing methods tend to neglect inter-feature relationships, lose high-frequency information, and pay limited attention to downstream tasks. To address these problems, this paper focuses on how to model correlation-driven feature decomposition and reason over high-level graph representations by efficiently extracting complementary information and aggregating multi-guided features. We propose a three-branch encoder-decoder architecture, with corresponding fusion layers as the fusion strategy. First, shallow features are extracted from each modality by a depthwise convolution layer combined with a transformer block. The encoder then has three parallel branches. The Cross Attention and Invertible Block (CAI) extracts local features and preserves high-frequency texture details. The Base Feature Extraction Module (BFE) captures long-range dependencies and enhances modality-shared information. The Graph Reasoning Module (GR) reasons over high-level cross-modality relations and simultaneously extracts low-level detail features that serve as modality-specific complementary information for CAI. Experiments demonstrate competitive results compared with state-of-the-art methods on visible/infrared image fusion and medical image fusion tasks. Moreover, the proposed algorithm surpasses state-of-the-art methods on downstream tasks, scoring on average 8.27% higher mAP@0.5 in object detection and 5.85% higher mIoU in semantic segmentation. The code is available at https://github.com/Abraham-Einstein/SMFNet/.
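The split into modality-shared and modality-specific features can be illustrated with a small correlation-based objective. This is a hedged sketch under stated assumptions, not the loss used in the paper: it assumes features are flattened to lists, measures agreement with the Pearson correlation coefficient, and combines the two terms as a ratio so that shared (base) features are pushed toward high correlation and specific (detail) features toward low correlation. The function names, the ratio form, and the constant `eps = 1.01` are all illustrative choices that the abstract does not state.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Small constant avoids division by zero for constant inputs.
    return cov / (sx * sy + 1e-8)


def decomposition_loss(base_ir, base_vis, detail_ir, detail_vis, eps=1.01):
    """Hypothetical ratio-style decomposition objective.

    Small when base features correlate across modalities and detail
    features do not; eps > 1 keeps the denominator positive.
    """
    cc_base = pearson(base_ir, base_vis)
    cc_detail = pearson(detail_ir, detail_vis)
    return cc_detail ** 2 / (cc_base + eps)


# Toy flattened features (made-up values, not data from the paper).
shared_a = [1.0, 2.0, 3.0, 4.0]
shared_b = [2.0, 4.0, 6.0, 8.0]        # perfectly correlated with shared_a
specific_a = [1.0, -1.0, 1.0, -1.0]
specific_b = [1.0, 1.0, -1.0, -1.0]    # uncorrelated with specific_a

# Desired decomposition: correlated base features, uncorrelated detail features.
good = decomposition_loss(shared_a, shared_b, specific_a, specific_b)
# Roles swapped: uncorrelated base features, correlated detail features.
bad = decomposition_loss(specific_a, specific_b, shared_a, shared_b)
```

In this toy case the desired decomposition yields a lower value than the swapped one, which is the behavior a correlation-driven objective is meant to encourage.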
