A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

Xiaoli Zhang, Liying Wang, Libo Zhao, Xiongfei Li, Siwei Ma

TL;DR

This work tackles infrared-visible image fusion by explicitly modeling cross-modal relations and preserving high-frequency details while enhancing downstream tasks. It introduces SMFNet, a semantic-aware, multi-guided network with a three-branch encoder (CAI, BFE, GR), a three-stream fusion strategy, and a two-stage training regime guided by reconstruction, semantic, and correlation-based losses. The Graph Reasoning module captures high-level modality interactions; CAI preserves fine details; and the BFE captures long-range dependencies, with Gram-based semantic loss reinforcing meaningful textures. Across fusion benchmarks and downstream tasks such as object detection and semantic segmentation, SMFNet achieves state-of-the-art or competitive results with efficient runtime, and extends to medical image fusion, validating its versatility and practical impact.
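The Gram-based semantic loss mentioned above can be illustrated with a minimal sketch. The summary does not state which features the loss compares or how it is normalised, so the function names, the `(C, H, W)` feature layout, the normalisation by `C * H * W`, and the mean-squared comparison below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel Gram matrix of a (C, H, W) feature map.

    Normalising by C * H * W is an assumption; other scalings are common.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def gram_loss(fused_features, source_features):
    """Mean squared difference between two Gram matrices (hypothetical form)."""
    diff = gram_matrix(fused_features) - gram_matrix(source_features)
    return float(np.mean(diff ** 2))

# Toy feature maps standing in for encoder outputs.
rng = np.random.default_rng(0)
source = rng.standard_normal((8, 16, 16))
fused = source + 0.1 * rng.standard_normal((8, 16, 16))

print(gram_loss(source, source))  # identical features: no penalty
print(gram_loss(fused, source))   # perturbed features: positive penalty
```

Because a Gram matrix records which channels activate together and discards where they activate, a loss of this kind penalises differences in texture statistics rather than pixel-wise differences.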

Abstract

Multi-modality image fusion aims to fuse modality-specific (complementary) and modality-shared (correlated) information from multiple source images. To address three problems, namely neglected inter-feature relationships, loss of high-frequency information, and limited attention to downstream tasks, this paper focuses on modeling correlation-driven feature decomposition and reasoning over a high-level graph representation, by efficiently extracting complementary information and aggregating multi-guided features. We propose a three-branch encoder-decoder architecture, with corresponding fusion layers serving as the fusion strategy. First, shallow features of each modality are extracted by a depthwise convolution layer combined with a transformer block. In the three parallel branches of the encoder, the Cross Attention and Invertible Block (CAI) extracts local features and preserves high-frequency texture details. The Base Feature Extraction Module (BFE) captures long-range dependencies and enhances modality-shared information. The Graph Reasoning Module (GR) reasons about high-level cross-modality relations and simultaneously extracts low-level detail features that complement CAI's modality-specific information. Experiments show results competitive with state-of-the-art methods on visible/infrared image fusion and medical image fusion tasks. Moreover, the proposed algorithm surpasses state-of-the-art methods on downstream tasks, scoring on average 8.27% higher mAP@0.5 in object detection and 5.85% higher mIoU in semantic segmentation. The code is available at https://github.com/Abraham-Einstein/SMFNet/.
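The idea of correlation-driven decomposition can be sketched as follows: modality-shared features from the two sources should be strongly correlated, while modality-specific features should not. The abstract does not give the loss formula, so the ratio form, the constant `1.01`, and all function and argument names below are assumptions chosen for illustration only.

```python
import numpy as np

def correlation(a, b, eps=1e-8):
    """Pearson correlation coefficient between two flattened feature maps."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def decomposition_loss(shared_ir, shared_vis, specific_ir, specific_vis, offset=1.01):
    """Hypothetical ratio loss: small when shared features correlate
    and specific features do not. The offset keeps the denominator positive,
    since a correlation coefficient is never below -1."""
    cc_shared = correlation(shared_ir, shared_vis)
    cc_specific = correlation(specific_ir, specific_vis)
    return cc_specific ** 2 / (cc_shared + offset)

# Toy features: one strongly correlated pair and one independent pair.
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 8, 8))
near_copy = base + 0.05 * rng.standard_normal((4, 8, 8))
noise_a = rng.standard_normal((4, 8, 8))
noise_b = rng.standard_normal((4, 8, 8))

# Desired decomposition: shared pair correlated, specific pair independent.
good = decomposition_loss(base, near_copy, noise_a, noise_b)
# Undesired decomposition: the roles are swapped.
bad = decomposition_loss(noise_a, noise_b, base, near_copy)
print(good, bad)
```

In this toy setting the desired decomposition yields the smaller loss, which is the behaviour such a term is meant to encourage during training.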

Paper Structure (36 sections, 31 equations, 12 figures, 8 tables)


Figures (12)

  • Figure 1: (a) Pipeline of typical auto-encoder-based methods, consisting of encoders, a decoder, and auxiliary fusion strategies; (b) our proposed semantic-aware and multi-guided network; (c) qualitative results of fused images on the TNO and RoadScene datasets, compared with eleven state-of-the-art methods via radar plots.
  • Figure 2: The architecture of the proposed SMFNet. (a) In training stage I, the source images are reconstructed by a three-branch encoder and a decoder. Features extracted by CAI, BFE, and GR are aggregated and then fed into the decoder. (b) In training stage II, modality-specific, modality-shared, and graph low-level features are further fused by the proposed fusion strategy. The decoder then generates the fused image. (c) CAI refines fine-grained features and retains high-frequency information using a cross-attention mechanism and an invertible module. (d) BFE enhances long-range dependencies via residual connections. (e) GR, based on a GNN, models the high-level relationships between the two modalities' shallow features and simultaneously reasons about low-level detail features.
  • Figure 3: Detailed illustration of the node and edge generation process in the graph reasoning module.
  • Figure 4: Proposed detail fusion layers in training stage II.
  • Figure 5: Qualitative fusion results on the TNO dataset [toet2012progress]. The magnified areas in (a) to (n) show the shade board on the eave and the traffic sign arrow pointing down and to the left, for the visible image, the infrared image, and the fused images generated by state-of-the-art methods.
  • ...and 7 more figures