Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion

Xilai Li, Xiaosong Li, Weijun Jiang

TL;DR

Unified multi-modality image fusion (MMIF) models share parameters across modalities, and large modality differences often cause gradient conflicts during training. This work introduces UP-Fusion, a unified MMIF framework that integrates pre-trained knowledge through three modules: the Semantic-Aware Channel Pruning Module (SCPM), the Geometric Affine Modulation Module (GAM), and the Text-Guided Channel Perturbation Module (TCPM). Together they suppress redundant modal information while preserving modal discriminability. SCPM combines channel attention with semantic priors from a pre-trained ConvNeXt; GAM applies geometry-informed affine modulation; TCPM uses CLIP-guided text features to steer channel rearrangement during decoding, reducing modality bias. Experiments on infrared-visible and medical image fusion, along with downstream segmentation and detection, show that UP-Fusion outperforms both task-specific and unified baselines and generalizes across tasks, and ablations confirm the contribution of each component.
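The two channel-level ideas named above (attention-based channel pruning with a semantic prior, and text-guided channel rearrangement) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the additive combination of pooled features and prior, the `keep_ratio` parameter, and the `channel_keys` projection are all assumptions made for illustration, and the real modules would obtain the prior from ConvNeXt and the text embedding from CLIP.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def semantic_channel_gate(features, semantic_prior, keep_ratio=0.5):
    """Gate and prune channels of a (C, H, W) feature map.

    semantic_prior is a hypothetical (C,) vector of per-channel scores,
    standing in for whatever a pre-trained backbone would provide.
    """
    # Squeeze step of channel attention: global average pooling per channel.
    channel_descriptor = features.mean(axis=(1, 2))
    # Assumption: pooled statistics and the prior are simply added.
    scores = sigmoid(channel_descriptor + semantic_prior)
    # Keep the highest-scoring channels and zero out the rest.
    num_keep = max(1, int(round(keep_ratio * features.shape[0])))
    keep_idx = np.argsort(scores)[-num_keep:]
    mask = np.zeros(features.shape[0], dtype=bool)
    mask[keep_idx] = True
    gated = features * (scores * mask)[:, None, None]
    return gated, mask


def text_guided_reorder(features, text_embedding, channel_keys):
    """Reorder channels by their affinity to a text embedding.

    channel_keys is a hypothetical (C, D) matrix of per-channel keys;
    text_embedding is a (D,) vector such as a text encoder might produce.
    """
    scores = channel_keys @ text_embedding
    order = np.argsort(-scores)  # most text-relevant channel first
    return features[order], order
```

In this sketch pruning is a hard top-k mask applied on top of soft attention weights; a trainable version would need a differentiable relaxation, which is omitted here.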

Abstract

Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities for multi-modality image fusion, but large modality differences often cause gradient conflicts, limiting performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality. However, this strategy reduces generalization across different fusion tasks. To overcome this limitation, we propose a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration (UP-Fusion). To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we propose the Geometric Affine Modulation Module (GAM), which uses the original modal features to apply affine transformations to the initial fusion features, preserving the modal discriminability of the feature encoder. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.
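As a rough illustration of what "applying affine transformations to the initial fusion features" can mean, the following NumPy sketch derives a per-pixel scale and shift from a source-modality feature map and applies them to the fused features, in the style of feature-wise affine modulation. The function name, the `w_scale` and `w_shift` weight matrices (standing in for 1x1 convolutions), and the `1 + scale` residual form are assumptions for illustration only; the abstract does not specify how GAM computes its parameters.

```python
import numpy as np


def affine_modulate(fused, source, w_scale, w_shift):
    """Modulate fused features (C, H, W) with parameters predicted from
    source-modality features (C_s, H, W).

    w_scale and w_shift are hypothetical (C, C_s) weights acting as
    1x1 convolutions over the channel axis.
    """
    # A 1x1 convolution is a per-pixel linear map across channels.
    scale = np.einsum("oc,chw->ohw", w_scale, source)
    shift = np.einsum("oc,chw->ohw", w_shift, source)
    # Residual form: zero weights leave the fused features unchanged.
    return (1.0 + scale) * fused + shift
```

With both weight matrices set to zero the function returns `fused` unchanged, so under this formulation the modulation starts as an identity mapping and only departs from it as the weights are learned.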

Paper Structure

This paper contains 26 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of the Single AE algorithm r14, Multiple AE algorithm r37, and the proposed method across different MMIF tasks.
  • Figure 2: The overall framework of the proposed unified multi-modality image fusion framework.
  • Figure 3: Comparison of the proposed algorithm and medical image fusion methods on medical image tasks.
  • Figure 4: Comparison of different methods on infrared and visible image fusion, and medical image fusion tasks.
  • Figure 5: Different ablation results on medical image fusion.