Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion
Xilai Li, Xiaosong Li, Weijun Jiang
TL;DR
Multi-modality image fusion (MMIF) often suffers from gradient conflicts when parameters are shared across modalities with large differences. This work introduces UP-Fusion, a unified MMIF framework that combines a Semantic-Aware Channel Pruning Module (SCPM), a Geometric Affine Modulation Module (GAM), and a Text-Guided Channel Perturbation Module (TCPM) with pre-trained knowledge to suppress modality redundancy and preserve cross-modal discriminability. SCPM combines channel attention with semantic priors from a pre-trained ConvNeXt; GAM applies geometry-informed affine modulation; TCPM uses CLIP-guided text features to steer channel rearrangement during decoding, reducing modality bias. Experiments on infrared-visible and medical image fusion tasks, along with downstream segmentation and detection, show that UP-Fusion outperforms both task-specific and unified baselines and generalizes across tasks, with ablations confirming the benefit of each component.
Abstract
Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities, but large modality differences often cause gradient conflicts that limit performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality; however, this strategy reduces generalisation across different fusion tasks. To overcome this limitation, we propose UP-Fusion, a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration. To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we propose the Geometric Affine Modulation Module (GAM), which uses the original modal features to apply affine transformations to the initial fusion features, thereby maintaining the modal discriminability of the feature encoder. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.
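The abstract does not specify how GAM computes its affine parameters. As a rough illustration of the general idea only (a per-channel scale and shift predicted from source-modality features and applied to fused features), the following is a minimal FiLM-style sketch in NumPy. The function name `affine_modulate`, the projection weights `w_gamma` and `w_beta`, the use of global average pooling, and the tensor shapes are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def affine_modulate(fused, modal, w_gamma, w_beta):
    """Hypothetical FiLM-style affine modulation (not the paper's GAM).

    fused:   (C, H, W) initial fusion features
    modal:   (C_m, H, W) features from one source modality
    w_gamma: (C, C_m) assumed projection producing per-channel scales
    w_beta:  (C, C_m) assumed projection producing per-channel shifts
    """
    # Summarize the modality features with global average pooling.
    pooled = modal.mean(axis=(1, 2))            # shape (C_m,)
    # Scale is centered on 1 so zero weights give the identity mapping.
    gamma = 1.0 + np.tanh(w_gamma @ pooled)     # shape (C,)
    beta = w_beta @ pooled                      # shape (C,)
    # Broadcast the per-channel scale and shift over spatial positions.
    return gamma[:, None, None] * fused + beta[:, None, None]

rng = np.random.default_rng(0)
fused = rng.standard_normal((8, 4, 4))
modal = rng.standard_normal((6, 4, 4))

# With zero projection weights the modulation leaves the input unchanged.
out = affine_modulate(fused, modal, np.zeros((8, 6)), np.zeros((8, 6)))
```

In a trained model the projections would be learned, so the source modality could amplify or suppress individual fused channels; the zero-weight call above only checks that the sketch reduces to the identity.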
