CrossFuse: Learning Infrared and Visible Image Fusion by Cross-Sensor Top-K Vision Alignment and Beyond

Yukai Shi, Cidan Shi, Zhipeng Weng, Yin Tian, Xiaoyu Xian, Liang Lin

TL;DR

CrossFuse tackles out-of-distribution challenges in infrared-visible image fusion by adopting a data-centric approach that combines external Top-K Selective Channel Alignment with internal weak-aggressive augmentation. A frequency-aware fusion network integrates multi-scale features from infrared and visible modalities, enabling robust fusion under distribution shifts. Extensive experiments on RoadScene, MSRS, M3FD, and TNO demonstrate superior robustness and fusion quality across diverse conditions, highlighting improvements in both perceptual and objective metrics. The work advances practical deployment of IVIF in open-world settings by enhancing generalization and stability while outlining paths for efficiency optimization and broader applicability.

Abstract

Infrared and visible image fusion (IVIF) is increasingly applied in critical fields such as video surveillance and autonomous driving systems. Significant progress has been made in deep learning-based fusion methods. However, these models frequently encounter out-of-distribution (OOD) scenes in real-world applications, which severely degrade their performance and reliability. Addressing the challenge of OOD data is therefore crucial for the safe deployment of these models in open-world environments. Unlike existing research, we focus on the challenges posed by OOD data in real-world applications and on enhancing the robustness and generalization of models. In this paper, we propose an infrared-visible fusion framework based on Multi-View Augmentation. For external data augmentation, Top-K Selective Vision Alignment is employed to mitigate distribution shifts between datasets by performing RGB-wise transformations on visible images. This strategy introduces augmented samples that improve the adaptability of the model to complex real-world scenarios. For internal data augmentation, self-supervised learning is established using Weak-Aggressive Augmentation, which enables the model to learn more robust and general feature representations during the fusion process. Extensive experiments demonstrate that the proposed method exhibits superior performance and robustness across various conditions and environments. Our approach significantly enhances the reliability and stability of IVIF in practical applications.

Paper Structure

This paper contains 27 sections, 16 equations, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overall framework of the proposed method. External data augmentation uses Top-K selective channel alignment to expand the scale of the training data. Internal data augmentation further employs weak-aggressive augmentation for self-supervised learning. The frequency-aware fusion network achieves more comprehensive and detailed feature extraction and fusion.
  • Figure 2: Implementation details of external Top-K Selective Channel Alignment. By selectively applying an RGB-wise gamma transformation to a localized region of each external image, this strategy alleviates distribution shifts and augments the training data.
  • Figure 3: Comparison of the RGB-wise distribution of datasets before and after Top-K channel alignment processing. Following channel alignment, the distribution of the external-augmented dataset closely approximates that of the target dataset.
  • Figure 4: Example images from M3FD and RoadScene datasets to show the effects of channel alignment.
  • Figure 5: Implementation details of the Internal-view Augmentation for Self-supervised Learning. By contrasting multiple views produced by weak-aggressive augmentation, self-supervised learning is established, enhancing the final fusion outcomes.
  • ...and 4 more figures
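
The caption of Figure 2 mentions an RGB-wise gamma transformation applied to a localized region of an external image. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes images are nested lists of RGB tuples with values in (0, 1), that the region is an axis-aligned box, and that each channel's gamma is chosen so the region's mean value maps onto a target-dataset mean. The function names, the box format, and the mean-matching rule are all hypothetical. The Top-K selection step is not described in this overview and is left out.

```python
import math


def channel_means(image, box):
    """Mean of each RGB channel inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    sums = [0.0, 0.0, 0.0]
    count = 0
    for row in image[top:bottom]:
        for pixel in row[left:right]:
            for c in range(3):
                sums[c] += pixel[c]
            count += 1
    return [s / count for s in sums]


def matching_gammas(source_means, target_means, eps=1e-6):
    """Per-channel gamma with source_mean ** gamma == target_mean (assumed rule)."""
    gammas = []
    for s, t in zip(source_means, target_means):
        # Clamp away from 0 and 1 so both logarithms are finite and non-zero.
        s = min(max(s, eps), 1.0 - eps)
        t = min(max(t, eps), 1.0 - eps)
        gammas.append(math.log(t) / math.log(s))
    return gammas


def apply_gamma(image, box, gammas):
    """Return a copy of image with value ** gamma applied per channel inside box."""
    top, left, bottom, right = box
    out = [[tuple(pixel) for pixel in row] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = tuple(v ** g for v, g in zip(image[y][x], gammas))
    return out


# Usage: pull the top row of a tiny 2x2 image toward hypothetical target means.
image = [[(0.25, 0.5, 0.5), (0.25, 0.5, 0.5)],
         [(0.25, 0.5, 0.5), (0.25, 0.5, 0.5)]]
box = (0, 0, 1, 2)
gammas = matching_gammas(channel_means(image, box), [0.5, 0.5, 0.25])
aligned = apply_gamma(image, box, gammas)
```

Because a gamma curve is nonlinear, the mean of the transformed region equals the target mean exactly only when the region is uniform. For real images this rule gives an approximate match, and a histogram-based criterion could be substituted.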