OCCO: LVM-guided Infrared and Visible Image Fusion Framework based on Object-aware and Contextual COntrastive Learning

Hui Li, Congcong Bian, Zeyang Zhang, Xiaoning Song, Xi Li, Xiao-Jun Wu

TL;DR

The paper addresses the challenge of balancing high-quality fused images with strong downstream task performance in infrared–visible image fusion. It introduces OCCO, a large-vision-model guided framework that leverages SAM and Grounding DINO to obtain semantic masks and employs contextual contrastive learning in a latent contextual space, integrated via a Feature Interaction Fusion Network. The method defines three semantic-based contrastive losses and combines them with pixel-level losses to form a total objective, improving saliency of targets while preserving scene information. Extensive experiments on four datasets against eight baselines show superior fusion quality and enhanced downstream detection, highlighting the approach's robustness across diverse conditions and its practical potential for surveillance and multi-modal sensing.

Abstract

Image fusion is a crucial technique in computer vision whose goal is to generate high-quality fused images and improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two factors: achieving high quality in fused images may lower performance in downstream visual tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning is proposed, termed OCCO. The pre-trained LVM provides semantic guidance, allowing the network to focus solely on the fusion task while emphasizing the learning of salient semantic features in the form of contrastive learning. Additionally, a novel feature interaction fusion network is designed to resolve information conflicts in fused images caused by modality differences. By learning the distinction between positive and negative samples in the latent feature space (contextual space), the integrity of target information in the fused image is improved, thereby benefiting downstream performance. Finally, the effectiveness of the proposed method is validated against eight state-of-the-art methods on four datasets, and exceptional performance is also demonstrated on a downstream visual task.
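The idea of separating positive from negative samples in a latent feature space can be illustrated with a small sketch. This is not the paper's loss: it swaps in the standard InfoNCE formulation purely for illustration, and the names `masked_pool` and `info_nce`, the `temperature` value, and the use of a plain binary array as the object mask are all assumptions. In the paper the masks come from the pre-trained LVM; here they are ordinary NumPy arrays.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def masked_pool(features, mask, eps=1e-8):
    """Average a (C, H, W) feature map over the pixels selected by an (H, W) mask.

    Hypothetical helper: turns an object region into a single C-dim vector.
    """
    weights = mask.astype(features.dtype)
    return (features * weights).sum(axis=(1, 2)) / (weights.sum() + eps)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one anchor (C,), one positive (C,), K negatives (K, C).

    Assumption: cosine similarity with a temperature; the paper's actual
    contrastive losses may be defined differently.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(negatives)
    pos_logit = np.dot(a, p) / temperature
    neg_logits = (n @ a) / temperature
    logits = np.concatenate([[pos_logit], neg_logits])
    logits = logits - logits.max()  # subtract the max for numerical stability
    # Negative log-probability of picking the positive among all candidates.
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

In this sketch the anchor would be a pooled object feature from the fused image, the positive the matching object feature from a source image, and the negatives features from other regions or samples. The loss is small when the anchor sits close to the positive and far from every negative.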

Paper Structure

This paper contains 20 sections, 19 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Panels (a) to (d) show a visual-driven method, a semantic-driven method, a feature-level semantic-driven method, and an object-aware method. The visual quality in (a) is outstanding, with information integrity preserved, but downstream task performance is slightly weaker. Combining segmentation networks in (b) and (c) enhances downstream performance but degrades the quality of the fused images. Focusing exclusively on the fusion task in (d) improves downstream performance without disrupting the fusion results.
  • Figure 2: The pipeline of OCCO and the framework of the Feature Interaction Fusion Network. During training, only the first group of fused images within a batch serves as anchor samples and the source images are treated as positive samples, while all data generate negative samples.
  • Figure 3: The architecture of FIFB, which comprises three parts: spatial enhancement, cross channel, and cross attention.
  • Figure 4: (a) Demonstration of the single-modal mask generation process. (b) Label provided in the MSRS dataset.
  • Figure 5: Demonstration of mask discernment.
  • ...and 10 more figures