Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond

Pengwei Liang, Junjun Jiang, Qing Ma, Xianming Liu, Jiayi Ma

TL;DR

DeFusion++ stands out by producing versatile fused representations that can enhance both the quality of image fusion and the effectiveness of downstream high-level vision tasks, simplifying the process with an elegant fusion framework.

Abstract

Image fusion is well known as an alternative to single-image restoration: rather than restoring one degraded image, it generates a single high-quality image from multiple images. The essence of image fusion is to integrate complementary information from source images. Existing fusion methods struggle to generalize across tasks and often require labor-intensive designs, because the diverse requirements of each fusion task make it difficult to identify and extract useful information from source images. Additionally, these methods develop highly specialized features for different downstream applications, which hinders adaptation to new and diverse downstream tasks. To address these limitations, we introduce DeFusion++, a novel framework that leverages self-supervised learning (SSL) to enhance the versatility of feature representation for different image fusion tasks. DeFusion++ captures image fusion task-friendly representations from large-scale data in a self-supervised way, overcoming the constraints of limited fusion datasets. Specifically, we introduce two innovative pretext tasks: common and unique decomposition (CUD) and masked feature modeling (MFM). CUD decomposes source images into abstract common and unique components, while MFM refines these components into robust fused features. Joint training of these tasks enables DeFusion++ to produce adaptable representations that can effectively extract useful information from various source images, regardless of the fusion task. The resulting fused representations are also highly adaptable to a wide range of downstream tasks, including image segmentation and object detection. DeFusion++ stands out by producing versatile fused representations that can enhance both the quality of image fusion and the effectiveness of downstream high-level vision tasks, simplifying the process with an elegant fusion framework.

Paper Structure

This paper contains 32 sections, 9 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Overview of image fusion pipelines. Panels (a) and (b) illustrate traditional image fusion approaches, while (c) shows our proposed DeFusion++ pipeline, which supports a wide range of image fusion and downstream tasks. In our method, we propose two self-supervised pretext tasks: multi-modal common and unique decomposition (CUD) and masked feature modeling (MFM).
  • Figure 2: The overall framework of DeFusion++. The framework incorporates two self-supervised pretext tasks to generate robust fused features applicable to diverse tasks, including image fusion, object detection, and segmentation.
  • Figure 3: The paradigm of common and unique decomposition (CUD) and masked feature modeling (MFM). Panel (a) depicts the detailed process of CUD, and panel (b) illustrates the main idea of MFM. In (c), heatmaps overlaid on the source images highlight the areas attended to by the unique and common features, showing where these features are identified in the training and testing phases.
  • Figure 4: The mechanism of the cross attention layer, which processes two input features $\text{H}(\boldsymbol{x}^1), \text{H}(\boldsymbol{x}^2)$. Based on the task indicator $\mathcal{T}$, the layer selects different $Q$ (query), $K$ (key), and $V$ (value) inputs to generate either common or unique features, as required by the (M)CUD tasks.
  • Figure 5: Architectural diagram of the multi-modal common and unique decomposition (MCUD). By introducing the cross attention layer and the modality pretrained model, we ensure that the extracted features preserve the modality-specific information.
  • ...and 13 more figures
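
The Figure 4 caption describes a cross attention layer that picks its $Q$, $K$, and $V$ inputs according to a task indicator $\mathcal{T}$. The sketch below is only a rough illustration of that idea in NumPy, not the authors' implementation. The caption does not say which features feed $Q$, $K$, and $V$ for each task, so the mapping used here (query from one source and key/value from the other for "common", all three from the same source for "unique") is a hypothetical choice, as are the function name `cross_attention`, the string values of `task`, and the projection matrices.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(h1, h2, task, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over two feature sets.

    h1, h2 : arrays of shape (tokens, dim), standing in for H(x^1), H(x^2).
    task   : hypothetical encoding of the task indicator T.
    """
    if task == "common":
        # ASSUMPTION: query from source 1, key/value from source 2.
        q_src, kv_src = h1, h2
    elif task == "unique":
        # ASSUMPTION: query, key, and value all from source 1.
        q_src, kv_src = h1, h1
    else:
        raise ValueError(f"unknown task indicator: {task!r}")

    q = q_src @ w_q
    k = kv_src @ w_k
    v = kv_src @ w_v

    # Each query token attends over all key tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


# Toy inputs: 4 tokens from source 1, 6 tokens from source 2, 8 channels.
rng = np.random.default_rng(0)
dim = 8
h1 = rng.normal(size=(4, dim))
h2 = rng.normal(size=(6, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

common, attn_c = cross_attention(h1, h2, "common", w_q, w_k, w_v)
unique, attn_u = cross_attention(h1, h2, "unique", w_q, w_k, w_v)
print(common.shape, attn_c.shape, attn_u.shape)
```

Routing the inputs on `task` lets one set of projection weights serve both branches. Whether DeFusion++ shares weights this way is not stated in the caption.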