MaeFuse: Transferring Omni Features with Pretrained Masked Autoencoders for Infrared and Visible Image Fusion via Guided Training

Jiayang Li, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie

TL;DR

MaeFuse tackles infrared-visible image fusion by leveraging a pretrained MAE encoder to extract omni features that cover both high-level semantics and low-level textures. A guided two-stage training strategy aligns the fusion layer with the encoder’s feature space, using CFM and MFM to preserve contour and detail while avoiding ViT-induced block artifacts. The method achieves competitive or superior results on multiple public datasets without relying on downstream-task supervision, demonstrating strong generalization and robust texture-semantics fusion. This approach reduces data-label requirements and points to omni-feature fusion as a viable direction for future IVIF research.

Abstract

In this paper, we introduce MaeFuse, a novel autoencoder model designed for Infrared and Visible Image Fusion (IVIF). Existing approaches to image fusion often rely on training combined with downstream tasks to obtain high-level visual information, which is effective in emphasizing target objects and delivering impressive results in visual quality and task-specific applications. Instead of being driven by downstream tasks, MaeFuse utilizes a pretrained encoder from Masked Autoencoders (MAE), which facilitates the extraction of omni features for low-level reconstruction and high-level vision tasks, to obtain perception-friendly features at low cost. To eliminate the domain gap between the features of different modalities and the block effect caused by the MAE encoder, we further develop a guided training strategy. This strategy ensures that the fusion layer adjusts to the feature space of the encoder, gradually enhancing fusion performance. The proposed method facilitates the comprehensive integration of feature vectors from both infrared and visible modalities, thus preserving the rich details inherent in each modality. MaeFuse not only introduces a novel perspective on fusion techniques but also delivers impressive performance across various public datasets.

Paper Structure (18 sections, 12 equations, 17 figures, 3 tables, 1 algorithm)

Figures (17)

  • Figure 1: Diagram illustrating image-level downstream task-driven fusion and feature-level downstream task-driven fusion.
  • Figure 2: The fusion results with different fusion strategies: (a) mean fusion for two features, (b) cross-attention based fusion without alignment in the feature domain, (c) cross-attention based fusion with alignment in the feature domain, and (d) our MaeFuse. Here we only show the grayscale images for better comparison.
  • Figure 3: Workflow of our proposed MaeFuse. The upper part describes the overall architecture of the network, while the lower part elaborates on the content of the CFM and MFM structures. CFM cross-learning retains useful information, and MFM fuses enriched detail information based on the output content of CFM as a reference. The FFN employs merely two layers of fully connected neural networks, aimed specifically at enhancing the model’s capacity for non-linear learning.
  • Figure 4: The scene '00909N' is from the MSRS dataset. The first row shows visible images, while the second row shows infrared images. The first column contains the original images, the second column contains gradient images, and the third column contains second-derivative images. $\nabla$ is the Sobel operator, and $\Delta$ is the Laplacian operator.
  • Figure 5: Schematic diagram of our two-stage training approach. The left side of each fusion module displays the features obtained by the encoder, whereas the right side shows the features obtained by the fusion layer. The first stage aligns the feature domains of the fusion layer and the encoder. The second stage continues training with the fusion loss function in Eq. (\ref{eq:loss_total}). This two-stage training strategy is designed to avoid becoming trapped in local optima.
  • ...and 12 more figures
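
The Figure 4 caption refers to gradient images computed with the Sobel operator ($\nabla$) and second-derivative images computed with the Laplacian operator ($\Delta$). The following is a minimal NumPy sketch of those two operators on a toy image, included only to make the notation concrete. The 3x3 kernels are the standard textbook ones; the edge-replicated padding, the helper name `filter3x3`, and the synthetic step-edge input are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

# Standard 3x3 kernels (textbook definitions, not taken from the paper).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)


def filter3x3(image, kernel):
    """Cross-correlate a 2-D image with a 3x3 kernel.

    Borders use edge-replicated padding (an assumption of this sketch),
    so the output has the same shape as the input.
    """
    image = np.asarray(image, dtype=np.float64)
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Accumulate one shifted, weighted copy of the image per kernel entry.
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_magnitude(image):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy)


def laplacian(image):
    """Second-derivative response using the 4-neighbour Laplacian kernel."""
    return filter3x3(image, LAPLACIAN)


# Toy input: a vertical step edge (left half 0, right half 1).
step = np.zeros((6, 6))
step[:, 3:] = 1.0

grad = sobel_magnitude(step)  # responds on the two columns next to the edge
lap = laplacian(step)         # changes sign across the edge
```

On this step edge the Sobel magnitude is nonzero only in the two columns adjacent to the edge, while the Laplacian is positive on the dark side and negative on the bright side. This illustrates why a first-derivative map highlights contours and a second-derivative map highlights finer detail.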