DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion

Yuchen Guo, Ruoxiang Xu, Rongcheng Li, Weifeng Su

TL;DR

DAE-Fuse tackles robust multi-modality image fusion under adverse conditions by introducing a two-phase discriminative autoencoder with a cross-modality attention fusion module. Phase 1 enhances feature extraction with a dual-branch encoder and adversarial discriminators, while Phase 2 performs cross-modality fusion through cross-attention and an adversarial fusion objective to avoid modality bias. The approach achieves state-of-the-art results on infrared-visible image fusion benchmarks and improves object detection performance, with strong generalization to medical image fusion tasks and initial temporal consistency for video fusion. This work advances practical perception for autonomous navigation and surveillance by delivering sharp, texture-rich fused images and temporally stable video outputs.

Abstract

In extreme scenarios such as nighttime or low-visibility environments, achieving reliable perception is critical for applications like autonomous driving, robotics, and surveillance. Multi-modality image fusion, particularly integrating infrared imaging, offers a robust solution by combining complementary information from different modalities to enhance scene understanding and decision-making. However, current methods face significant limitations: GAN-based approaches often produce blurry images that lack fine-grained details, while AE-based methods may introduce bias toward specific modalities, leading to unnatural fusion results. To address these challenges, we propose DAE-Fuse, a novel two-phase discriminative autoencoder framework that generates sharp and natural fused images. Furthermore, we pioneer the extension of image fusion techniques from static images to the video domain while preserving temporal consistency across frames, thus advancing the perceptual capabilities required for autonomous navigation. Extensive experiments on public datasets demonstrate that DAE-Fuse achieves state-of-the-art performance on multiple benchmarks, with superior generalizability to tasks like medical image fusion.

Paper Structure (28 sections, 15 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The workflow of the adversarial feature extraction phase. The cross-attention module used for fusion is omitted in this phase.
  • Figure 2: The workflow of the attention-guided cross-modality fusion phase.
  • Figure 3: Object detection ability of DAE-Fuse: in the visible image, the car on the right is detected but the people are missed; in the infrared image, the opposite holds for these two objects; in the fused image from DAE-Fuse, all of them are detected.
  • Figure 4: Qualitative comparison with state-of-the-art methods on the TNO dataset.
  • Figure 5: Qualitative comparison with state-of-the-art methods on the MRI-CT dataset.