DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion
Yuchen Guo, Ruoxiang Xu, Rongcheng Li, Weifeng Su
TL;DR
DAE-Fuse tackles robust multi-modality image fusion under adverse conditions by introducing a two-phase discriminative autoencoder with a cross-modality attention fusion module. Phase 1 enhances feature extraction with a dual-branch encoder and adversarial discriminators, while Phase 2 performs cross-modality fusion through cross-attention and an adversarial fusion objective to avoid modality bias. The approach achieves state-of-the-art results on infrared-visible image fusion benchmarks and improves object detection performance, with strong generalization to medical image fusion tasks and initial temporal consistency for video fusion. This work advances practical perception for autonomous navigation and surveillance by delivering sharp, texture-rich fused images and temporally stable video outputs.
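As a rough illustration of the cross-attention idea mentioned above, the following minimal sketch lets each modality's feature tokens attend to the other's and averages the two directions. It is not the authors' implementation. The function names (`cross_attend`, `fuse`), the identity query/key/value projections, the equal-weight averaging, and the assumption that both modalities provide the same number of tokens are all hypothetical simplifications chosen for readability.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other. Learned projections are omitted (assumption)."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query token to every token of the other modality.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's tokens.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def fuse(ir_tokens, vis_tokens):
    """Hypothetical symmetric fusion: average both attention directions.
    Assumes both inputs have the same number of tokens and feature size."""
    ir_from_vis = cross_attend(ir_tokens, vis_tokens)
    vis_from_ir = cross_attend(vis_tokens, ir_tokens)
    return [
        [0.5 * (a + b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(ir_from_vis, vis_from_ir)
    ]


# Toy usage: two tokens per modality, two features per token.
ir = [[1.0, 0.0], [0.0, 1.0]]
vis = [[0.5, 0.5], [0.5, 0.5]]
fused = fuse(ir, vis)
```

Because the two attention directions are averaged with equal weight, swapping the inputs leaves the output unchanged, which is one simple way to keep a fusion step from favoring either modality. The paper's adversarial fusion objective is a separate mechanism and is not modeled here.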
Abstract
In extreme scenarios such as nighttime or low-visibility environments, reliable perception is critical for applications like autonomous driving, robotics, and surveillance. Multi-modality image fusion, particularly when it integrates infrared imaging, offers a robust solution by combining complementary information from different modalities to enhance scene understanding and decision-making. However, current methods face significant limitations: GAN-based approaches often produce blurry images that lack fine-grained details, while AE-based methods may introduce bias toward specific modalities, leading to unnatural fusion results. To address these challenges, we propose DAE-Fuse, a novel two-phase discriminative autoencoder framework that generates sharp and natural fused images. Furthermore, we pioneer the extension of image fusion techniques from static images to the video domain while preserving temporal consistency across frames, thus advancing the perceptual capabilities required for autonomous navigation. Extensive experiments on public datasets demonstrate that DAE-Fuse achieves state-of-the-art performance on multiple benchmarks and generalizes well to other tasks such as medical image fusion.
