MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image Fusion
Jingxue Huang, Xilai Li, Tianshu Tan, Xiaosong Li, Tao Ye
TL;DR
The paper addresses infrared-visible image fusion (IVIF) by tackling information-space inconsistencies between modalities that lead to information loss or bias under symmetric fusion. It introduces MMA-UNet, a two-stream architecture with modality-specific encoders (IR-UNet and VI-UNet) and a cross-scale, asymmetric fusion strategy guided by VI information to balance representation spaces. A Centered Kernel Alignment (CKA) analysis supports the design, showing VI features reach deep semantic space faster than IR, which motivates the asymmetric fusion and VI-guided training; the overall loss combines $Loss_{mse}$, $Loss_{ssim}$, and $Loss_{det}$ to ensure reconstruction fidelity, structural integrity, and edge preservation. Empirical results on M3FD and MSRS demonstrate state-of-the-art fusion quality and improved performance on downstream tasks such as detection and segmentation, highlighting MMA-UNet's practical potential for robust multi-modal sensing and analysis.
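To make the CKA analysis mentioned above concrete, here is a minimal sketch of the linear form of Centered Kernel Alignment, which scores the similarity of two sets of features extracted from the same samples. The linear kernel is an assumption; the paper may use a different kernel or estimator. The function name `linear_cka` and the `(n_samples, n_features)` layout are hypothetical choices made for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices.

    x: array of shape (n_samples, d1); y: array of shape (n_samples, d2).
    Rows must correspond to the same samples in both matrices.
    Assumption: linear kernel; not necessarily the paper's exact variant.
    """
    # Center each feature dimension across samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)
```

In an analysis of this kind, one would typically flatten the activations of an IR encoder layer and a VI encoder layer into such matrices and compare scores across depths. A score near 1 indicates closely aligned representations, and a score near 0 indicates dissimilar ones.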
Abstract
Multi-modal image fusion (MMIF) maps useful information from various modalities into the same representation space, thereby producing an informative fused image. However, existing fusion algorithms tend to fuse multi-modal images symmetrically, which causes the loss of shallow information or a bias towards a single modality in certain regions of the fused result. In this study, we analyze how the spatial distribution of information differs across modalities and show that encoding features within the same network is not conducive to aligning the deep feature spaces of multi-modal images simultaneously. To overcome this issue, we propose a Multi-Modal Asymmetric UNet (MMA-UNet). We separately train a specialized feature encoder for each modality and implement a cross-scale fusion strategy that keeps the features from different modalities within the same representation space, ensuring a balanced information fusion process. Extensive fusion and downstream-task experiments demonstrate the effectiveness of MMA-UNet in fusing infrared and visible image information, producing visually natural and semantically rich fusion results. Its performance surpasses that of the state-of-the-art fusion methods used for comparison.
