
MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image Fusion

Jingxue Huang, Xilai Li, Tianshu Tan, Xiaosong Li, Tao Ye

TL;DR

The paper addresses infrared-visible image fusion (IVIF) by tackling information-space inconsistencies between modalities that lead to information loss or bias under symmetric fusion. It introduces MMA-UNet, a two-stream architecture with modality-specific encoders (IR-UNet and VI-UNet) and a cross-scale, asymmetric fusion strategy guided by VI information to balance representation spaces. A Centered Kernel Alignment (CKA) analysis supports the design, showing VI features reach deep semantic space faster than IR, which motivates the asymmetric fusion and VI-guided training; the overall loss combines $Loss_{mse}$, $Loss_{ssim}$, and $Loss_{det}$ to ensure reconstruction fidelity, structural integrity, and edge preservation. Empirical results on M3FD and MSRS demonstrate state-of-the-art fusion quality and improved performance on downstream tasks such as detection and segmentation, highlighting MMA-UNet's practical potential for robust multi-modal sensing and analysis.
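
To make the CKA analysis concrete, the following is a minimal sketch of how a layer-by-layer similarity map between two networks could be computed. It assumes the linear variant of CKA and activations already flattened to shape `(n_samples, n_features)`. This summary does not state which CKA kernel the paper uses, and the function names `linear_cka` and `cka_matrix` are illustrative, not the authors' code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Assumption: linear kernel; the paper's exact CKA variant is not given here.
    """
    # Center every feature column (equivalent to centering the Gram matrices).
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance (unnormalized linear HSIC).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalizers: Frobenius norms of each representation's own covariance.
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_matrix(feats_a, feats_b) -> np.ndarray:
    """Pairwise CKA between every layer in feats_a and every layer in feats_b.

    Both arguments are lists of (n_samples, n_features) arrays computed on the
    same n_samples inputs; feature widths may differ between layers.
    """
    return np.array([[linear_cka(a, b) for b in feats_b] for a in feats_a])
```

Passing the per-layer activations of one network as both arguments gives a within-network map; passing the activations of two different networks (for example, hypothetical lists `ir_feats` and `vi_feats`) gives a cross-network map. The score is invariant to isotropic scaling and orthogonal transforms of the features, which is what makes layers of different networks comparable.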

Abstract

Multi-modal image fusion (MMIF) maps useful information from various modalities into the same representation space, thereby producing an informative fused image. However, existing fusion algorithms tend to fuse multi-modal images symmetrically, causing the loss of shallow information or a bias towards a single modality in certain regions of the fusion results. In this study, we analyzed the spatial distribution differences of information in different modalities and proved that encoding features within the same network is not conducive to achieving simultaneous deep feature space alignment for multi-modal images. To overcome this issue, a Multi-Modal Asymmetric UNet (MMA-UNet) was proposed. We separately trained specialized feature encoders for the different modalities and implemented a cross-scale fusion strategy to maintain the features from different modalities within the same representation space, ensuring a balanced information fusion process. Furthermore, extensive fusion and downstream task experiments were conducted to demonstrate the efficiency of MMA-UNet in fusing infrared and visible image information, producing visually natural and semantically rich fusion results. Its performance surpasses that of the state-of-the-art comparison fusion methods.

Paper Structure

This paper contains 17 sections, 5 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Centered Kernel Alignment (CKA). (a) and (b) show the CKA similarity between all pairs of layers within a single neural network. (c) shows the CKA similarity between all pairs of layers of IR-UNet and VI-UNet. The x and y axes give the layer indices.
  • Figure 2: Workflow of MMA-UNet.
  • Figure 3: VI1 denotes the intermediate feature representation of the first convolutional layer in UNet; IR1, IR2, IR3, and VI2 are defined analogously. VI1+IR1 indicates that the two features are added together to obtain a fusion map; the same applies to VI1+IR2, VI2+IR2, and VI2+IR3. For ease of presentation, the sampling operations are omitted.
  • Figure 4: Centered Kernel Alignment (CKA). (a) and (b) show the CKA similarity between all pairs of layers in IR-UNet with and without the guidance mechanism, respectively.
  • Figure 5: Subjective comparisons of fusion results obtained by MMA-UNet and the SoTA comparison methods on M3FD and MSRS.
  • ...and 2 more figures
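
The additive pairings named in the Figure 3 caption (for example VI1+IR2 and VI2+IR3) can be illustrated with a minimal sketch. It assumes `(C, H, W)` feature maps whose channel counts already match and whose spatial sizes differ by an integer factor. It also fills in the sampling step that the caption omits with nearest-neighbour upsampling, which is an assumption. The function names and the `offset` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def upsample_nearest(feat: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, H, W) feature map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_asymmetric(vi_feats, ir_feats, offset: int = 1):
    """Add VI layer i to IR layer i + offset (e.g. VI1+IR2, VI2+IR3 for offset=1).

    Assumptions (not confirmed by the source): channel counts match, and the
    deeper IR map is brought to the VI map's resolution before the addition.
    """
    fused = []
    for i, vi in enumerate(vi_feats):
        j = i + offset
        if j >= len(ir_feats):
            break  # no deeper IR layer left to pair with
        ir = ir_feats[j]
        factor = vi.shape[1] // ir.shape[1]  # spatial ratio between the two scales
        fused.append(vi + upsample_nearest(ir, factor))
    return fused


# Toy feature pyramids: two VI scales and three IR scales, 4 channels each.
vi_feats = [np.ones((4, 8, 8)), np.ones((4, 4, 4))]
ir_feats = [np.zeros((4, 8, 8)), 2 * np.ones((4, 4, 4)), 3 * np.ones((4, 2, 2))]
fused = fuse_asymmetric(vi_feats, ir_feats, offset=1)
```

Setting `offset=0` reduces this to same-scale pairing (VI1+IR1, VI2+IR2). A positive offset pairs each VI layer with a deeper IR layer, in line with the observation that VI features reach deep semantic space faster than IR features.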