Equivariant Multi-Modality Image Fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Kai Zhang, Shuang Xu, Dongdong Chen, Radu Timofte, Luc Van Gool

TL;DR

This work tackles the lack of ground-truth fused images in multi-modality fusion by introducing EMMA, a self-supervised framework grounded in an equivariant imaging prior. EMMA combines a U-Fuser-based fusion module, a learned pseudo-sensing network whose parameters are frozen during fusion training, and an equivariant fusion module, which together enforce sensing-imaging consistency and transformation equivariance. The approach yields high-quality infrared–visible and medical image fusion and improves downstream segmentation and detection. Experiments and ablations show that both the sensing loss and the equivariant loss are necessary, and that the model generalizes well from infrared-visible fusion (IVF) to medical image fusion (MIF).

Abstract

Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.

Paper Structure (13 sections, 1 theorem, 6 equations, 4 figures, 4 tables)

Key Result

Theorem 1

If we regard $\mathcal{I}$ in Definition 2 as the composite function $\mathcal{F}\!\circ\!\mathcal{A}$, where $\mathcal{F}$ is the fusion model and $\mathcal{A}$ (comprising $\mathcal{A}_i$ and $\mathcal{A}_v$) is the sensing model, the equivariant image fusion theorem states:
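
Substituting $\mathcal{I}=\mathcal{F}\!\circ\!\mathcal{A}$ into the usual equivariance condition $\mathcal{I}(T_g\boldsymbol{x}) = T_g\,\mathcal{I}(\boldsymbol{x})$ suggests the form below. This is a reconstruction from the description above and the workflow in Figure 1, and it assumes that Definition 2 is the standard definition of an equivariant function; it is not quoted from the paper.

$$
\mathcal{F}\big(\mathcal{A}_i(T_g\boldsymbol{f}),\ \mathcal{A}_v(T_g\boldsymbol{f})\big) \;=\; T_g\,\mathcal{F}\big(\mathcal{A}_i(\boldsymbol{f}),\ \mathcal{A}_v(\boldsymbol{f})\big)
$$

Read this way, and assuming the sensing models approximately recover the inputs, $\mathcal{A}_i(\boldsymbol{f})\approx\boldsymbol{i}$ and $\mathcal{A}_v(\boldsymbol{f})\approx\boldsymbol{v}$, the right-hand side is the transformed fused image $\boldsymbol{f}_t$ and the left-hand side is the re-fused image $\boldsymbol{\hat{f}}_t$ of Figure 1.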

Figures (4)

  • Figure 1: Workflow of EMMA. The image pair $\left\{\boldsymbol{i},\boldsymbol{v}\right\}$ is first fed into the U-Fuser $\mathcal{F}$, which produces the fused image $\boldsymbol{f}$. Next, a series of transformations $T_g$ (shift, rotation, reflection, etc.) is applied to $\boldsymbol{f}$ to produce $\boldsymbol{f}_t$. $\boldsymbol{f}_t$ is then fed into the parameter-frozen $\left\{\mathcal{A}_i,\mathcal{A}_v\right\}$ to generate the pseudo-sensing images $\left\{\boldsymbol{i}_t,\boldsymbol{v}_t\right\}$, which are finally fed back into $\mathcal{F}$ to obtain the re-fused image $\boldsymbol{\hat{f}}_t$.
  • Figure 2: Visual comparison of "06832" from the RoadScene [xu2020aaai] IVF dataset.
  • Figure 3: Visual comparison of "00782N" from the MSRS [DBLP:journals/inffus/TangYZJM22] IVF dataset.
  • Figure 4: Visual comparison for MIF task.
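
The Figure 1 workflow can be traced numerically with a small NumPy sketch. Everything in it is a hypothetical stand-in chosen for illustration: `fuse` (a pixel-wise mean) replaces the U-Fuser $\mathcal{F}$, `sense_i` and `sense_v` (fixed pixel-wise scalings) replace the frozen pseudo-sensing networks $\mathcal{A}_i$ and $\mathcal{A}_v$, and the L1 form of both losses is an assumption, since the loss definitions are not given in this summary. It is not the authors' implementation.

```python
import numpy as np

def fuse(i, v):
    """Toy stand-in for the fusion model F (assumption): pixel-wise mean."""
    return 0.5 * (i + v)

def sense_i(f):
    """Toy stand-in for the frozen pseudo-sensing model A_i (assumption)."""
    return 0.8 * f

def sense_v(f):
    """Toy stand-in for the frozen pseudo-sensing model A_v (assumption)."""
    return 1.2 * f

def transform(f, k=1, flip=False, shift=0):
    """One T_g: circular shift, then k 90-degree rotations, then optional reflection."""
    out = np.roll(f, shift, axis=1)
    out = np.rot90(out, k)
    return np.fliplr(out) if flip else out

def emma_losses(i, v, fuse_fn=fuse, **t_kwargs):
    """Follow Figure 1 and return (sensing loss, equivariant loss), both L1 (assumption)."""
    f = fuse_fn(i, v)                       # fused image f
    f_t = transform(f, **t_kwargs)          # transformed fused image f_t
    i_t, v_t = sense_i(f_t), sense_v(f_t)   # pseudo-sensing images {i_t, v_t}
    f_hat_t = fuse_fn(i_t, v_t)             # re-fused image \hat{f}_t
    sensing = np.abs(sense_i(f) - i).mean() + np.abs(sense_v(f) - v).mean()
    equivariant = np.abs(f_hat_t - f_t).mean()
    return sensing, equivariant

rng = np.random.default_rng(0)
i = rng.random((8, 8))  # synthetic "infrared" image
v = rng.random((8, 8))  # synthetic "visible" image

# With these pixel-wise stand-ins, re-fusion reproduces f_t up to rounding.
sensing_loss, equiv_loss = emma_losses(i, v, k=1, flip=True, shift=2)

def biased_fuse(i, v):
    """A fusion rule with a position-dependent bias, used as a counterexample."""
    ramp = np.linspace(0.0, 1.0, i.shape[1])[None, :]
    return 0.5 * (i + v) + ramp

# The bias makes the re-fused image drift from f_t, so the loss is clearly positive.
_, equiv_loss_biased = emma_losses(i, v, fuse_fn=biased_fuse, k=1, flip=True, shift=2)

print(f"sensing={sensing_loss:.4f}  equivariant={equiv_loss:.2e}  biased={equiv_loss_biased:.4f}")
```

The pixel-wise stand-ins were picked so that the equivariant term vanishes, which makes the counterexample easy to read. In an actual training setup the two terms would be minimized with respect to the fusion network's parameters while the sensing networks stay fixed.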

Theorems & Definitions (5)

  • Definition 1: Invariant set
  • Definition 2: Equivariant function
  • Theorem 1: Equivariant image fusion theorem
  • Proof
  • Remark 1