MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao, Yu Zhong, Ziqi Wang, Liang-Jian Deng

TL;DR

MMAIF performs multiple image fusion tasks under realistic degradations by unifying restoration and fusion in a single model guided by language prompts. It introduces a real-world degradation pipeline and an all-in-one latent-space Diffusion Transformer (DiT) with architectural modifications, offered in two variants: deterministic regression and generative flow matching. The approach outperforms restoration+fusion pipelines and prior all-in-one methods on visible-infrared fusion (VIF), multi-exposure fusion (MEF), and multi-focus fusion (MFF), and it achieves substantial inference speedups by operating in latent space. Prompt-guided degradation handling, a DiT augmented with mixture-of-experts (MoE) layers, and efficient flow sampling together yield a practical, flexible solution for real-world fusion scenarios.

Abstract

Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which produces a clean fused image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Code is available at https://github.com/294coder/MMAIF.

Paper Structure

This paper contains 26 sections, 12 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Main concept of the proposed MMAIF.
  • Figure 2: The proposed real-world image degradation pipeline. Clean image pairs from multiple tasks are sent to a pre-trained fusion network to obtain the ground truth (GT). The same pairs are also passed through compositional degradation operators (e.g., haze, rain) to obtain degraded pairs. ChatGPT is used to generate restoration prompts, so each collected sample consists of a degraded pair, a clean pair, the GT, and a prompt.
  • Figure 3: Comparisons with previous pipelines. (a) Naive restoration + fusion pipeline. Its inference process is complex: it resorts to tiling high-resolution images and needs distinct models for different degradations or fusion tasks. (b) Recent all-in-one models account only for multiple degradations, neglect task-level all-in-one fusion, and incur large FLOPs because they operate in pixel space. (c) Our framework operates in the latent space, leveraging a modernized DiT. This enables the training of a unified task-level and degradation-level all-in-one model that supports either fast regression or refined flow matching.
  • Figure 4: Qualitative comparisons of previous methods and proposed MMAIF on VIF LLVIP, MEF SICE, and MFF RealMFF datasets.
  • Figure 5: Combined degradations and fused results.
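
The data collection described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' pipeline: the operator names (`add_noise`, `add_haze`), their parameters, the template-based `make_prompt` (standing in for the ChatGPT-generated prompts), and the element-wise maximum used as a placeholder fusion network are all assumptions made for illustration.

```python
import numpy as np

def add_noise(img, rng, sigma=0.05):
    """Hypothetical operator: additive Gaussian noise, clipped to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_haze(img, rng, transmission=0.6, airlight=0.9):
    """Hypothetical operator: uniform haze, I = J * t + A * (1 - t)."""
    return np.clip(img * transmission + airlight * (1.0 - transmission), 0.0, 1.0)

# Registry of degradation operators that can be composed by name.
OPERATORS = {"noise": add_noise, "haze": add_haze}

def degrade_pair(pair, names, rng):
    """Apply the named operators, in order, to each image of the pair."""
    degraded = []
    for img in pair:
        for name in names:
            img = OPERATORS[name](img, rng)
        degraded.append(img)
    return tuple(degraded)

def make_prompt(names):
    """Template prompt; the paper generates prompts with ChatGPT instead."""
    return "Remove the " + " and ".join(names) + " and fuse the images."

def build_sample(clean_pair, fuse_fn, names, rng):
    """Collect one degraded/clean/GT/prompt training sample."""
    return {
        "degraded": degrade_pair(clean_pair, names, rng),
        "clean": clean_pair,
        "gt": fuse_fn(*clean_pair),  # GT is fused from the clean inputs
        "prompt": make_prompt(names),
    }

rng = np.random.default_rng(0)
img_a = rng.random((8, 8, 3))
img_b = rng.random((8, 8, 3))

# Placeholder for the pre-trained fusion network: element-wise maximum.
sample = build_sample((img_a, img_b), np.maximum, ["haze", "noise"], rng)
print(sample["prompt"])
```

The point of the sketch is that the ground truth is always fused from the clean inputs, while the model would be trained on the degraded inputs plus the prompt naming the degradations to remove.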