One Latent Space to Rule All Degradations: Unifying Restoration Knowledge for Image Fusion

Haolong Ma, Hui Li, Chunyang Cheng, Zeyang Zhang, Xiaoqing Luo, Xiaoning Song, Xiao-Jun Wu

TL;DR

The paper tackles degradations in multi-modal infrared–visible fusion, critiquing current All-in-One degradation-aware models for relying on synthetic data and entangled data-level degradations. It introduces LURE, a two-stage framework that first learns a Unified Latent Feature Space (ULFS) from high-quality restoration data and then learns fusion rules within that space, aided by a pseudo-degradation task that stabilizes distribution alignment. A novel inner residual design and a Text-Guided Attention mechanism support robust feature learning and degradation-agnostic fusion, and the losses include $\mathcal{L}_{unified}$, which aligns latent representations across degradations. Empirically, LURE achieves state-of-the-art results on vanilla and degradation-aware fusion benchmarks and improves downstream multi-modal semantic segmentation, while reducing reliance on synthetic degradation datasets. The approach offers a scalable, generalizable path toward robust multi-modal fusion across diverse real-world degradations and can extend to other multi-modal fusion tasks.
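
The two-stage idea summarized above (learn a shared latent space first, then learn fusion rules with the encoders held fixed) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the scalar "encoder" `w_enc`, the scalar fusion weight `w_fuse`, the squared-error objectives, and the toy data are hypothetical stand-ins, not LURE's architecture or losses.

```python
# Toy two-stage schedule. All names, data, and scalar "models" are hypothetical.

def train_stage1(pairs, w_enc=0.0, lr=0.05, steps=200):
    """Stage I: fit a scalar 'encoder' so that w_enc * degraded ~= clean."""
    for _ in range(steps):
        grad = sum(2 * (w_enc * x - y) * x for x, y in pairs) / len(pairs)
        w_enc -= lr * grad
    return w_enc


def train_stage2(samples, w_enc, w_fuse=0.0, lr=0.05, steps=200):
    """Stage II: w_enc is frozen (never updated); only w_fuse is trained.

    The fused latent is w_fuse * z_ir + (1 - w_fuse) * z_vis.
    """
    for _ in range(steps):
        grad = 0.0
        for ir, vis, target in samples:
            z_ir, z_vis = w_enc * ir, w_enc * vis  # frozen encoder outputs
            fused = w_fuse * z_ir + (1 - w_fuse) * z_vis
            grad += 2 * (fused - target) * (z_ir - z_vis)
        w_fuse -= lr * grad / len(samples)
    return w_fuse


# Stage I data: (degraded, clean) restoration pairs with clean = 2 * degraded.
restoration_pairs = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0)]

# Stage II data: (infrared, visible, target latent), built with a true
# encoder weight of 2.0 and a true fusion weight of 0.3.
raw_inputs = [(1.0, 2.0), (3.0, 1.0), (0.5, 1.5)]
fusion_samples = [(ir, vis, 0.3 * 2.0 * ir + 0.7 * 2.0 * vis) for ir, vis in raw_inputs]

w_enc = train_stage1(restoration_pairs)
w_fuse = train_stage2(fusion_samples, w_enc)
print(f"encoder weight: {w_enc:.4f}, fusion weight: {w_fuse:.4f}")
```

The point of the sketch is the separation of concerns: Stage I needs only restoration pairs, and Stage II never touches the encoder parameter, so the fusion rule is learned entirely inside the latent space that Stage I produced.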

Abstract

All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, aim to address complex scenes by mitigating degradations in source images and generating high-quality fused images. Mainstream ADFMs rely on end-to-end learning and heavily synthesized datasets to achieve degradation awareness and fusion. This coarse learning strategy and the dependence on non-real-world datasets often limit their upper-bound performance, leading to low-quality results. To address these limitations, we present LURE, a Learning-driven Unified REpresentation model for degradation-aware infrared and visible image fusion. LURE learns a Unified Latent Feature Space (ULFS) to avoid the dependency on complex data formats inherent in previous end-to-end learning pipelines. It further improves image fusion quality by leveraging the intrinsic relationships between modalities. A novel loss function is also proposed to make the learning of unified latent representations more stable. More importantly, LURE seamlessly incorporates existing high-quality real-world image restoration datasets. To further enhance the model's representation capability, we design a simple yet effective structure, termed the internal residual block, to facilitate the learning of latent features. Experiments show that our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code is available in the supplementary materials.

Paper Structure

This paper contains 26 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustration of the limitations in existing methods due to data-level issues. "Paired" denotes datasets containing corresponding infrared–visible image pairs.
  • Figure 2: t-SNE visualization of ULFS distributions for tasks (e.g., HZ denotes Dehaze; other abbreviations are detailed in Sec. \ref{sec:data}) of both modalities across training iterations. It reveals initially distinct task distributions gradually merging into a unified distribution. Detailed t-SNE visualizations are in the supplementary material.
  • Figure 3: Two-stage training overview. (a) Stage I learns a unified latent space guided by text. (b) Stage II freezes encoders and trains a fusion module for strategy learning. “Pseudo prompt” denotes the prompt for pseudo degradation (details in Supplementary).
  • Figure 4: Schematic diagrams of the Encoder Layer and Text-Guided Attention (TGA). (a) Structure diagram of the Encoder Layer. (b) Structure diagram of TGA. "$\odot, \oplus, \ominus$" represent element-wise operations, and $\sigma$ represents the Sigmoid function.
  • Figure 5: Qualitative comparison of degradation-aware fusion tasks. "eir" denotes External Image Restoration Methods. Text-IF uses "eir" only for Super Resolution and combined degradations. See the supplementary materials for more qualitative comparisons.
  • ...and 2 more figures
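
The merging of per-task distributions described in the Figure 2 caption can also be tracked numerically. The sketch below is a simple stand-in diagnostic, not the paper's $\mathcal{L}_{unified}$ and not t-SNE: it measures how far each task's latent centroid sits from the centroid of all tasks. The function names, the task label `task_B`, and the 2-D toy latents are hypothetical.

```python
# Illustrative diagnostic only: a centroid-spread score for per-task latents.
# This is an assumed stand-in, not the paper's loss or visualization method.

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def task_spread(latents_by_task):
    """Mean squared distance from each task centroid to the global centroid."""
    cents = [centroid(vs) for vs in latents_by_task.values()]
    global_c = centroid(cents)
    return sum(
        sum((a - b) ** 2 for a, b in zip(c, global_c)) for c in cents
    ) / len(cents)


# Early in training: the two tasks occupy clearly separated regions.
early = {"HZ": [[0.0, 0.0], [0.0, 2.0]], "task_B": [[4.0, 0.0], [4.0, 2.0]]}
# Later in training: both tasks cluster around the same point.
late = {"HZ": [[1.9, 1.0], [2.1, 1.0]], "task_B": [[2.0, 0.9], [2.0, 1.1]]}

print("early spread:", task_spread(early))
print("late spread:", task_spread(late))
```

A spread that shrinks toward zero over training iterations would be consistent with the unification the caption describes, although a centroid-only score ignores the shape of each distribution and is a much coarser view than the t-SNE plots.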