UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation

Xingyuan Li, Songcheng Du, Yang Zou, HaoYuan Xu, Zhiying Jiang, Jinyuan Liu

Abstract

Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion's superior visual quality, generalization ability, and adaptability to real-world scenarios. Code is available at https://github.com/dusongcheng/UniFusion.
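
As a reading aid, the bilevel strategy mentioned above can be written in the generic bilevel form below. This is only a sketch under assumed notation, not the paper's own equations: $\theta_F$ and $\theta_R$ (fusion and reconstruction parameters) and $\mathcal{L}_{\mathrm{fus}}$ and $\mathcal{L}_{\mathrm{rec}}$ (fusion and reconstruction objectives) are hypothetical symbols, and placing reconstruction at the lower level follows the Figure 1 caption.

$$
\min_{\theta_F} \; \mathcal{L}_{\mathrm{fus}}\bigl(\theta_F, \theta_R^{*}(\theta_F)\bigr)
\quad \text{s.t.} \quad
\theta_R^{*}(\theta_F) \in \arg\min_{\theta_R} \; \mathcal{L}_{\mathrm{rec}}(\theta_R, \theta_F)
$$

In this form the upper-level fusion objective is evaluated at the lower-level reconstruction solution, which is what couples the two objectives. How the paper approximates $\theta_R^{*}$ in practice is not stated in the abstract.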

Paper Structure (17 sections, 6 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Overview of the proposed bilevel optimization framework for unified image fusion tasks. (Left) Conventional approaches are typically designed for task-specific scenarios and struggle to preserve source information during fusion, leading to inconsistent modality representations and information loss. (Middle) Our UniFusion formulates fusion as a bilevel optimization problem. The lower-level reconstruction branch learns modality-consistent representations through self-reconstruction, while the upper-level fusion branch adaptively integrates them into a unified representation that enhances semantic preservation and structural integrity. (Right) Quantitative and qualitative results demonstrate that our method consistently outperforms TC-MoA across multiple metrics.
  • Figure 2: Overview of the proposed UniFusion framework.
  • Figure 3: Quantitative comparison on MIF (top) and MFIF (bottom) datasets. The plots illustrate the distribution of all test samples across five evaluation metrics, where “–” and “$\circ$” indicate the median and mean values, respectively.
  • Figure 4: Visual comparison of infrared and visible image fusion results with SOTA methods on M$^3$FD (top) and T$\&$R (bottom) datasets.
  • Figure 5: Visual comparison of medical image fusion results with SOTA methods on MIF dataset.
  • ...and 3 more figures