One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion

Chunyang Cheng, Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Zhangyong Tang, Hui Li, Zeyang Zhang, Sara Atito, Muhammad Awais, Josef Kittler

TL;DR

This work tackles the generalisation gap in image fusion by shifting from high-level semantic supervision to low-level, pixel-level guidance. It introduces GIFNet, a three-branch architecture that enables cross-task interaction between a multi-modal fusion task and a digital-photography fusion task, reinforced by a reconstruction objective and an adaptive cross-fusion gating mechanism. By creating an RGB-focused joint dataset and using a shared reconstruction branch, GIFNet learns task-agnostic features that generalise to both seen and unseen fusion tasks. The approach achieves state-of-the-art performance across diverse fusion benchmarks at reduced computational cost, and it also supports single-modality enhancement for RGB vision tasks, which makes it a practical candidate for real-world deployment.

Abstract

Advanced image fusion methods mostly prioritise high-level missions, where task interaction struggles with semantic gaps, requiring complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owing to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be available at https://github.com/AWCXV/GIFNet.

Paper Structure

This paper contains 17 sections, 10 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: A comparison of the versatility and efficiency of advanced multi-task fusion methods. The indices on the arrows denote the fusion tasks on which the corresponding methods were validated.
  • Figure 2: Comparison of advanced multi-task fusion methods relying on high-level tasks and the proposed low-level task interaction paradigm. The semantic-focused paradigms cannot consistently ensure robust fusion quality, whereas our paradigm provides pixel-level supervision and yields clear texture details.
  • Figure 3: The network architecture and training process of GIFNet. As shown in diagram (d), the Multi-Modal (MM) and Digital Photography (DP) branches of our model are trained alternately, based on the specifically designed cross-fusion gating mechanism (c).
  • Figure 4: An illustration of the inference phase of GIFNet. In this stage, only one pair of images is used to produce the multi-modal and digital photography features.
  • Figure 5: Quantitative results of the CFGM ablation experiments.
  • ...and 4 more figures