Task-driven Image Fusion with Learnable Fusion Loss

Haowen Bai, Jiangshe Zhang, Zixiang Zhao, Yichen Wu, Lilun Deng, Yukun Cui, Tao Feng, Shuang Xu

TL;DR

This work tackles the rigidity of predefined fusion objectives by introducing Task-driven Image Fusion (TDFusion), a framework in which a loss generation module $\mathcal{G}$ parameterizes a learnable fusion loss $\mathcal{L}_f$ that is optimized for downstream task performance. The fusion network $\mathcal{F}$ and the downstream task network $\mathcal{T}$ are trained in a meta-learning loop inspired by MAML, alternating inner and outer updates so that $\mathcal{L}_f$ is continually adapted to minimize the task loss $\mathcal{L}_t$ on fused images. The fusion loss combines an intensity-guided term and a gradient-preservation term. The intensity term uses per-pixel weights $w_a$ and $w_b$, one for each source image, which are produced by $\mathcal{G}$ and normalized by a Softmax, so the loss can select which source's information to retain at each pixel. Across four fusion datasets and two downstream tasks (semantic segmentation and object detection), TDFusion improves both fusion quality (e.g., higher SSIM, VIF, and $Q^{AB/F}$) and downstream metrics (e.g., mIoU and AP), indicating that the learned loss adapts the fusion behavior to the task.
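As a rough illustration of the loss structure described above, the following NumPy sketch combines a per-pixel weighted intensity term with a gradient-preservation term. It is not the authors' implementation: this summary does not give the exact form of either term, so the squared-error intensity term, the finite-difference gradient magnitude, the max-gradient target, the balance factor `alpha`, and all function names are assumptions. The logits stand in for the raw outputs of $\mathcal{G}$ before the Softmax.

```python
import numpy as np

def softmax_weights(logits_a, logits_b):
    """Per-pixel two-way softmax, so w_a + w_b = 1 at every pixel."""
    m = np.maximum(logits_a, logits_b)  # subtract the max for numerical stability
    e_a, e_b = np.exp(logits_a - m), np.exp(logits_b - m)
    return e_a / (e_a + e_b), e_b / (e_a + e_b)

def grad_mag(img):
    """Finite-difference gradient magnitude |dx| + |dy| (assumed operator)."""
    dy, dx = np.gradient(img)
    return np.abs(dx) + np.abs(dy)

def fusion_loss(fused, img_a, img_b, logits_a, logits_b, alpha=1.0):
    """Hypothetical fusion loss: weighted intensity term + alpha * gradient term."""
    w_a, w_b = softmax_weights(logits_a, logits_b)
    # Intensity term: each pixel is pulled toward the source with the larger weight.
    intensity = np.mean(w_a * (fused - img_a) ** 2 + w_b * (fused - img_b) ** 2)
    # Gradient term: match the stronger of the two source gradients (assumption).
    target_grad = np.maximum(grad_mag(img_a), grad_mag(img_b))
    gradient = np.mean(np.abs(grad_mag(fused) - target_grad))
    return intensity + alpha * gradient
```

With logits that strongly favor source `a`, the intensity term is close to zero when the fused image equals `a`, which is the "selective retention" behavior the weights are meant to provide.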

Abstract

Multi-modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual features compared to single-source images, often improving downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the downstream task loss in a meta-learning manner. The learning objective is to minimize the task loss of fused images after optimizing the fusion module with the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion's training relies entirely on the downstream task loss, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion's performance through fusion experiments conducted on four different datasets, in addition to evaluations on semantic segmentation and object detection tasks.
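The learning objective stated in the abstract (minimize the task loss of fused images after the fusion module has taken a step on the fusion loss) can be sketched on a toy problem. Everything below is hypothetical and chosen only to keep the example tiny: the "fusion network" is a single mixing scalar `theta`, the "loss generation module" is a single logit `phi`, the "task" prefers a fixed 0.8/0.2 blend of the two sources, and gradients are taken by central finite differences rather than by backpropagation through real networks.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((8, 8))        # stand-in for source image a
b = rng.random((8, 8))        # stand-in for source image b
target = 0.8 * a + 0.2 * b    # hypothetical image the toy task prefers

def fuse(theta):
    """Toy fusion module: a single mixing coefficient."""
    return theta * a + (1.0 - theta) * b

def weights(phi):
    """Toy loss generator: two-way softmax over the logits (phi, 0)."""
    w_a = 1.0 / (1.0 + np.exp(-phi))
    return w_a, 1.0 - w_a

def fusion_loss(theta, phi):
    w_a, w_b = weights(phi)
    f = fuse(theta)
    return np.mean(w_a * (f - a) ** 2 + w_b * (f - b) ** 2)

def task_loss(theta):
    return np.mean((fuse(theta) - target) ** 2)

def num_grad(fn, x, eps=1e-5):
    """Central finite difference of a scalar function."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

def inner_step(theta, phi, lr):
    """One gradient step of the fusion parameter on the fusion loss."""
    return theta - lr * num_grad(lambda t: fusion_loss(t, phi), theta)

def train(steps=1000, lr_inner=1.0, lr_outer=20.0):
    theta, phi = 0.5, 0.0
    for _ in range(steps):
        # Outer update: change phi so that the task loss AFTER a virtual
        # inner step on the fusion loss goes down.
        meta_objective = lambda p: task_loss(inner_step(theta, p, lr_inner))
        phi -= lr_outer * num_grad(meta_objective, phi)
        # Fusion update: take the real step using the refreshed fusion loss.
        theta = inner_step(theta, phi, lr_inner)
    return theta, phi
```

In this toy setting the fusion loss is minimized at `theta = w_a`, so the outer update should drive `w_a` toward the blend the task prefers even though the fusion parameter itself never sees the task loss directly.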

Paper Structure

This paper contains 22 sections, 9 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: The TDFusion workflow alternates between training the loss generation module and the fusion module. Training of the loss generation module involves both inner and outer updates, learned through meta-learning.
  • Figure 2: Visual comparison of fusion results. The cases are “01258N” from the MSRS dataset, “00122” from the FMB dataset, “00449” from the M3FD dataset, and “200304” from the LLVIP dataset.
  • Figure 3: Visual comparison for semantic segmentation. The cases are “00726N” from the MSRS dataset and “01438” from the FMB dataset.
  • Figure 4: Visual comparison for object detection. The cases are “02236” from the M3FD dataset and “210145” from the LLVIP dataset.
  • Figure 5: Visualisation of the learnable loss for downstream tasks.