Instruction-Driven Fusion of Infrared-Visible Images: Tailoring for Diverse Downstream Tasks

Zengyi Yang, Yafei Zhang, Huafeng Li, Yu Liu

TL;DR

This work proposes Task-Oriented Adaptive Regulation (T-OAR), an adaptive mechanism specifically designed for multi-task environments that excels in object detection, semantic segmentation, and salient object detection, demonstrating its strong adaptability, flexibility, and task specificity.

Abstract

The primary value of infrared and visible image fusion technology lies in applying the fusion results to downstream tasks. However, existing methods face challenges such as increased training complexity and significantly compromised performance of individual tasks when addressing multiple downstream tasks simultaneously. To tackle this, we propose Task-Oriented Adaptive Regulation (T-OAR), an adaptive mechanism specifically designed for multi-task environments. Additionally, we introduce the Task-related Dynamic Prompt Injection (T-DPI) module, which generates task-specific dynamic prompts from user-input text instructions and integrates them into target representations. This guides the feature extraction module to produce representations that are more closely aligned with the specific requirements of downstream tasks. By incorporating the T-DPI module into the T-OAR framework, our approach generates fusion images tailored to task-specific requirements without the need for separate training or task-specific weights. This not only reduces computational costs but also enhances adaptability and performance across multiple tasks. Experimental results show that our method excels in object detection, semantic segmentation, and salient object detection, demonstrating its strong adaptability, flexibility, and task specificity. This provides an efficient solution for image fusion in multi-task environments, highlighting the technology's potential across diverse applications.

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison of the proposed method with existing approaches. Existing methods (a) perform well only on the specific downstream tasks for which they are trained, thereby limiting their multi-task adaptability. In contrast, our method (b) uses T-OAR with task instructions to fine-tune the fusion network, thereby simultaneously meeting the requirements of multiple downstream tasks without retraining.
  • Figure 2: Overview of the proposed method. The proposed method integrates downstream task instruction features obtained from LLaMA into T-DPI, which generates dynamic prompts closely aligned with the specific task. This enables the method to flexibly adjust the output features of VI-E and IR-E based on input instructions, ensuring that these features meet the specific requirements of downstream tasks.
  • Figure 3: Structure of the FRB, including the FF, FFD, and RB submodules, where FFD consists of $M$ CRBs and RB contains 3 blocks constructed by Conv layers, BN, and LReLU.
  • Figure 4: Structure of the T-DPI, composed of GAP, GMP, Adapter, and CPPB.
  • Figure 5: Comparison of visual effects with SOTA methods. The figure is divided into three sections, each with two rows. The input images are from the M$^{3}$FD, FMB, and VT5000 datasets, validated on OD, SS, and SOD tasks. The first column shows the IR-VIS source images and their corresponding GT for the downstream tasks. Columns two to seven display the fusion results and downstream task outcomes from the comparison methods.
  • ...and 2 more figures
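
The Figure 4 caption names GAP, GMP, and an Adapter as parts of T-DPI, and the abstract says the module turns text instructions into dynamic prompts that are integrated into the features. The sketch below is only an illustration of that general idea, not the authors' implementation: the gating rule, the single-layer `adapter`, the function `inject_prompt`, the random stand-in embeddings, and all sizes are assumptions made for this example. CPPB and the LLaMA text encoder are not modeled.

```python
import math
import random


def global_avg_pool(feat):
    """GAP: mean over the spatial grid of each channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]


def global_max_pool(feat):
    """GMP: max over the spatial grid of each channel."""
    return [max(map(max, ch)) for ch in feat]


def adapter(text_emb, weight, bias):
    """Hypothetical adapter: one linear layer mapping a text embedding to C values."""
    return [sum(w * x for w, x in zip(row, text_emb)) + b
            for row, b in zip(weight, bias)]


def inject_prompt(feat, text_emb, weight, bias):
    """Assumed injection rule: scale each channel by a sigmoid gate in (0, 1)
    built from the pooled statistics and the instruction embedding."""
    pooled = [a + m for a, m in zip(global_avg_pool(feat), global_max_pool(feat))]
    task = adapter(text_emb, weight, bias)
    gates = [1.0 / (1.0 + math.exp(-(p * t))) for p, t in zip(pooled, task)]
    fused = [[[g * v for v in row] for row in ch] for ch, g in zip(feat, gates)]
    return fused, gates


# Toy data: a C x H x W feature map and two random stand-ins for instruction embeddings.
rng = random.Random(0)
C, H, W, D = 4, 3, 3, 6
feat = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
weight = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
bias = [0.0] * C
emb_detect = [rng.uniform(-1, 1) for _ in range(D)]   # stand-in for a detection instruction
emb_segment = [rng.uniform(-1, 1) for _ in range(D)]  # stand-in for a segmentation instruction

out_detect, gates_detect = inject_prompt(feat, emb_detect, weight, bias)
out_segment, gates_segment = inject_prompt(feat, emb_segment, weight, bias)
```

The same feature map and the same weights yield different channel gates for the two embeddings, which mirrors the claim that one set of weights can serve several tasks when the instruction changes.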