TSJNet: A Multi-modality Target and Semantic Awareness Joint-driven Image Fusion Network

Yuchan Jie, Yushen Xu, Xiaosong Li, Huafeng Li, Haishu Tan, Feiping Nie

TL;DR

This work tackles the challenge of incomplete information in single-modality imagery for semantic segmentation and object detection by proposing TSJNet, a multi-modal fusion network guided by high-level task semantics. The model pairs a ResNeSt-based encoder and decoder with a dual-branch Multi-Dimensional Feature Extraction Module (MDM) and a data-agnostic spatial attention module in the decoder to capture multi-scale and salient features, while jointly optimizing fusion, detection, and segmentation losses. Evaluations on five public datasets and a newly released UAV Multi-Scenario (UMS) dataset show gains in fusion quality and downstream tasks, including improvements over state-of-the-art methods of 7.97% in mAP@0.5 for detection and 10.88% in mIoU for segmentation, with qualitative and quantitative results indicating stronger edge preservation and target emphasis. The authors also release the code and the UMS dataset, which targets UAV applications in complex environments.

Abstract

This study addresses the problem of incomplete information in unimodal images for semantic segmentation and object detection tasks. Existing multimodal fusion methods have limited capability for discriminative modeling of multi-scale semantic structures and salient target regions, which restricts the effective fusion of task-related semantic details and target information across modalities. To tackle these challenges, this paper proposes a novel fusion network termed TSJNet, which jointly leverages the semantic information output by high-level tasks to guide the fusion process. Specifically, we design a multi-dimensional feature extraction module with dual parallel branches to capture multi-scale and salient features. Meanwhile, a data-agnostic spatial attention module embedded in the decoder dynamically calibrates attention allocation across different data domains, significantly enhancing the model's generalization ability. To optimize both fusion and high-level vision tasks, we balance performance by combining the fusion loss with semantic losses. Additionally, we have developed a multimodal unmanned aerial vehicle (UAV) dataset covering multiple scenarios (UMS). Extensive experiments demonstrate that TSJNet achieves outstanding performance on five public datasets (MSRS, M³FD, RoadScene, LLVIP, and TNO) and our UMS dataset. The generated fusion results exhibit favorable visual effects, and compared to state-of-the-art methods, the mean average precision (mAP@0.5) for object detection and the mean intersection over union (mIoU) for segmentation improve by 7.97% and 10.88%, respectively. The code and the dataset have been publicly released at https://github.com/XylonXu01/TSJNet.
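
For readers unfamiliar with the segmentation metric quoted in the abstract, the following is a minimal sketch of how mIoU is commonly computed from predicted and ground-truth label maps. The function name `mean_iou`, the flat-list input format, and the choice to skip classes absent from both maps are assumptions made for illustration; the paper's exact evaluation protocol is not stated here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for flat lists of integer class labels.

    Assumption: classes that appear in neither `pred` nor `target`
    are skipped rather than counted as IoU = 0.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.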

Paper Structure (27 sections, 12 equations, 10 figures, 5 tables)


Figures (10)

  • Figure 1: Comparison of results with SOTA methods on the M³FD (Liu et al.) and MSRS (Tang et al.) datasets. The radar chart highlights the superiority of TSJNet.
  • Figure 2: Unmanned aerial vehicle data acquisition system. The visible and infrared image pairs of the same scene are captured using a UAV equipped with both a zoom camera and an infrared camera.
  • Figure 3: Fusion gradient optimization process. The joint loss over fusion, detection, and segmentation balances the three objectives to reach the optimal global solution.
  • Figure 4: The structure of neighborhood attention transformer.
  • Figure 5: Framework of the proposed TSJNet with dual drivers of segmentation and detection. Our model comprises a base ResNeSt encoder, a dual-branch fusion layer, and a base ResNeSt decoder. The MDM module integrates two parallel branches: NAT, which captures multi-scale contextual dependencies through localized attention and hierarchical design, and MFM, which focuses on extracting salient structures via residual and saliency-enhanced representations.
  • ...and 5 more figures
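
The joint loss mentioned in the Figure 3 caption combines fusion, detection, and segmentation terms. As a rough illustration only, it can be sketched as a weighted sum. The function name `joint_loss` and the weights `w_fusion`, `w_det`, and `w_seg` are hypothetical; the summary above does not give the paper's actual weighting scheme, so this should not be read as the authors' formulation.

```python
def joint_loss(fusion_loss, detection_loss, segmentation_loss,
               w_fusion=1.0, w_det=1.0, w_seg=1.0):
    """Weighted sum of the three task losses.

    The weights are hypothetical placeholders; the paper's actual
    balancing scheme is not specified in this summary.
    """
    return (w_fusion * fusion_loss
            + w_det * detection_loss
            + w_seg * segmentation_loss)


# Toy values: emphasise the fusion term over the two semantic terms.
total = joint_loss(0.5, 0.2, 0.3, w_fusion=2.0, w_det=0.5, w_seg=0.5)
```

With a scalar objective of this kind, a single backward pass lets gradients from the detection and segmentation terms influence the fusion network, which is the general idea the caption describes.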