TSJNet: A Multi-modality Target and Semantic Awareness Joint-driven Image Fusion Network
Yuchan Jie, Yushen Xu, Xiaosong Li, Huafeng Li, Haishu Tan, Feiping Nie
TL;DR
This work tackles the problem of incomplete information in single-modality imagery for semantic segmentation and object detection by proposing TSJNet, a multi-modal fusion network guided by high-level task semantics. The model pairs a ResNeSt-based encoder/decoder with a dual-branch Multi-Dimensional Feature Extraction Module (MDM) that captures multi-scale and salient features, and embeds a data-agnostic spatial attention module in the decoder, while jointly optimizing fusion, detection, and segmentation losses. Evaluations on five public datasets (MSRS, M³FD, RoadScene, LLVIP, and TNO) and a newly released UAV Multi-Scenario (UMS) dataset show gains in fusion quality and downstream tasks, including improvements of 7.97% in mAP@0.5 for detection and 10.88% in mIoU for segmentation over state-of-the-art methods, with qualitative and quantitative results indicating stronger edge preservation and target emphasis. The authors also release the code and the UMS dataset, which supports real-world UAV applications in complex environments.
Abstract
This study addresses the problem of incomplete information in unimodal images for semantic segmentation and object detection tasks. Existing multimodal fusion methods have limited capability for discriminative modeling of multi-scale semantic structures and salient target regions, which restricts the effective fusion of task-related semantic details and target information across modalities. To tackle these challenges, this paper proposes a novel fusion network termed TSJNet, which jointly leverages the semantic information output by high-level tasks to guide the fusion process. Specifically, we design a multi-dimensional feature extraction module with dual parallel branches to capture multi-scale and salient features. Meanwhile, a data-agnostic spatial attention module embedded in the decoder dynamically calibrates attention allocation across different data domains, significantly enhancing the model's generalization ability. To optimize both fusion and advanced visual tasks, we balance performance by combining the fusion loss with semantic losses. Additionally, we have developed a multimodal unmanned aerial vehicle (UAV) dataset covering multiple scenarios (UMS). Extensive experiments demonstrate that TSJNet achieves outstanding performance on five public datasets (MSRS, M³FD, RoadScene, LLVIP, and TNO) and our UMS dataset. The generated fusion results exhibit favorable visual effects, and compared to state-of-the-art methods, the mean average precision (mAP@0.5) and mean intersection over union (mIoU) for object detection and segmentation, respectively, improve by 7.97% and 10.88%. The code and the dataset have been publicly released at https://github.com/XylonXu01/TSJNet.
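The combination of the fusion loss with the semantic losses can be illustrated with a minimal sketch. It assumes a plain weighted sum of one fusion term and two task terms (detection and segmentation); the abstract does not give the exact form or the weights, so the names `LossWeights` and `joint_loss` and the default weights of 1.0 are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical trade-off weights; the abstract does not state any values.
    fusion: float = 1.0
    detection: float = 1.0
    segmentation: float = 1.0


def joint_loss(l_fusion: float,
               l_detection: float,
               l_segmentation: float,
               w: LossWeights = LossWeights()) -> float:
    """Weighted sum of a fusion loss and two semantic (task) losses.

    This is an assumed form for illustration only, not the paper's
    exact objective.
    """
    return (w.fusion * l_fusion
            + w.detection * l_detection
            + w.segmentation * l_segmentation)


if __name__ == "__main__":
    # Equal weights: the three terms are simply added.
    print(joint_loss(1.0, 2.0, 3.0))
    # Emphasize fusion quality over the downstream task terms.
    print(joint_loss(1.0, 2.0, 4.0, LossWeights(0.5, 0.25, 0.25)))
```

Under this assumed form, raising the detection and segmentation weights pushes the fused image toward task-relevant content, while raising the fusion weight favors visual fidelity to the source images.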
