Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images

Qianqian Zhang; Xiaolong Jia; Ahmed M. Abdelmoniem; Li Zhou; Junshe An

TL;DR

Results confirm that ESM-YOLO+ integrates strong performance with practicality for real-time deployment, providing an effective solution for high-performance small-target detection in complex remote sensing scenes.

Abstract

Targets in remote sensing images are usually small, weakly textured, and easily disturbed by complex backgrounds, which makes high-precision detection difficult for general-purpose algorithms. Building on our earlier ESM-YOLO, this work presents ESM-YOLO+, a lightweight visible-infrared fusion network. ESM-YOLO+ introduces two key innovations. (1) A Mask-Enhanced Attention Fusion (MEAF) module fuses features at the pixel level via learnable spatial masks and spatial attention, aligning RGB and infrared features, enhancing small-target representation, and alleviating cross-modal misalignment and scale heterogeneity. (2) Training-time Structural Representation (SR) enhancement provides auxiliary supervision that preserves fine-grained spatial structures during training, boosting feature discriminability without extra inference cost. Extensive experiments on the VEDAI and DroneVehicle datasets validate the approach. The model achieves 84.71% mAP on VEDAI and 74.0% mAP on DroneVehicle while greatly reducing model complexity, with 93.6% fewer parameters and 68.0% fewer GFLOPs than the baseline. These results indicate that ESM-YOLO+ combines strong detection performance with practicality for real-time deployment, providing an effective solution for small-target detection in complex remote sensing scenes.
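To make the idea of pixel-level fusion with a spatial mask concrete, here is a minimal sketch. It is not the paper's MEAF module: it assumes a simple convex blend, fused = M * F_rgb + (1 - M) * F_ir, where M is the sigmoid of a per-pixel logit. The function name `fuse_pixelwise`, the single-channel list-of-rows layout, and the blend rule itself are assumptions made for illustration. The spatial attention branch mentioned in the abstract is not modeled, and the mask logits are passed in by hand rather than learned.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_pixelwise(rgb, ir, mask_logits):
    """Blend two same-sized 2D feature maps with a per-pixel mask.

    rgb, ir      : lists of rows of floats (one channel each, for brevity).
    mask_logits  : per-pixel logits; in a real network these would be
                   learned parameters or predicted by a small sub-network
                   (hypothetical here, supplied by the caller).
    Returns a new list of rows with the fused values.
    """
    fused = []
    for rgb_row, ir_row, logit_row in zip(rgb, ir, mask_logits):
        row = []
        for r, i, z in zip(rgb_row, ir_row, logit_row):
            m = sigmoid(z)                      # mask weight for the RGB value
            row.append(m * r + (1.0 - m) * i)   # convex blend of both modalities
        fused.append(row)
    return fused


if __name__ == "__main__":
    rgb = [[1.0, 2.0], [3.0, 4.0]]
    ir = [[5.0, 6.0], [7.0, 8.0]]
    logits = [[0.0, 0.0], [0.0, 0.0]]  # zero logits give equal weighting
    print(fuse_pixelwise(rgb, ir, logits))
```

With all logits at zero the mask is 0.5 everywhere, so each output pixel is the average of the two inputs. A large positive logit pushes a pixel toward the RGB value and a large negative one toward the infrared value, which is the sense in which a spatial mask can select the more informative modality per location.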

Paper Structure (20 sections, 11 equations, 9 figures, 6 tables)

Introduction
Related Work
Object Detection Using Multimodal Data
Small Target Detection
Method
Baseline: ESM-YOLO
ESM-YOLO+: Enhanced Method
Mask-Enhanced Attention Fusion Module
Training-Time Structural Representation Enhancement
Loss Function
Result
Datasets
Training details
Assessment Indicators
Results Comparisons
...and 5 more sections

Figures (9)

Figure 1: Overall architecture of ESM-YOLO+. It comprises three key components: 1) the Mask-Enhanced Attention Fusion (MEAF) module; 2) the detection backbone and detection head; and 3) an EDSR-based super-resolution branch that is used only during training to enhance spatial learning and is removed at inference for faster detection.
Figure 2: Mask-Enhanced Attention Fusion (MEAF) Module. Pixel-level fusion module in the ESM-YOLO+ model.
Figure 3: Distribution of target sizes in the datasets.
Figure 4: Visual comparison between the ESM-YOLO model and the ESM-YOLO+ model. The red circles represent false alarms, the yellow ones denote FP detection results, and the blue ones denote FN detection results.
Figure 5: Comparison of P-R curves for ESM-YOLO (a) and ESM-YOLO+ (b).
...and 4 more figures
