Small Object Detection in Complex Backgrounds with Multi-Scale Attention and Global Relation Modeling

Wenguang Tao, Xiaotian Wang, Tian Yan, Yi Wang, Jie Yan

TL;DR

Extensive experiments conducted on the large-scale RGBT-Tiny benchmark demonstrate that the proposed method consistently outperforms existing state-of-the-art detectors under both IoU-based and scale-adaptive evaluation metrics.

Abstract

Small object detection under complex backgrounds remains a challenging task due to severe feature degradation, weak semantic representation, and inaccurate localization caused by downsampling operations and background interference. Existing detection frameworks are mainly designed for general objects and often fail to explicitly address the unique characteristics of small objects, such as limited structural cues and strong sensitivity to localization errors. In this paper, we propose a multi-level feature enhancement and global relation modeling framework tailored for small object detection. Specifically, a Residual Haar Wavelet Downsampling module is introduced to preserve fine-grained structural details by jointly exploiting spatial-domain convolutional features and frequency-domain representations. To enhance global semantic awareness and suppress background noise, a Global Relation Modeling module is employed to capture long-range dependencies at high-level feature stages. Furthermore, a Cross-Scale Hybrid Attention module is designed to establish sparse and aligned interactions across multi-scale features, enabling effective fusion of high-resolution details and high-level semantic information with reduced computational overhead. Finally, a Center-Assisted Loss is incorporated to stabilize training and improve localization accuracy for small objects. Extensive experiments conducted on the large-scale RGBT-Tiny benchmark demonstrate that the proposed method consistently outperforms existing state-of-the-art detectors under both IoU-based and scale-adaptive evaluation metrics. These results validate the effectiveness and robustness of the proposed framework for small object detection in complex environments.
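The abstract motivates wavelet-based downsampling by its ability to keep fine structural detail. As a minimal sketch of that idea, and not the authors' RHWD module, the NumPy code below applies a standard single-level 2D Haar transform, which halves the resolution while keeping all information in four sub-bands. The function names `haar_dwt2` and `haar_idwt2` and the sub-band names are illustrative assumptions; the convolutional branch and residual connection described in the paper are not modeled here.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an (H, W) array.

    Returns four (H/2, W/2) sub-bands: the low-pass band plus
    column-difference, row-difference, and diagonal detail bands.
    (Hypothetical helper for illustration, not the paper's module.)
    """
    x = np.asarray(x, dtype=np.float64)
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # scaled block average
    col_diff = (a - b + c - d) / 2.0  # left column minus right column
    row_diff = (a + b - c - d) / 2.0  # top row minus bottom row
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return low, col_diff, row_diff, diag

def haar_idwt2(low, col_diff, row_diff, diag):
    """Exact inverse of haar_dwt2: rebuilds the full-resolution array."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (low + col_diff + row_diff + diag) / 2.0
    x[0::2, 1::2] = (low - col_diff + row_diff - diag) / 2.0
    x[1::2, 0::2] = (low + col_diff - row_diff - diag) / 2.0
    x[1::2, 1::2] = (low - col_diff - row_diff + diag) / 2.0
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((8, 8))
    bands = haar_dwt2(image)
    # The four half-resolution bands together reconstruct the input exactly.
    print(np.allclose(haar_idwt2(*bands), image))
```

By contrast, 2x2 average pooling keeps only a scaled copy of the low-pass band and discards the three detail bands, which is one way to see why ordinary downsampling can erase small-object structure.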

Paper Structure

This paper contains 13 sections, 11 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Overall framework diagram. The input image first passes through the residual Haar wavelet downsampling module (RHWD) and is then fed into the backbone network for feature extraction. The P3, P4, and P5 features are then enhanced by the global relation modeling module (GRM) and the cross-scale hybrid attention module (CSHA), and the detection results are finally obtained via the feature pyramid network (FPN).
  • Figure 2: Global relation modeling module (GRM).
  • Figure 3: Cross-scale hybrid attention module (CSHA).
  • Figure 4: Schematic diagram of sampling points in the P3, P4, and P5 layers (only 4 of the 8 heads are shown).
  • Figure 5: Visualized comparison of results between the baseline and our method on the RGBT-Tiny dataset.