Bridging the Scale Gap: Balanced Tiny and General Object Detection in Remote Sensing Imagery

Zhicheng Zhao, Yin Huang, Lingma Sun, Chenglong Li, Jin Tang

TL;DR

The paper tackles the scale gap in remote sensing object detection, where tiny objects co-occur with large structures and object density varies dramatically. It introduces ScaleBridge-Det, a large detection framework that combines a Routing-Enhanced Mixture Attention (REM) module for scale-adaptive feature fusion with a Density-Guided Dynamic Query (DGQ) module for density-aware query allocation, both feeding a DETR-based decoder. The authors report state-of-the-art performance on AI-TOD-V2 and DTOD, and strong cross-domain robustness on VisDrone, while maintaining balanced detection across extreme scales. These results suggest a scalable path toward robust, cross-domain tiny-to-large object detection in aerial imagery, with potential for lightweight and multi-modal extensions in practical deployments.

Abstract

Tiny object detection in remote sensing imagery has attracted significant research interest in recent years. Despite recent progress, achieving balanced detection performance across diverse object scales remains a formidable challenge, particularly in scenarios where dense tiny objects and large objects coexist. Although large foundation models have revolutionized general vision tasks, their application to tiny object detection remains unexplored due to the extreme scale variation and density distribution inherent to remote sensing imagery. To bridge this scale gap, we propose ScaleBridge-Det, which is, to the best of our knowledge, the first large detection framework designed for tiny objects. It achieves balanced performance across diverse scales through scale-adaptive expert routing and density-guided query allocation. Specifically, we introduce a Routing-Enhanced Mixture Attention (REM) module that dynamically selects and fuses scale-specific expert features via adaptive routing, addressing the tendency of standard Mixture-of-Experts (MoE) models to favor dominant scales. REM generates complementary and discriminative multi-scale representations suitable for both tiny and large objects. Furthermore, we present a Density-Guided Dynamic Query (DGQ) module that predicts object density to adaptively adjust query positions and numbers, enabling efficient resource allocation for objects of varying scales. The proposed framework allows ScaleBridge-Det to simultaneously optimize performance for both dense tiny and general objects without trade-offs. Extensive experiments on benchmark and cross-domain datasets demonstrate that ScaleBridge-Det achieves state-of-the-art performance on AI-TOD-V2 and DTOD, while exhibiting superior cross-domain robustness on VisDrone.
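
The abstract describes density-guided query allocation only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a small grid whose cells hold an estimated object count, sizes the query budget from the total, and places queries at the densest cells. The function name `allocate_queries`, its parameters, and the clamping rule are all hypothetical choices made for illustration.

```python
import math


def allocate_queries(density, min_queries=1, max_queries=8, per_object=2.0):
    """Pick a query budget and query positions from a density grid.

    `density` is a 2-D list of non-negative floats, read here as the
    estimated number of objects per cell (an assumption of this sketch).
    Returns (budget, positions), where positions are (row, col) tuples.
    """
    cells = [(value, r, c)
             for r, row in enumerate(density)
             for c, value in enumerate(row)]
    estimated_count = sum(value for value, _, _ in cells)

    # More predicted objects means more queries, clamped to a fixed range
    # and never exceeding the number of available cells.
    budget = math.ceil(per_object * estimated_count)
    budget = max(min_queries, min(max_queries, budget))
    budget = min(budget, len(cells))

    # Densest cells first; ties are broken in row-major order.
    ranked = sorted(cells, key=lambda t: (-t[0], t[1], t[2]))
    positions = [(r, c) for _, r, c in ranked[:budget]]
    return budget, positions


sparse_scene = [[0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0]]
dense_scene = [[1.0, 2.0, 0.0],
               [0.0, 0.5, 0.0],
               [0.0, 0.0, 0.5]]

n_sparse, pos_sparse = allocate_queries(sparse_scene)
n_dense, pos_dense = allocate_queries(dense_scene)
```

In this toy setting the dense scene receives a larger query budget than the sparse one, and the first query in each scene lands on the densest cell. A real detector would predict the density map with a learned head and refine positions in the decoder.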

Paper Structure

This paper contains 17 sections, 11 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Multi-scale object detection challenges in remote sensing imagery. (a)-(b) Representative scenes showing the coexistence of dense tiny objects and sparse large targets. (c)-(d) Visualization of extreme scale imbalance, with dense tiny object clusters (yellow dashed regions) juxtaposed with large structures. Colors indicate scales: dark blue (very tiny $<8^2$), yellow (tiny $8^2$-$16^2$), orange (small $16^2$-$32^2$), orange-red (medium $32^2$-$96^2$), crimson (large $>96^2$).
  • Figure 2: Scale balance comparative analysis with maxdet=300 constraint on a representative UAV scene containing 333 objects across five scale categories. Red boxes indicate missed detections. (a) Ground truth reference. (b) DQ-DETR excels at tiny objects but misses numerous general-scale targets. (c) Ground-DINO achieves strong general-object performance but fails on abundant tiny objects. (d) Our ScaleBridge-Det achieves balanced performance across all scales with minimal missed detections.
  • Figure 3: Overview of the proposed ScaleBridge-Det framework. The framework consists of four main stages: (1) Multi-Expert Feature Extraction using diverse backbones (ResNet, ViT, Swin Transformer) with adaptive routing to select specialized experts based on input characteristics; (2) Routing-Enhanced Mixture Attention (REM) module that performs scale-adaptive feature fusion by dynamically combining expert features through hybrid attention mechanisms, generating robust multi-scale representations; (3) Density-Guided Dynamic Query (DGQ) module that predicts object density maps and adaptively adjusts query positions and numbers according to different object densities, enabling efficient resource allocation across varying density scenarios; (4) DETR Decoder with multi-layer self and cross-attention to refine predictions. The integration of these components enables balanced detection performance across extreme scale variations, from tiny objects to large structures, without performance trade-offs.
  • Figure 4: Comprehensive cross-domain evaluation visualization with category mapping. Left: Training domain showing AI-TOD satellite/remote sensing imagery (100 objects). Middle: Cross-domain transfer with category mapping (AI-TOD vehicle/person mapped to corresponding VisDrone categories) and no fine-tuning. Right: Testing domain with 6 VisDrone UAV test images, each showing 6 model comparisons. Detection quality is color-coded: green boxes indicate good detections (IoU $> 0.5$), yellow boxes show medium detections ($0.25 <$ IoU $< 0.5$), blue boxes represent false positives, and red boxes mark missed detections. Column labels: (a) Ground Truth, (b) Faster R-CNN, (c) DETR, (d) DQ-DETR, (e) CoDETR, (f) ScaleBridge-Det (Ours, highlighted with red border). Our method demonstrates superior cross-domain generalization with significantly more green boxes (good detections) and fewer red boxes (missed detections) compared to baseline methods, validating the effectiveness of scale-adaptive expert routing and density-guided query allocation for cross-domain transfer.
  • Figure 5: Parameter Efficiency Analysis: AP vs. Model Size on AI-TOD test set. The x-axis (log scale) shows model parameters in millions (M), and the y-axis shows Average Precision (AP%). Baseline methods are grouped by architecture type: CNN-based (circles), Transformer-based (squares), YOLO series (diamonds), Tiny-specific methods (triangles), and Foundation model (pentagon). The red star-connected line represents ScaleBridge-Det variants with different expert configurations, all using tiny-object-specific pre-training (DIOR + DOTA).
  • ...and 2 more figures
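
The five scale categories in the Figure 1 caption are defined by bounding-box area in pixels, which can be sketched as a small lookup. This is an illustrative helper, not code from the paper: the function name `scale_category` is hypothetical, and the treatment of boxes that fall exactly on a threshold (assigned to the larger category here) is an assumption, since the caption does not specify it.

```python
from collections import Counter


def scale_category(width, height):
    """Map a box size in pixels to the scale names used in Figure 1."""
    area = width * height
    if area < 8 ** 2:
        return "very tiny"
    if area < 16 ** 2:
        return "tiny"
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Hypothetical box sizes (width, height) for one scene.
boxes = [(5, 5), (10, 10), (12, 9), (20, 20), (50, 50), (100, 100)]
histogram = Counter(scale_category(w, h) for w, h in boxes)
```

Counting boxes per category in this way gives a quick view of how imbalanced a scene is across scales.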