Open-Text Aerial Detection: A Unified Framework For Aerial Visual Grounding And Detection

Guoting Wei, Xia Yuan, Yang Zhou, Haizhao Jing, Yu Liu, Xianbiao Qi, Chunxia Zhao, Haokui Zhang, Rong Xiao

TL;DR

This work addresses the gap between open-vocabulary aerial detection and remote sensing visual grounding by proposing OTA-Det, a unified open-text aerial detection framework. It introduces a task reformulation that converts RSVG into a joint classification-localization problem and a dense semantic alignment strategy that enables explicit multi-granular vision-language correspondence through attribute-level decomposition and unified supervision matrices. The architecture, based on RT-DETR, employs a multi-modality backbone and a decoupled multi-granular head to produce both holistic and attribute-level grounding signals, optimized with MAL-based losses. Joint training on OVAD and RSVG data yields state-of-the-art performance across six benchmarks while maintaining real-time inference at 34 FPS, demonstrating practical applicability for complex aerial scenes and compositional textual queries.

Abstract

Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms into a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection by introducing several highly efficient modules. It achieves state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.

Paper Structure (31 sections, 13 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison of task paradigms and capabilities. Left: Top two images show comparison between OVAD and OTA-Det detection results. Bottom two images show comparison between RSVG and OTA-Det detection results. OVAD supports One-to-Many Detection and Multi-Query Inference but is limited to category-level semantics, while RSVG handles Complex Expressions but is restricted to Single-Target scenarios. OTA-Det combines the strengths of both paradigms with extended capabilities including Attribute-Set Interaction, Multi-Label Association, and Real-Time Inference at 34 FPS. Right: Performance comparison on LAE-80C (top) and AerialVG (bottom) benchmarks, demonstrating OTA-Det's superior accuracy and efficiency.
  • Figure 2: Overview of the proposed OTA-Det framework. (A) Image-Level Annotation Aggregation restructures sparse sentence-level triplets $\langle I, E_k, b_k \rangle$ into dense image-level query sets $\mathcal{T}_{E}$ with labeled ground truth $\mathcal{G}$, transforming RSVG into a joint classification-localization task structurally aligned with OVAD. (B) Attribute-Level Data Decomposition leverages an LLM to parse referring expressions in $\mathcal{T}_{E}$ into structured, target-centric attribute sets $\mathcal{A}_{E}$, enabling fine-grained alignment and mitigating semantic pseudo-alignment. (C) The OTA-Det Architecture processes multi-granular inputs through a Multi-Modality Backbone and employs a Decoupled Multi-Granular Head to compute independent similarity logits for holistic queries ($\mathbf{S}_{query}$) and fine-grained attributes ($\mathbf{S}_{attr}$). The Unified Correspondence Matrices $\mathbf{M}_Q$ and $\mathbf{M}_A$ serve as supervision targets, optimized via the MAL objective.
  • Figure 3: Visualization of detection results with and without Image-Level Annotation Aggregation. Without aggregation (left), the model produces numerous false positives with high confidence due to sparse supervision. With aggregation (right), the model exhibits improved discriminative capability and accurately localizes the targets corresponding to each referring expression.
  • Figure 4: Qualitative results on open-vocabulary aerial detection. OTA-Det exhibits coarse category-level semantic understanding, supporting multi-query inference and one-to-many detection across varying scales and dense scenes.
  • Figure 5: Qualitative results on visual grounding. Top: Complex expressions with spatial relationships, demonstrating RSVG capability with one-to-many detection. Bottom: Multi-query inference with attribute-set interaction and multi-label association.
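
The Image-Level Annotation Aggregation step described in Figure 2(A) can be illustrated with a minimal Python sketch. It groups sentence-level triplets by image, collects the distinct expressions into a per-image query set, and builds a binary box-to-query matrix that plays a role loosely analogous to $\mathbf{M}_Q$. The function name `aggregate_annotations`, the tuple layout, the exact-coordinate box matching, and the row/column orientation of the matrix are all assumptions made for illustration; the paper's actual data pipeline is not shown on this page and may differ.

```python
from collections import defaultdict


def aggregate_annotations(triplets):
    """Group (image_id, expression, box) triplets into per-image annotations.

    Hypothetical helper: returns {image_id: (queries, boxes, matrix)}, where
    matrix[i][j] == 1 if box i is a target of query j, and 0 otherwise.
    """
    per_image = defaultdict(list)
    for image_id, expression, box in triplets:
        per_image[image_id].append((expression, tuple(box)))

    aggregated = {}
    for image_id, pairs in per_image.items():
        queries, boxes = [], []
        for expression, box in pairs:
            # Keep first-seen order; merge boxes with identical coordinates
            # (an assumption, a real pipeline might merge by IoU instead).
            if expression not in queries:
                queries.append(expression)
            if box not in boxes:
                boxes.append(box)

        # Dense supervision target: every box gets a label for every query.
        matrix = [[0] * len(queries) for _ in boxes]
        for expression, box in pairs:
            matrix[boxes.index(box)][queries.index(expression)] = 1
        aggregated[image_id] = (queries, boxes, matrix)
    return aggregated


# Made-up example triplets, not taken from any benchmark.
triplets = [
    ("img_1", "the white plane on the left", (10, 20, 50, 60)),
    ("img_1", "the largest storage tank", (100, 120, 180, 200)),
    ("img_1", "the plane next to the terminal", (10, 20, 50, 60)),
    ("img_2", "a red car on the bridge", (5, 5, 25, 15)),
]
result = aggregate_annotations(triplets)
queries, boxes, matrix = result["img_1"]
```

In this toy input, the first box of `img_1` is referred to by two different expressions, so its matrix row contains two positive entries. Every other (box, query) pair becomes an explicit negative, which is the kind of dense signal that sparse sentence-level triplets do not provide on their own.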