RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes

Zhichao Sun, Yepeng Liu, Zhiling Su, Huachao Zhu, Yuliang Gu, Yuda Zou, Zelong Liu, Gui-Song Xia, Bo Du, Yongchao Xu

TL;DR

RefDrone tackles the gap in referring expression comprehension for drone imagery by introducing a challenging aerial REC benchmark with multi-target/no-target and context-rich expressions. It introduces RDAnnotator, a semi-automated LVLM-based annotation pipeline, and NGDINO, an extended grounding model incorporating explicit object-count reasoning via number heads, number-queries, and number cross-attention. Extensive zero-shot and fine-tuning experiments demonstrate that RefDrone is significantly more difficult than ground-level REC benchmarks, while NGDINO achieves state-of-the-art performance on RefDrone and competitive gains on gRefCOCO and RSVG. The work sets the stage for robust, scalable drone-grounded REC research with practical implications for Embodied AI in aerial environments.

Abstract

Drones have become prevalent robotic platforms with diverse applications, showing significant potential in Embodied Artificial Intelligence (Embodied AI). Referring Expression Comprehension (REC) enables drones to locate objects based on natural language expressions, a crucial capability for Embodied AI. Despite advances in REC for ground-level scenes, aerial views introduce unique challenges including varying viewpoints, occlusions and scale variations. To address this gap, we introduce RefDrone, a REC benchmark for drone scenes. RefDrone reveals three key challenges in REC: 1) multi-scale and small-scale target detection; 2) multi-target and no-target samples; 3) complex environment with rich contextual expressions. To efficiently construct this dataset, we develop RDAgent (referring drone annotation framework with multi-agent system), a semi-automated annotation tool for REC tasks. RDAgent ensures high-quality contextual expressions and reduces annotation cost. Furthermore, we propose Number GroundingDINO (NGDINO), a novel method designed to handle multi-target and no-target cases. NGDINO explicitly learns and utilizes the number of objects referred to in the expression. Comprehensive experiments with state-of-the-art REC methods demonstrate that NGDINO achieves superior performance on both the proposed RefDrone and the existing gRefCOCO datasets. The dataset and code are publicly available at https://github.com/sunzc-sunny/refdrone.

Paper Structure

This paper contains 27 sections, 5 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Examples of the various challenges in RefDrone dataset.
  • Figure 2: The overview of the RefDrone annotation process with RDAnnotator. Multiple specialized LVLM-based modules collaborate both with each other and human annotators through iterative feedback loops to generate high-quality annotations.
  • Figure 3: Object number distribution per expression in gRefCOCO and RefDrone datasets.
  • Figure 4: Object size distribution analysis. (a) Object size distribution in RefDrone dataset (small: $< 32^2 = 1024$ pixels, normal: 1024 to 9216 pixels, large: $> 96^2 = 9216$ pixels). (b) Object size histograms in RefDrone and gRefCOCO datasets.
  • Figure 5: Word frequency visualization in RefDrone dataset.
  • ...and 4 more figures
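
The size categories quoted in the Figure 4 caption can be illustrated with a short sketch. It is not taken from the RefDrone code base: the function names, the (x, y, width, height) box format, and the use of width times height as the object area are all assumptions made for illustration. Only the two thresholds (32^2 = 1024 and 96^2 = 9216 pixels) come from the caption.

```python
from collections import Counter

# Thresholds from the Figure 4 caption (areas in pixels).
SMALL_MAX_AREA = 32 ** 2   # 1024: areas below this are "small"
LARGE_MIN_AREA = 96 ** 2   # 9216: areas above this are "large"


def size_bucket(width, height):
    """Return 'small', 'normal', or 'large' for a box of the given size.

    Assumption: object area is approximated by box width * height.
    """
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area > LARGE_MIN_AREA:
        return "large"
    # 1024 <= area <= 9216
    return "normal"


def size_distribution(boxes):
    """Count boxes per size bucket.

    Assumption: each box is a hypothetical (x, y, width, height) tuple.
    """
    return Counter(size_bucket(w, h) for (_, _, w, h) in boxes)


# Made-up boxes, purely to exercise the function.
example_boxes = [
    (10, 20, 12, 18),    # area 216
    (0, 0, 32, 32),      # area 1024
    (5, 5, 96, 96),      # area 9216
    (40, 40, 120, 100),  # area 12000
]
print(size_distribution(example_boxes))
```

Boxes whose area falls exactly on a threshold are treated as normal here, which follows the caption's strict inequalities for small and large.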