AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios

Chenglizhao Chen, Shaofeng Liang, Runwei Guan, Xiaolou Sun, Haocheng Zhao, Haiyun Jiang, Tao Huang, Henghui Ding, Qing-Long Han

TL;DR

This work introduces AerialMind, the first large-scale Referring Multi-Object Tracking (RMOT) benchmark for UAV scenarios, addressing the gap between ground-level RMOT research and aerial perception. It also presents COALA, a semi-automated annotation framework that uses LLMs to speed up the production of high-quality, diverse language labels, and HETrack, an RMOT method whose Co-evolutionary Fusion Encoder and Scale-Adaptive Contextual Refinement enhance cross-modal fusion and small-object perception in aerial scenes. Experiments show HETrack achieving state-of-the-art or competitive performance in both in-domain and cross-domain settings, and attribute-based evaluations reveal strengths under night, occlusion, and fast-motion conditions. Together, the dataset and method move toward practical, language-guided aerial perception and tracking for autonomous systems, while leaving real-time efficiency and deeper LLM-based language reasoning as areas for future improvement.

Abstract

Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which limits the ability of such systems to capture broad-scale scene context and to perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, creating an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop a semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves perception in UAV scenarios. Comprehensive experiments validate the challenging nature of our dataset and the effectiveness of our method.

Paper Structure

This paper contains 37 sections, 8 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Overview of the challenges in the AerialMind dataset.
  • Figure 3: Overview of the four-stage annotation process in the COALA framework. This framework efficiently constructs the AerialMind dataset through multi-agent collaboration and human-computer interaction.
  • Figure 4: Overview of HawkEyeTrack. Our key innovations include the Co-evolutionary Fusion Encoder for synergistic vision-language alignment and Scale-Adaptive Contextual Refinement for enhancing perception in UAV scenarios.
  • Figure 5: Comparison with state-of-the-art models in In-domain Evaluation with different attributes.
  • Figure 6: Qualitative examples on AerialMind. HETrack successfully tracks objects according to the expression.
  • ...and 5 more figures