AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Xuelong Li, Bin Zhao
TL;DR
This work introduces AerialVG, the first real-world aerial visual grounding benchmark. It targets challenges that range from wide fields of view and small object sizes to complex spatial relations among multiple entities. The dataset contains 5,000 high-resolution UAV images with approximately 50,000 descriptive annotations and 103,000 annotated objects, underscoring the need for spatial reasoning in aerial contexts. To tackle these challenges, the authors propose a model that combines Hierarchical Cross-Attention with a Relation-Aware Grounding module, trained in a two-stage regime on both standard VG datasets and the new AerialVG data. Experimental results show strong performance gains on AerialVG and competitive results on the RefCOCO benchmarks, highlighting the pivotal role of spatial relations in aerial grounding and offering a foundation for UAV perception and navigation research.
Abstract
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges: for example, appearance-based grounding is insufficient to distinguish among multiple visually similar objects, so positional relations must be emphasized. In addition, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. In particular, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose a model designed specifically for the AerialVG task, in which a Hierarchical Cross-Attention module focuses on target regions and a Relation-Aware Grounding module infers positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
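To make the role of positional relations concrete, the following is a minimal illustrative sketch, not the authors' model. It assumes that candidate boxes for several visually similar objects are already available, and it picks the target using only a spatial relation to a reference object. The names `Box`, `RELATIONS`, and `ground_by_relation`, the four-word relation vocabulary, and the nearest-match tie-break are all hypothetical choices made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (x grows right, y grows down)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


# Hypothetical relation vocabulary; each predicate compares box centers
# of a candidate target `t` and the reference object `r`.
RELATIONS = {
    "left of": lambda t, r: t.center[0] < r.center[0],
    "right of": lambda t, r: t.center[0] > r.center[0],
    "above": lambda t, r: t.center[1] < r.center[1],
    "below": lambda t, r: t.center[1] > r.center[1],
}


def ground_by_relation(candidates, reference, relation):
    """Return the candidate nearest to `reference` among those that satisfy
    `relation`, or None if no candidate satisfies it."""
    holds = RELATIONS[relation]
    matching = [c for c in candidates if holds(c, reference)]
    if not matching:
        return None

    def squared_distance(c):
        (cx, cy), (rx, ry) = c.center, reference.center
        return (cx - rx) ** 2 + (cy - ry) ** 2

    return min(matching, key=squared_distance)


# Three visually similar cars and one reference object (say, a bus).
# Appearance alone cannot tell the cars apart; the relation can.
cars = [Box(100, 200, 140, 230), Box(400, 205, 440, 235), Box(700, 198, 740, 228)]
bus = Box(500, 190, 580, 240)

target = ground_by_relation(cars, bus, "left of")
print(target)  # the second car: the nearest one whose center lies left of the bus
```

Two of the three cars lie to the left of the bus, so the relation alone is still ambiguous here; the nearest-match rule is one simple way to break the tie, and a learned model would have to resolve such cases from the description itself.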
