AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Xuelong Li, Bin Zhao

TL;DR

This work introduces AerialVG, the first real-world aerial visual grounding benchmark, addressing challenges from wide field-of-view and small object sizes to complex spatial relations among multiple entities. It presents a dataset of 5,000 high-resolution UAV images with approximately 50,000 descriptive annotations and 103,000 annotated objects, underscoring the need for spatial reasoning in aerial contexts. To tackle these challenges, the authors propose a model that combines Hierarchical Cross-Attention with a Relation-Aware Grounding module, trained in a two-stage regime on both standard VG datasets and the new AerialVG data. Experimental results show strong performance gains on AerialVG and competitive results on RefCOCO benchmarks, highlighting the pivotal role of spatial relations in aerial grounding and offering a foundation for UAV perception and navigation research.

Abstract

Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, so positional relations must be emphasized. In addition, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. In particular, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose a model designed specifically for the AerialVG task, in which a Hierarchical Cross-Attention is devised to focus on target regions and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.

Paper Structure

This paper contains 13 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An example from the AerialVG dataset. The picture contains many red cars, so localizing the target accurately requires reasoning over its positional relationship to surrounding auxiliary objects. To this end, we carefully annotated the data so that the spatial relationships between objects are fully captured, thereby improving the accuracy of AerialVG.
  • Figure 2: Resolution distribution and word cloud of the AerialVG dataset. Most images in the AerialVG dataset are high-resolution, and the most frequent words describe vehicle type, color, and location.
  • Figure 3: Comparison between different datasets. Words refers to the average number of words in the text; Objects refers to the average number of objects in each image; Area refers to the proportion of irrelevant areas; Resolution refers to the maximum resolution of the dataset images; and Captions refers to the average number of annotations contained in each image.
  • Figure 4: The architecture of the AerialVG model. Hierarchical Cross-Attention directs the model’s focus to potential object locations, while the Relation-Aware Grounding module enables the model to perceive spatial relationships between objects.
  • Figure 5: Example of a qualitative test. Annotation: A blue car is parking on the right road with a white sedan above it. The model output score for this result is 0.70.
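
The Figure 5 annotation identifies the target through a positional relation ("a white sedan above it"). As a minimal illustration of what such a relation means geometrically, the sketch below classifies where a reference box lies relative to a target box by comparing box centers in image coordinates. This is not the paper's Relation-Aware Grounding module: the `Box` type, the `coarse_relation` helper, the four-way relation vocabulary, and the pixel coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical type).

    The origin is the top-left image corner, so smaller y means higher up.
    """
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def coarse_relation(target: Box, reference: Box) -> str:
    """Return where `reference` lies relative to `target`.

    The dominant axis of the center offset decides between the vertical
    pair (above/below) and the horizontal pair (left/right). Ties and
    coincident centers fall to the vertical pair.
    """
    tx, ty = target.center
    rx, ry = reference.center
    dx, dy = rx - tx, ry - ty
    if abs(dy) >= abs(dx):
        return "above" if dy < 0 else "below"
    return "left" if dx < 0 else "right"


# Made-up coordinates loosely mirroring the Figure 5 annotation.
blue_car = Box(400, 300, 460, 340)
white_sedan = Box(405, 220, 465, 260)
relation = coarse_relation(blue_car, white_sedan)
print(f"The white sedan is {relation} the blue car.")
```

A center-offset rule like this ignores box size, overlap, and camera orientation, which is one reason a learned relation module may be preferable on real aerial imagery.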