AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models
Fanglong Yao, Yuanchang Yue, Youzhi Liu, Xian Sun, Kun Fu
TL;DR
AeroVerse introduces the first comprehensive benchmark suite for aerospace embodied intelligence in UAVs. It combines real-world and simulated data (AerialAgent-Ego10k and CyberAgent-Ego500k), five downstream UAV tasks (scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision), and extensive evaluation protocols. It defines SkyAgentX, an embodied large model that unifies perception, reasoning, navigation, and planning via aerospace-embodied chain-of-thought and multitask curriculum learning, achieving an average $8.52\%$ improvement across core tasks. The study also provides automated GPT-4-based evaluation metrics (SkyAgent-Eval) and a broad baseline survey across 2D/3D visual-language models, revealing the limitations of existing approaches on aerospace tasks while demonstrating the potential of specialized aerospace embodied world models. By bridging simulated and real-world data and offering a public benchmark, AeroVerse establishes a foundation for advancing autonomous UAV perception, cognition, and action with standardized tasks, data, and evaluation practices.
Abstract
Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level intelligent agents in indoor scenarios, while UAV intelligent agents remain unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring urban drones from a first-person perspective. We also create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to facilitate the pre-training of the aerospace embodied world model. For the first time, we clearly define 5 downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct the corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodied world model. In addition, we develop SkyAgentEval, a set of GPT-4-based evaluation metrics for the downstream tasks, to assess results comprehensively, flexibly, and objectively, revealing the potential and limitations of 2D/3D visual-language models in UAV-agent tasks. Furthermore, we integrate over 10 2D/3D visual-language models, 2 pre-training datasets, 5 fine-tuning datasets, more than 10 evaluation metrics, and a simulator into the benchmark suite, AeroVerse, which will be released to the community to promote the exploration and development of aerospace embodied intelligence.
