AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models

Fanglong Yao, Yuanchang Yue, Youzhi Liu, Xian Sun, Kun Fu

TL;DR

AeroVerse introduces the first comprehensive benchmark suite for aerospace embodied intelligence in UAVs, combining real-world and simulated data (AerialAgent-Ego15k and CyberAgent-Ego500k), five downstream UAV tasks (scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision), and extensive evaluation protocols. It defines SkyAgentX, an embodied large model that unifies perception, reasoning, navigation, and planning via aerospace-embodied chain-of-thought and multitask curriculum learning, achieving an average $8.52\%$ improvement across core tasks. The study also provides automated GPT-4-based evaluation metrics (SkyAgent-Eval) and a broad baseline survey of 2D/3D visual-language models, revealing the limitations of existing approaches on aerospace tasks while demonstrating the potential of specialized aerospace embodied world models. By bridging simulated and real-world data and offering a public benchmark (AeroVerse), the work establishes a foundation for advancing autonomous UAV perception, cognition, and action with standardized tasks, data, and evaluation practices.

Abstract

Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level intelligent agents in indoor scenarios, while UAV intelligent agents remain unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring first-person imagery from urban drones. We also create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to facilitate the pre-training of the aerospace embodied world model. For the first time, we clearly define five downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct the corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodied world model. In addition, we develop SkyAgent-Eval, a set of GPT-4-based evaluation metrics for the downstream tasks, to assess results comprehensively, flexibly, and objectively, revealing the potential and limitations of 2D/3D visual-language models in UAV-agent tasks. Furthermore, we integrate over 10 2D/3D visual-language models, 2 pre-training datasets, 5 fine-tuning datasets, more than 10 evaluation metrics, and a simulator into the benchmark suite, i.e., AeroVerse, which will be released to the community to promote exploration and development of aerospace embodied intelligence.
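
The pairing between the five downstream tasks and their instruction datasets can be written down as a small lookup, which may help when reading the rest of the page. This is only an illustrative sketch: the pairing assumes the abstract lists tasks and datasets in matching order, and the names `DownstreamTask`, `DOWNSTREAM_TASKS`, and `dataset_for` are hypothetical, not part of the AeroVerse release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownstreamTask:
    """One downstream task and its instruction dataset, as named in the abstract."""
    name: str
    instruction_dataset: str


# Assumption: tasks and datasets are paired in the order the abstract lists them.
# The container and field names are illustrative only.
DOWNSTREAM_TASKS = (
    DownstreamTask("scene awareness", "SkyAgent-Scene3k"),
    DownstreamTask("spatial reasoning", "SkyAgent-Reason3k"),
    DownstreamTask("navigational exploration", "SkyAgent-Nav3k"),
    DownstreamTask("task planning", "SkyAgent-Plan3k"),
    DownstreamTask("motion decision", "SkyAgent-Act3k"),
)


def dataset_for(task_name: str) -> str:
    """Return the instruction dataset paired with a downstream task name."""
    for task in DOWNSTREAM_TASKS:
        if task.name == task_name:
            return task.instruction_dataset
    raise KeyError(f"unknown downstream task: {task_name!r}")
```

For example, `dataset_for("task planning")` returns `"SkyAgent-Plan3k"`, and an unrecognized task name raises `KeyError`.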

Paper Structure (27 sections, 5 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: The benchmark suite for the aerospace embodied world model, AeroVerse, comprises one simulation platform (AeroSimulator), two real-virtual pre-training datasets (AerialAgent-Ego15k and CyberAgent-Ego500k), five downstream task instruction datasets (SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k), and more than ten evaluation metrics (SkyAgent-Eval).
  • Figure 2: Clear definitions of the five downstream tasks related to aerospace embodied intelligence encompass all aspects of UAV perception, cognition, and action in an end-to-end manner.
  • Figure 3: Following the principle of real-to-sim-to-real, we have developed a simulator called AeroSimulator for aerospace embodied agents, such as UAVs. This simulator features four realistic urban environments: Shanghai, Shenzhen, a school, and a residential area. It is capable of simulating various lighting conditions and weather scenarios while generating visual outputs, including RGB images, depth maps, and segmentation data. This functionality significantly reduces the disparity between simulated environments and the real physical world.
  • Figure 4: The left and right panels illustrate the construction schemes and statistics of the AerialAgent-Ego15k and CyberAgent-Ego500k datasets, respectively.
  • Figure 5: Concrete examples, statistical results, and diversified instructions of the SkyAgent-Scene3k dataset.
  • ...and 11 more figures