AirScape: An Aerial Generative World Model with Motion Controllability

Baining Zhao, Rongze Tang, Mingyuan Jia, Ziyou Wang, Fanghang Man, Xin Zhang, Yu Shang, Weichen Zhang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

TL;DR

AirScape introduces the first aerial world model capable of predicting how a 6DoF drone's egocentric observations evolve under motion intentions. It curates an 11k video-intention dataset and trains a video-generation foundation model in two phases: supervised fine-tuning to learn intention controllability, then self-play with a spatio-temporal discriminator to enforce physics-based constraints. Empirically, AirScape outperforms leading video-generation and world-model baselines across FID, FVD, and IAR, achieving over 50% improvement in motion alignment, particularly for 3D rotational dynamics. This work advances embodied spatial imagination for aerial agents and points to real-time, decision-support tools for practical drone operations.

Abstract

Enabling agents to predict the outcomes of their own motion intentions in three-dimensional space is a fundamental problem in embodied intelligence. To explore general spatial imagination capability, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. We then develop a two-phase schedule to train a foundation model (initially devoid of embodied spatial knowledge) into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints. Experimental results demonstrate that AirScape significantly outperforms existing foundation models in 3D spatial imagination capabilities, with an improvement of over 50% in metrics reflecting motion alignment. The project is available at: https://embodiedcity.github.io/AirScape/.

Paper Structure

This paper contains 14 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: In 3D space, AirScape can predict the sequence of observations that would result if a six-degree-of-freedom aerial agent executed a series of actions to achieve an intention, based on current visual observations. AirScape can handle diverse actions (translation, rotation, and their combinations), environments (rural, urban), viewpoints (top-down, horizon), and lighting conditions (daytime, dusk, nighttime), simulating embodied observation characteristics such as perspective and parallax.
  • Figure 2: The proposed dataset includes samples with diverse actions, areas, scenes, and tasks.
  • Figure 3: a. Dataset construction pipeline. b. Proportions of different actions and various scenarios in the dataset. c. Length distribution of intention prompts in the dataset. d. Word cloud of intention prompts in the dataset.
  • Figure 4: The proposed two-phase training schedule aims to develop an aerial world model that is motion-controllable while adhering to physical spatio-temporal constraints. Phase 1 involves supervised fine-tuning (SFT) on the aerial video-intention pair dataset introduced in the dataset section. Phase 2 uses rejection sampling to roll out high-quality samples for iterative SFT. We give an example of this process: The initial frame depicts windsurf boards on the sea, with the drone intending to move forward while keeping them in focus. Among the generated videos, the first is unrealistic as a windsurf board moves like a speedboat, and the last is unreasonable as a board flies into the air. The second video is consistent with real-world physics, with the drone adjusting its gimbal downward to keep the boards in view, making them appear larger in the egocentric perspective.
  • Figure 5: Case analysis of our AirScape and baseline methods, highlighting three common generation issues: limited motion amplitude, shape distortion of spatial objects, and temporal discontinuity.
  • ...and 1 more figure
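
The Phase 2 loop described in the Figure 4 caption (roll out several candidate videos per intention, keep only the physically plausible ones, and reuse them for further SFT) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `rejection_sampling_round`, `toy_generate`, and `toy_score`, the best-of-N selection rule, the `threshold` value, and the second example prompt are all assumptions made for illustration. In a real system the generator would be the video model and the scorer would be a learned spatio-temporal discriminator.

```python
import random
from typing import Callable, Dict, List, Tuple

Prompt = Tuple[str, str]   # (initial frame id, motion intention text)
Video = Dict[str, float]   # hypothetical stand-in for a generated clip


def rejection_sampling_round(
    prompts: List[Prompt],
    generate: Callable[[str, str], Video],
    score: Callable[[str, str, Video], float],
    num_candidates: int = 4,
    threshold: float = 0.5,
) -> List[Tuple[str, str, Video]]:
    """Keep, per prompt, the best-scoring rollout if it clears the threshold."""
    accepted = []
    for frame, intention in prompts:
        # Roll out several candidate videos for the same frame and intention.
        candidates = [generate(frame, intention) for _ in range(num_candidates)]
        # Pick the candidate the (assumed) discriminator rates highest.
        best = max(candidates, key=lambda v: score(frame, intention, v))
        # Reject the whole group if even the best rollout is implausible.
        if score(frame, intention, best) >= threshold:
            accepted.append((frame, intention, best))
    return accepted


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
_rng = random.Random(0)


def toy_generate(frame: str, intention: str) -> Video:
    # Pretend each rollout has a random physical-plausibility value.
    return {"plausibility": _rng.random()}


def toy_score(frame: str, intention: str, video: Video) -> float:
    # A real discriminator would judge motion and spatio-temporal consistency.
    return video["plausibility"]


prompts = [
    ("frame_0", "move forward while keeping the boards in view"),
    ("frame_1", "ascend and rotate left"),  # hypothetical prompt
]
new_sft_data = rejection_sampling_round(prompts, toy_generate, toy_score)
```

The accepted triples in `new_sft_data` would then be added to the fine-tuning set for the next SFT iteration. Selecting at most one rollout per prompt is a design choice of this sketch. Keeping every rollout above the threshold would be an equally plausible variant.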