Aether: Geometric-Aware Unified World Modeling

Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Tong He

TL;DR

Aether introduces a geometry-aware, multi-task world model that unifies 4D reconstruction, action-conditioned prediction, and goal-driven planning by post-training a diffusion backbone on synthetic 4D data. It provides a robust 4D annotation pipeline and uses camera trajectories as the action space, enabling zero-shot transfer to real-world scenes with competitive reconstruction quality and enhanced planning and prediction. The method demonstrates strong zero-shot depth and pose estimation and superior generation/planning performance against baselines, highlighting the value of integrating reconstruction objectives into world modeling. The work lays groundwork for scalable, physically-grounded world models trained with synthetic data and capable of real-time reasoning and planning.

Abstract

The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Notably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.

Paper Structure

This paper contains 32 sections, 11 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: An overview of Aether, trained entirely on synthetic data. The figure highlights its three key capabilities: 4D reconstruction, action-conditioned 4D prediction, and visual planning, all demonstrated on unseen real-world data. The 4D reconstruction examples are derived from videos generated by MovieGen [polyak2025moviegencastmedia] and Veo 2 [veo2], while the action-conditioned prediction uses an observation image from a university classroom. The visual planning example utilizes observation and goal images from an office building. Better viewed when zoomed in. Additional visualizations can be found on our website.
  • Figure 2: Some visualization results of data annotated through our pipeline. Better viewed when zoomed in.
  • Figure 3: Our robust automatic camera annotation pipeline.
  • Figure 4: The overall pipeline of Aether. With different condition combinations, Aether can serve different tasks.
  • Figure 5: More reconstruction visualizations.
  • ...and 2 more figures