Clone Deterministic 3D Worlds

Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen

TL;DR

This work addresses the challenge of faithfully cloning deterministic 3D environments with world models, arguing that long-horizon fidelity is constrained primarily by latent-space geometry rather than the dynamics model. It introduces Geometrically-Regularized World Models (GRWM), which regularize the autoencoder's latent space using temporal contrastive principles to align with the underlying physical state manifold. Through an oracle-style diagnostic and extensive experiments on deterministic Maze and Minecraft environments, GRWM achieves substantially improved long-horizon fidelity, approaching oracle performance and outperforming vanilla VAE-based baselines. The approach offers a practical, plug-in method to enable high-fidelity simulators for robotics planning and controllable content generation in games, with broad implications for reliable, interpretable world modeling.

Abstract

A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds and neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and robot navigation in static spaces). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through a diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and that the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying the temporal contrastive learning principle as a geometric regularizer can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically-Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity.
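
To make the idea of temporal contrastive regularization more concrete, here is a minimal sketch in plain Python. It is not the paper's loss: the abstract does not give the exact formulation, so this uses a generic InfoNCE-style objective as a stand-in. The function names (`temporal_infonce`, `regularized_loss`), the use of cosine similarity, the choice of "next frame" as the positive, and the `temperature` and `weight` values are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def temporal_infonce(latents, temperature=0.1):
    """InfoNCE-style loss over one trajectory of latent vectors (assumed form).

    For each frame t, the next frame t+1 is treated as the positive and
    every other frame of the trajectory as a negative.
    """
    n = len(latents)
    if n < 3:
        raise ValueError("need at least 3 frames (one positive, one negative)")
    total = 0.0
    for t in range(n - 1):
        # Similarities from frame t to every other frame, in index order.
        logits = [cosine(latents[t], latents[j]) / temperature
                  for j in range(n) if j != t]
        # With index t removed, frame t+1 sits at position t in `logits`.
        positive = logits[t]
        # Numerically stable log-sum-exp for the denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - positive
    return total / (n - 1)

def regularized_loss(recon_loss, latents, weight=0.1):
    """Hypothetical combined objective: reconstruction plus geometric term."""
    return recon_loss + weight * temporal_infonce(latents)

# Toy check: 8 latents along an arc, in temporal order vs. scrambled order.
arc = [[math.cos(0.3 * k), math.sin(0.3 * k)] for k in range(8)]
scrambled = [arc[k] for k in (0, 4, 1, 5, 2, 6, 3, 7)]
smooth_loss = temporal_infonce(arc)
scrambled_loss = temporal_infonce(scrambled)
print(smooth_loss < scrambled_loss)  # True
```

In the toy check, the latent sequence whose consecutive frames are neighbors on the arc gets a lower loss than the same points visited in scrambled order. That is the sense in which such a term rewards latent geometry that follows the temporal structure of the trajectory.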

Paper Structure

This paper contains 37 sections, 4 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Representation quality is the primary bottleneck for world model fidelity. Frame-wise MSE on the Maze 3x3 dataset. (Left) An oracle model using ground-truth states (black dotted) achieves near-zero error, establishing a performance upper bound. In contrast, a standard VAE-based world model (blue dashed) accumulates error rapidly. Our GRWM (green solid) significantly closes this gap by learning a more structurally aligned latent space (Right Bottom), while the VAE's representation remains disorganized (Right Top). For further details, see the experiments section.
  • Figure 2: Top-down visualizations of our three closed environments: M3x3-DET, M9x9-DET, and MC-DET. These maps illustrate the overall layout and are for visualization purposes only; they are not provided as input to the agent. The agent's input is restricted to first-person observations. For a more representative depiction of the agent's surroundings, high-angle perspective views are also included in the Appendix, offering a better sense of the environments' three-dimensional structure and scale.
  • Figure 3: Rollout Performance. Frame-wise MSE between predicted and ground-truth trajectories on (a) M3x3-DET, (b) M9x9-DET, and (c) MC-DET datasets. The oracle model (black dotted line), which operates on the true underlying states, establishes a lower bound on error. For all three dynamics models—Diffusion Forcing (DF), Video Diffusion (VD), and Standard Diffusion (SD)—our GRWM (solid lines) consistently outperforms baselines (dashed lines), demonstrating significantly lower error accumulation over 63 steps and substantially closing the performance gap to the oracle.
  • Figure 3: Effect of the projection head. The projection head reduces reconstruction loss while maintaining latent probing performance.
  • Figure 4: Qualitative comparison of medium-horizon rollouts in M9x9-DET. We visualize consecutive frames around frame 100 and frame 400. Our method (GRWM) maintains high similarity to the ground truth throughout, while the baseline VAE-WM gets trapped near the pink wall, indicating that VAE-WM tends to "teleport" between visually similar but distinct locations.
  • ...and 9 more figures
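
The frame-wise MSE reported in the rollout figures above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `framewise_mse` and the representation of each frame as a flat list of pixel values are assumptions, and the toy data is invented purely to show the shape of an error curve.

```python
def framewise_mse(predicted, ground_truth):
    """Return one MSE value per frame for two equal-length frame sequences.

    Each frame is a flat list of pixel values (hypothetical layout).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("rollouts must have the same number of frames")
    errors = []
    for pred, true in zip(predicted, ground_truth):
        if len(pred) != len(true):
            raise ValueError("frames must have the same number of pixels")
        errors.append(sum((p - q) ** 2 for p, q in zip(pred, true)) / len(true))
    return errors

# Toy rollout: a prediction that drifts a little further from the truth each step.
truth = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
drifting = [[0.1 * t] * 4 for t in range(4)]
curve = framewise_mse(drifting, truth)
```

Plotting `curve` against the step index gives the kind of error-accumulation curve the captions describe. A model that stays close to the ground truth keeps the curve flat, while one that drifts produces a rising curve.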