Clone Deterministic 3D Worlds
Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen
TL;DR
This work addresses the challenge of faithfully cloning deterministic 3D environments with world models, arguing that long-horizon fidelity is constrained primarily by latent-space geometry rather than the dynamics model. It introduces Geometrically-Regularized World Models (GRWM), which regularize the autoencoder's latent space using temporal contrastive principles to align with the underlying physical state manifold. Through an oracle-style diagnostic and extensive experiments on deterministic Maze and Minecraft environments, GRWM achieves substantially improved long-horizon fidelity, approaching oracle performance and outperforming vanilla VAE-based baselines. The approach offers a practical, plug-in method to enable high-fidelity simulators for robotics planning and controllable content generation in games, with broad implications for reliable, interpretable world modeling.
Abstract
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds and neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and robot navigation in static spaces). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through a diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and that the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying the temporal contrastive learning principle as a geometric regularizer can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically-Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity.
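To make the idea of temporal contrastive regularization concrete, the following is a minimal sketch rather than the GRWM implementation, whose exact objective is not given in this abstract. It assumes an InfoNCE-style loss with cosine similarity in which, for each frame of a trajectory, the next frame is the positive and every other frame in the same trajectory is a negative. The names `temporal_contrastive_loss`, `regularized_loss`, `temperature`, and `weight` are hypothetical and chosen only for illustration.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (leave all-zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def temporal_contrastive_loss(latents, temperature=0.1):
    """InfoNCE-style loss over one trajectory of latent vectors.

    Assumption (not from the paper): frame t+1 is the positive for frame t,
    and all remaining frames of the trajectory act as negatives.
    """
    z = [_normalize(v) for v in latents]
    n = len(z)
    if n < 3:
        raise ValueError("need at least 3 frames: anchor, positive, negative")

    total = 0.0
    for t in range(n - 1):  # the last frame has no successor to use as positive
        # Cosine similarity to every other frame, scaled by the temperature.
        logits = {
            j: sum(a * b for a, b in zip(z[t], z[j])) / temperature
            for j in range(n)
            if j != t
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits.values())
        )
        total += log_denominator - logits[t + 1]
    return total / (n - 1)


def regularized_loss(reconstruction_loss, latents, weight=0.1, temperature=0.1):
    """Autoencoder reconstruction loss plus the weighted contrastive term."""
    return reconstruction_loss + weight * temporal_contrastive_loss(
        latents, temperature
    )


# Toy usage: latents that move smoothly along an arc versus the same
# latents visited in a scrambled temporal order.
angles = [0.3 * i for i in range(8)]
smooth = [[math.cos(a), math.sin(a)] for a in angles]
scrambled = [smooth[i] for i in (0, 4, 1, 5, 2, 6, 3, 7)]
print(temporal_contrastive_loss(smooth), temporal_contrastive_loss(scrambled))
```

In this toy setting the smoothly ordered trajectory receives a lower loss than the scrambled one, which illustrates how such a term can favor latents in which temporally adjacent observations stay close. In an actual training loop the same term would be computed on encoder outputs with an automatic-differentiation framework and added to the autoencoder's reconstruction objective.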
