SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments
Xinyi Li, Zaishuo Xia, Weyl Lu, Chenjie Hao, Yubei Chen
TL;DR
This work introduces the SmallWorld benchmark, which evaluates world models in isolation: reward signals are removed and the dynamics are fully observable and controlled, so that results reflect predictive and physical reasoning ability rather than policy performance. Four architectures (RSSM, Transformer, Diffusion, Neural ODE) are tested across six domains, revealing clear differences in long-horizon stability and domain-specific strengths: diffusion-based and Neural ODE approaches excel in structured physics, while Transformer-based models offer robust sequence modeling. The study also highlights trade-offs between robustness and precision and argues for principled evaluation beyond policy-driven, reward-based metrics. SmallWorld thus serves as an interpretable testbed to guide future work on dynamics understanding and representation learning for world models.
Abstract
Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures, including the Recurrent State Space Model (RSSM), Transformer, Diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future improvement directions in representation learning and dynamics modeling.
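The idea of measuring how predictions deteriorate over extended rollouts can be illustrated with a minimal sketch. It is not the benchmark's actual protocol: the oscillator environment, the `model_step` stand-in for a learned model, and the Euclidean state error are all assumptions chosen for illustration. The model is run open loop (it consumes its own predictions), and its state is compared with the ground truth at every step, with no reward signal involved.

```python
import math

def true_step(state, dt=0.05):
    """Ground-truth dynamics (assumed toy example): exact phase-space rotation
    of a frictionless unit-frequency oscillator with state (position, velocity)."""
    x, v = state
    c, s = math.cos(dt), math.sin(dt)
    return (c * x + s * v, -s * x + c * v)

def model_step(state, dt=0.05, freq_error=0.02):
    """Hypothetical stand-in for a learned one-step model: the same rotation,
    but with a slightly wrong angular rate (2% too fast by default)."""
    x, v = state
    angle = (1.0 + freq_error) * dt
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * v, -s * x + c * v)

def rollout_errors(initial_state, horizon):
    """Roll both systems forward from the same fully observed state and
    return the Euclidean state error at each step of the horizon."""
    truth = pred = initial_state
    errors = []
    for _ in range(horizon):
        truth = true_step(truth)
        pred = model_step(pred)  # open loop: the model sees only its own output
        errors.append(math.dist(truth, pred))
    return errors

if __name__ == "__main__":
    errors = rollout_errors((1.0, 0.0), horizon=200)
    for step in (1, 50, 100, 200):
        print(f"step {step:3d}: state error = {errors[step - 1]:.4f}")
```

In this toy setting a small one-step error compounds into a steadily growing phase drift, which is the kind of long-horizon degradation that a per-step error curve makes visible and that a single aggregate reward would hide.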
