ChronoDreamer: Action-Conditioned World Model as an Online Simulator for Robotic Planning
Zhenhao Zhou, Dan Negrut
TL;DR
ChronoDreamer is an action-conditioned world model that jointly predicts future RGB frames, contact maps, and joint angles for contact-rich robotic manipulation. It uses a spatial-temporal transformer trained with MaskGIT-style masked prediction and encodes contact as depth-weighted Gaussian splat images, so contact can be supervised in the same image space as the RGB frames. At inference, a vision-language model judges each predicted rollout for collision likelihood, and actions whose rollouts are judged unsafe are rejected before execution. On DreamerBench, a simulation dataset generated with Project Chrono, qualitative results show spatially coherent predictions during non-contact motion and plausible contact predictions, and the vision-language judge distinguishes collision from non-collision trajectories. The work pairs dense perceptual prediction with a learned safety check for planning with imagined rollouts.
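The rejection step described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' interface: `imagine` stands in for the world model's rollout, `collision_prob` stands in for the vision-language judge, and the 0.5 threshold is arbitrary.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Action = Tuple[float, ...]   # hypothetical: one joint-space command
Rollout = List[Action]       # stand-in for predicted frames and contact maps


def select_safe_action(
    candidates: Sequence[Action],
    imagine: Callable[[Action], Rollout],
    collision_prob: Callable[[Rollout], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Optional[Action]:
    """Return the first candidate whose imagined rollout the judge accepts."""
    for action in candidates:
        rollout = imagine(action)                # world-model prediction
        if collision_prob(rollout) < threshold:  # judge scores the rollout
            return action                        # accepted before execution
    return None                                  # every candidate was rejected


# Toy stand-ins (assumptions): the "world model" repeats the action three
# times, and the "judge" flags any command with magnitude above 1.0.
def toy_imagine(action: Action) -> Rollout:
    return [action] * 3


def toy_judge(rollout: Rollout) -> float:
    return 1.0 if any(abs(step[0]) > 1.0 for step in rollout) else 0.0
```

With these stand-ins, `select_safe_action([(2.0,), (0.5,)], toy_imagine, toy_judge)` skips the first candidate and returns the second. Returning `None` when all candidates fail leaves the fallback policy (resample, stop, or replan) to the caller.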
Abstract
We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatial-temporal transformer trained with MaskGIT-style masked prediction. Contact is encoded as depth-weighted Gaussian splat images that render 3D forces into a camera-aligned format suitable for vision backbones. At inference, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios. Qualitative results demonstrate that the model preserves spatial coherence during non-contact motion and generates plausible contact predictions, while the vision-language judge distinguishes collision from non-collision trajectories.
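As a rough illustration of the contact encoding, the sketch below renders 3D contact points into a camera-aligned image. The abstract does not specify the weighting, so the details are assumptions: a pinhole projection, a Gaussian whose pixel width shrinks as 1/depth, and a peak amplitude equal to the force magnitude. The function name, `focal`, and `radius` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical contact record: (x, y, z, force) in the camera frame, z forward.
Contact = Tuple[float, float, float, float]


def render_contact_splats(
    contacts: Sequence[Contact],
    width: int = 33,
    height: int = 33,
    focal: float = 32.0,   # assumed focal length in pixels
    radius: float = 0.05,  # assumed physical splat radius in metres
) -> List[List[float]]:
    """Accumulate one Gaussian splat per contact into a height x width image."""
    image = [[0.0] * width for _ in range(height)]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # principal point
    for x, y, z, force in contacts:
        if z <= 0.0:
            continue  # behind the camera: not visible
        u = focal * x / z + cx  # pinhole projection to pixel coordinates
        v = focal * y / z + cy
        # Assumed depth weighting: nearer contacts get a wider footprint.
        sigma = max(focal * radius / z, 0.5)
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                image[row][col] += force * math.exp(-d2 / (2.0 * sigma * sigma))
    return image
```

Because the output is an ordinary single-channel image aligned with the camera, it can be stacked with the RGB frames and passed through the same kind of vision backbone, which is the property the abstract emphasizes.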
