Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

Mutian Xu; Tianbao Zhang; Tianqi Liu; Zhaoxi Chen; Xiaoguang Han; Ziwei Liu

Abstract

Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual and physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into two parts: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, which controls the generative model to synthesize the reactive dynamics of complex environments as synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.

Paper Structure (60 sections, 2 equations, 10 figures, 7 tables)

Introduction
Related Work
Physical simulation.
Learning world models.
Embodied video-generation models.
3D/4D world models.
Our Approach
Kinematics Control
3D robot asset acquisition.
Kinematics-driven 4D robot trajectory expansion.
Spatial-visual projection.
4D Generative Modeling
Preliminary: Latent Video Diffusion.
Multi-modal latent construction.
4D-aware joint modeling.
...and 45 more sections

Figures (10)

Figure 1: We propose Kinema4D, a new action-conditioned 4D generative embodied simulator. Given an initial world image with a robot in a canonical setup space, and an action sequence, our method generates future robot-world interactions in 4D space. It simulates physically plausible and geometrically consistent interactions between complex robot actions and diverse objects across various spatial constraints, providing a new foundation for advancing next-generation embodied simulation.
Figure 2: Overview of our Kinema4D. 1) Kinematics Control: Given a 3D robot with its URDF in the initial canonical setup space, and an action sequence, we drive the 3D robot via kinematics to produce a 4D robot trajectory, which is then projected into a pointmap sequence. This process re-represents raw actions as a spatiotemporal visual signal. 2) 4D Generative Modeling: This signal and the initial main-view world image are sent to a shared VAE encoder, then fused with an occupancy-aligned robot mask and noise. The fused latents are denoised by a Diffusion Transformer [dit] to generate a full future 4D (pointmap+RGB) world sequence.
Figure 3: Samples from our Robo4D-200k dataset. Our dataset provides a comprehensive data foundation by aggregating diverse real-world demonstrations, including DROID [khazatsky2024droid], Bridge [walke2023bridgedata], and RT-1 [rt12022arxiv]. We further incorporate LIBERO [libero] to synthesize a vast array of success and failure cases. Each episode captures a complete robot-world interaction (e.g., pick-and-place), providing the continuous information necessary for robust reasoning. The 4D point clouds viewed from various camera frustums are shown here, demonstrating the spatial precision of our pseudo-annotations.
Figure 4: Qualitative comparison of 2D video synthesis between our Kinema4D and Ctrl-World [guo2026ctrl]. Our method achieves superior fidelity. In contrast, Ctrl-World exhibits distorted actions and unrealistic environmental transitions.
Figure 5: 4D qualitative comparison between our Kinema4D and TesserAct [zhen2025tesseract]. Unlike TesserAct, which hallucinates outcomes, our Kinema4D precisely reflects Ground-Truth executions, including “near-miss” failure cases. For example, in the bottom-left example, our model correctly interprets the spatial gap between the gripper and the plant, even when their RGB textures overlap in 2D views.
...and 5 more figures
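
The Kinematics Control step in Figure 2 (actions, then kinematics, then a projected pointmap sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: a toy two-link planar arm stands in for a URDF-driven robot, the pinhole intrinsics `K` are made up, and the names `planar_arm_points`, `project_to_pointmap`, and `actions_to_pointmap_sequence` are hypothetical. It only shows how a joint-angle sequence might be re-expressed as per-frame pointmaps plus a robot occupancy mask.

```python
import numpy as np

def planar_arm_points(joint_angles, link_lengths=(0.3, 0.25),
                      samples_per_link=20, base=(0.0, 0.0, 1.0)):
    """Toy forward kinematics (assumption: 2-link planar arm, not a real URDF).

    Returns (N, 3) points sampled along the links, in camera coordinates.
    """
    start = np.asarray(base, dtype=float)
    angle = 0.0
    points = []
    for theta, length in zip(joint_angles, link_lengths):
        angle += theta  # joint angles accumulate along the chain
        direction = np.array([np.cos(angle), np.sin(angle), 0.0])
        ts = np.linspace(0.0, 1.0, samples_per_link)[:, None]
        points.append(start + ts * length * direction)
        start = start + length * direction  # end of this link starts the next
    return np.concatenate(points, axis=0)

def project_to_pointmap(points, K, height, width):
    """Pinhole projection with a z-buffer.

    Each covered pixel stores the XYZ of its nearest point; others stay zero.
    Also returns a boolean occupancy mask of covered pixels.
    """
    pointmap = np.zeros((height, width, 3))
    depth = np.full((height, width), np.inf)
    pts = points[points[:, 2] > 1e-6]  # keep points in front of the camera
    uvw = pts @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for p, ui, vi in zip(pts[inside], u[inside], v[inside]):
        if p[2] < depth[vi, ui]:  # keep the closest point per pixel
            depth[vi, ui] = p[2]
            pointmap[vi, ui] = p
    return pointmap, np.isfinite(depth)

def actions_to_pointmap_sequence(actions, K, height=64, width=64):
    """Map a (T, num_joints) action sequence to (T, H, W, 3) pointmaps and (T, H, W) masks."""
    frames, masks = [], []
    for joint_angles in actions:
        pm, mask = project_to_pointmap(planar_arm_points(joint_angles), K, height, width)
        frames.append(pm)
        masks.append(mask)
    return np.stack(frames), np.stack(masks)

# Hypothetical intrinsics and an 8-step action sequence sweeping both joints.
K = np.array([[50.0, 0.0, 32.0],
              [0.0, 50.0, 32.0],
              [0.0, 0.0, 1.0]])
actions = np.linspace([0.0, 0.0], [np.pi / 4, np.pi / 4], num=8)
pointmaps, masks = actions_to_pointmap_sequence(actions, K)
```

In this sketch the pointmap keeps metric XYZ per pixel rather than only depth, which is what lets two surfaces that overlap in the image remain distinguishable in 3D, in the spirit of the near-miss case described for Figure 5.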
