EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia

Abstract

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
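
The reward described in the abstract can be illustrated with a small sketch. The code below is an assumption-laden illustration, not the paper's implementation: the function name `executability_reward`, the weights, the squared finite-difference penalties, and the joint-limit arrays `lower` and `upper` are all hypothetical. It scores an action sequence (as an IDM might decode it from a generated video) by penalizing velocity, acceleration, and jerk, plus any excursion outside assumed joint limits. The scale is not meant to match the reward scores reported in the paper's figures.

```python
import numpy as np

def executability_reward(actions, lower, upper, dt=1.0,
                         w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_limit=10.0):
    """Hypothetical smoothness-plus-limits reward for a decoded action sequence.

    actions: array of shape (T, D), joint positions per frame (T >= 4).
    lower, upper: arrays of shape (D,), assumed joint limits.
    All weights are illustrative assumptions, not values from the paper.
    """
    actions = np.asarray(actions, dtype=float)
    # Finite differences approximate velocity, acceleration, and jerk.
    vel = np.diff(actions, n=1, axis=0) / dt
    acc = np.diff(actions, n=2, axis=0) / dt**2
    jerk = np.diff(actions, n=3, axis=0) / dt**3
    smooth_cost = (w_vel * np.mean(vel**2)
                   + w_acc * np.mean(acc**2)
                   + w_jerk * np.mean(jerk**2))
    # Distance by which each joint value leaves the allowed range (0 if inside).
    violation = np.maximum(actions - upper, 0.0) + np.maximum(lower - actions, 0.0)
    limit_cost = w_limit * np.mean(violation)
    # Higher is better: smooth, in-bounds sequences score closest to zero.
    return -(smooth_cost + limit_cost)

# Usage: a smooth 7-joint trajectory versus the same trajectory with jitter.
t = np.linspace(0.0, 1.0, 50)[:, None]
smooth = 0.5 * np.sin(np.pi * t) * np.ones((1, 7))
rng = np.random.default_rng(0)
jittery = smooth + 0.2 * rng.standard_normal(smooth.shape)
lower, upper = -np.ones(7), np.ones(7)
r_smooth = executability_reward(smooth, lower, upper)
r_jittery = executability_reward(jittery, lower, upper)
print(r_smooth > r_jittery)
```

In this sketch, jitter of the kind that visual artifacts induce inflates the higher-order differences, so the jittery sequence receives the lower score. This mirrors the abstract's point that the reward stays informative even when the video itself is badly corrupted.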

Paper Structure (24 sections, 9 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Overview of executable video world modeling. (a) Standard video world models generate rollouts with kinematic artifacts, leading to unreliable IDM-predicted actions, illustrating the executability gap. (b) Our reward-aligned world model optimizes video generation using IDM-derived rewards, producing physically plausible rollouts that result in feasible robot actions.
  • Figure 2: Illustration of how visual artifacts translate into kinematic violations. The plots display the 7-DOF joint angles (in radians) for the left arm, ordered from the base (Joint 1) to the gripper (Joint 7). (Top) A high-quality generated video. The decoded actions are smooth and physically executable, yielding a high reward score of 7.94. (Bottom) A failure case exhibiting severe visual artifacts (highlighted in red). The IDM translates these artifacts into erratic, high-frequency jitter, particularly visible in the distal joints (e.g., Joints 6 and 7), leading to a low reward of 3.04.
  • Figure 3: Qualitative comparison of generated visual plans. Unaligned models (Ours w/o RL, Vidar) often exhibit severe morphological deformations and joint melting (red circles). In contrast, our method maintains strict kinematic integrity (green circle).
  • Figure 4: Real-world deployment and physical fidelity. We visualize the synthesized video sequences (left) alongside their corresponding real-world robot executions (right).
  • Figure 5: Common failure modes observed in unaligned video world models during real-world execution. Implausible kinematics: violations of rigid-body consistency, including (a) morphological deformation, (b) ambiguous joint articulation, and (c) temporal discontinuity. Wrong contact: physically inconsistent object interaction. Incorrect goal: failure to make progress toward the instruction-conditioned objective.
  • ...and 6 more figures