Counterfactual World Models via Digital Twin-conditioned Video Diffusion
Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath
TL;DR
This work extends world models to support counterfactual reasoning by introducing CWMDT, a three-stage framework that separates perception, reasoning, and synthesis. It constructs digital twin representations from video frames, uses an LLM to reason about how interventions propagate through time, and conditions a video diffusion model on the edited twins to synthesize counterfactual videos. The work formalizes counterfactual world models, reports state-of-the-art performance on RVEBench and FiVE, and uses ablations to show the value of explicit scene representations and reasoning. By making hypothetical scenarios explicit in video forward simulation, the framework supports safer, more robust evaluation of physical AI.
Abstract
World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of this focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object were removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations in which object properties and relationships cannot be selectively modified, which prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework that overcomes these limitations by turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model on the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that CWMDT achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for world models based on video forward simulation.
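
To make the idea of a digital twin as structured text more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. All names (FrameTwin, remove_object, to_text) are hypothetical. The LLM reasoning stage is replaced by a hand-written rule that deletes an object from a given frame onward, and the video diffusion model is omitted entirely; the sketch only shows how an intervention might edit a per-frame scene representation that is then serialized as a text conditioning signal.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FrameTwin:
    """Hypothetical digital twin of one frame: objects and their relations."""
    objects: Dict[str, Dict[str, str]]  # object name -> attributes
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)


def remove_object(twins: List[FrameTwin], target: str, start: int = 0) -> List[FrameTwin]:
    """Toy stand-in for the reasoning stage (assumption: a fixed rule, not an LLM).

    Removes `target` and every relation mentioning it from frame `start` onward,
    leaving the input twins untouched.
    """
    edited = copy.deepcopy(twins)
    for twin in edited[start:]:
        twin.objects.pop(target, None)
        twin.relations = [r for r in twin.relations if target not in (r[0], r[2])]
    return edited


def to_text(twin: FrameTwin) -> str:
    """Serialize a twin as structured text (the format here is an assumption)."""
    object_parts = []
    for name in sorted(twin.objects):
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(twin.objects[name].items()))
        object_parts.append(f"{name}({attrs})")
    relation_parts = [f"{s} {p} {o}" for s, p, o in twin.relations]
    return "objects: " + "; ".join(object_parts) + " | relations: " + "; ".join(relation_parts)


def make_example() -> List[FrameTwin]:
    """Two identical frames with a cup resting on a table."""
    return [
        FrameTwin({"cup": {"color": "red"}, "table": {}}, [("cup", "on", "table")])
        for _ in range(2)
    ]


if __name__ == "__main__":
    factual = make_example()
    counterfactual = remove_object(factual, "cup", start=1)
    for twin in counterfactual:
        print(to_text(twin))
```

In a full system, the serialized text for each frame would be passed to the synthesis stage as a control signal; here it is only printed. Keeping the factual twins unmodified makes it straightforward to compare factual and counterfactual representations side by side.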
