Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

TL;DR

This work extends world models to support counterfactual reasoning by introducing CWMDT, a three-stage framework that separates perception, reasoning, and synthesis. It constructs digital twin representations from video frames, uses an LLM to reason about how interventions propagate through time, and conditions a video diffusion model on the edited twins to synthesize counterfactual videos. The work formalizes counterfactual world models, reports state-of-the-art performance on RVEBench and FiVE, and uses ablations to show the value of explicit scene representations and reasoning. By making hypothetical scenario analysis explicit in video forward simulation, the framework is intended to support safer, more robust evaluation of physical AI.

Abstract

World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of this focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object were removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations in which object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework that overcomes these limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model on the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that CWMDT achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.
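
The distinction drawn in the abstract can be written out roughly as follows. This is a notational sketch, not the paper's own formalization: the symbols $s_t$, $v_t$, $\tilde{s}_{t:t+k}$, and $\tilde{v}_{t:t+k}$ follow the Figure 1 caption, while $f$, $g$, $c$ (control signal), $I$ (intervention), and the operator names $\mathrm{Perceive}$, $\mathrm{Reason}$, and $\mathrm{Synthesize}$ are assumptions introduced here for illustration. How the edited first frame $\tilde{v}_t$ is obtained is not described in this summary.

$$
\begin{aligned}
\text{forward world model:}\quad & \hat{v}_{t+1:t+k} = f(v_{\le t},\, c) \\
\text{counterfactual world model:}\quad & \tilde{v}_{t:t+k} = g(v_{\le t},\, I) \\
\text{(1) perception:}\quad & s_t = \mathrm{Perceive}(v_t) \\
\text{(2) reasoning:}\quad & \tilde{s}_{t:t+k} = \mathrm{Reason}(s,\, I) \\
\text{(3) synthesis:}\quad & \tilde{v}_{t:t+k} = \mathrm{Synthesize}(\tilde{v}_t,\, \tilde{s}_{t:t+k})
\end{aligned}
$$

Here $s$ stands for the digital twin representations extracted from the observed frames. Under this reading, the intervention $I$ never touches pixels directly: it acts only on the structured representation, and the diffusion model is left to render the result.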

Paper Structure

This paper contains 26 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Method overview for CWMDT. Our approach consists of three stages. (1) Digital twin representation construction: Vision models extract structured scene representations $s_t$ from video frames $v_t$. (2) Counterfactual reasoning: An LLM processes intervention queries to predict temporal evolution, generating modified digital twin representations $\tilde{s}_{t:t+k}$. (3) Video synthesis: A fine-tuned diffusion model generates counterfactual videos $\tilde{v}_{t:t+k}$ conditioned on the edited first frame $\tilde{v}_t$ and the modified digital twin representation $\tilde{s}_{t:t+k}$.
  • Figure 2: Qualitative comparison of counterfactual world model capabilities across different methods. Two intervention scenarios test whether models can predict alternative temporal sequences. CWMDT correctly generates counterfactual trajectories. Compared methods fail to execute these interventions. Red boxes indicate regions where intervention effects should appear.
  • Figure 3: Demonstration of diverse counterfactual trajectory generation from a single intervention. Given the query to replace a car with a motorcycle, CWMDT produces three distinct plausible scenarios: maintaining the original motion pattern (Case 1), accelerating beyond the frame boundary and reentering (Case 2), and executing agile cornering maneuvers (Case 3). Each trajectory respects physical constraints while exploring different behavioral possibilities that could arise from the same initial intervention. Baseline methods either fail to execute the vehicle replacement or produce visually inconsistent results, lacking the ability to reason about multiple plausible outcomes.
  • Figure 4: Diverse counterfactual scenarios generated from a single original video sequence using the proposed CWMDT.
  • Figure 5: Evolution of digital twin representations through the CWMDT pipeline. Left: Original digital twin representation extracted from video, containing per-frame descriptions and numerical traces for area, depth, and centroid coordinates. Middle: Condensed representation retaining scene summaries and compressed spatial attributes through compact notation for regions, motion paths, and depth spans. Right: LLM-edited representation reflecting the counterfactual intervention. The LLM modifies not only textual descriptions but also spatial trajectories, depth evolution, and motion patterns to maintain physical coherence under the hypothetical condition.
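
The representation stages described in the Figure 5 caption can be illustrated with a small sketch. It is a minimal toy under stated assumptions, not the authors' implementation: the class `ObjectTrack`, its field names, the `condense` and `remove_object` helpers, and the example values are all hypothetical. In particular, `remove_object` is a hand-written stand-in for the LLM editing step, which according to the caption also revises trajectories, depth, and motion patterns.

```python
from dataclasses import dataclass


@dataclass
class ObjectTrack:
    """Per-frame traces for one object (hypothetical schema, assumed non-empty)."""
    name: str
    description: str
    area: list[float]
    depth: list[float]
    centroid: list[tuple[float, float]]


def condense(track: ObjectTrack, max_waypoints: int = 3) -> dict:
    """Compress per-frame traces into region, motion-path, and depth-span fields."""
    xs = [x for x, _ in track.centroid]
    ys = [y for _, y in track.centroid]
    n = len(track.centroid)
    # Evenly spaced waypoint indices that always include the first and last frame.
    k = min(max_waypoints, n)
    idx = [0] if k == 1 else [round(i * (n - 1) / (k - 1)) for i in range(k)]
    return {
        "name": track.name,
        "summary": track.description,
        "region": (min(xs), min(ys), max(xs), max(ys)),  # box covering all centroids
        "motion_path": [track.centroid[i] for i in idx],
        "depth_span": (min(track.depth), max(track.depth)),
        "mean_area": sum(track.area) / len(track.area),
    }


def remove_object(twin: list[dict], name: str) -> list[dict]:
    """Toy intervention: drop one object from the condensed twin.

    Stand-in for the LLM edit; a real edit would also update the other objects.
    """
    return [obj for obj in twin if obj["name"] != name]


# Hypothetical example values, chosen only to exercise the functions.
car = ObjectTrack(
    name="car",
    description="red car driving left to right",
    area=[100, 110, 120, 130, 140],
    depth=[5.0, 4.8, 4.6, 4.4, 4.2],
    centroid=[(10, 50), (20, 50), (30, 51), (40, 52), (50, 52)],
)
tree = ObjectTrack(
    name="tree",
    description="static tree in the background",
    area=[300, 300],
    depth=[9.0, 9.0],
    centroid=[(80, 20), (80, 20)],
)

condensed_twin = [condense(car), condense(tree)]
edited_twin = remove_object(condensed_twin, "car")
```

Condensing before editing keeps the text handed to a language model short while preserving where each object is, how it moves, and how far away it is, which are the attributes the caption lists for the middle panel.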