Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero

TL;DR

Causal-JEPA introduces an object-centric world model that uses object-level masking as latent interventions to force interaction-aware predictions. By operating on a compact set of object slots with a joint embedding predictor, it achieves strong counterfactual reasoning gains in visual reasoning and substantially improved planning efficiency in MPC, using only a fraction of latent tokens compared with patch-based methods. The authors provide a theoretical account of how object-level masking induces a causal inductive bias via intervention-stable influence neighborhoods, and validate the approach across CLEVRER and Push-T tasks with comprehensive ablations. Overall, C-JEPA blends JEPA-style latent prediction with principled object-level interventions to yield efficient, interaction-aware world models with practical benefits for reasoning and control.

Abstract

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
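To make the object-level masking described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the slot history is stored as nested lists indexed `history[t][i]` (time step, then object); the function name `mask_object` and the fixed `MASK` vector are hypothetical, and a real model would more plausibly use a learned mask token. The point it illustrates is that once an object's slots are replaced at every history step, the predictor input no longer contains that object's own trajectory, so its state can only be inferred from the remaining objects.

```python
from typing import List

Slot = List[float]          # one object's latent vector at one time step
History = List[List[Slot]]  # history[t][i]: time step t, object i


def mask_object(history: History, obj: int, mask_token: Slot) -> History:
    """Return a copy of `history` with object `obj` replaced by `mask_token` at every step."""
    return [
        [list(mask_token) if i == obj else list(slot) for i, slot in enumerate(frame)]
        for frame in history
    ]


# Toy history with made-up values: 3 time steps, 2 objects, 2-dimensional slots.
history = [
    [[0.0, 1.0], [5.0, 5.0]],
    [[0.1, 1.1], [5.0, 4.0]],
    [[0.2, 1.2], [5.0, 3.0]],
]
MASK = [0.0, 0.0]  # hypothetical stand-in for a mask token

masked = mask_object(history, obj=1, mask_token=MASK)
```

The function returns a copy, so the unmasked `history` stays available as the prediction target for the hidden slots.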

Paper Structure

This paper contains 75 sections, 3 theorems, 17 equations, 5 figures, and 8 tables.

Key Result

Theorem 1

Consider the masked history prediction loss from Eq. eq:obj_extend for object $i$ at time $t$: where $\hat{z}_t^i$ is computed from the observable history $Z_T^{(-i)}$ following Eq. eq:predict, and the expectation is taken with respect to the conditional distribution of $z_t^i$ given $Z_T^{(-i)}$. Under Assumptions assump:main_temporal--assump:finite and Definition def:influence_neighborhood, Con

Figures (5)

  • Figure 1: C-JEPA training pipeline. A frozen encoder extracts object-centric representations, followed by selective masking across history. The predictor recovers masked history slots and predicts future latent states, conditioned on optional auxiliary variables, via a joint masked-history and forward-prediction objective.
  • Figure 2: Object-level latent interventions in C-JEPA. Selected object slots are masked across time, except for a minimal identity anchor, forcing the predictor to infer object dynamics from interactions with other objects and auxiliary variables.
  • Figure 3: Comparison of auxiliary variable integration methods.
  • Figure A1: Slot visualization from object-centric encoders.
  • Figure A2: Diagram of DINO-WM [16], OC-DINO-WM, and OC-JEPA.
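The captions of Figures 1 and 2 describe two ingredients: masking selected slots across the history while keeping a minimal identity anchor, and a joint masked-history and forward-prediction objective. The sketch below shows one way these could fit together. It rests on several assumptions that the captions do not state: the anchor is taken to be the slot at a single history step (`anchor_step`, defaulting to the first), the error is a squared distance in latent space, and the two terms are averaged separately and summed with equal weight. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Slot = List[float]
Grid = List[List[Slot]]  # grid[t][i]: latent of object i at time step t


def masked_positions(
    num_history: int, masked_objects: Set[int], anchor_step: int = 0
) -> Set[Tuple[int, int]]:
    """(t, i) pairs hidden from the predictor: masked objects at every
    history step except the anchor step (assumed to be a single step)."""
    return {
        (t, i)
        for t in range(num_history)
        for i in masked_objects
        if t != anchor_step
    }


def squared_error(a: Slot, b: Slot) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def joint_loss(
    pred: Grid,
    target: Grid,
    num_history: int,
    masked_objects: Set[int],
    anchor_step: int = 0,
) -> float:
    """Mean error over hidden history slots plus mean error over all future slots.
    Equal weighting of the two terms is an assumption of this sketch."""
    hidden = masked_positions(num_history, masked_objects, anchor_step)
    history_terms = [squared_error(pred[t][i], target[t][i]) for t, i in sorted(hidden)]
    future_terms = [
        squared_error(p, z)
        for t in range(num_history, len(target))
        for p, z in zip(pred[t], target[t])
    ]
    history_loss = sum(history_terms) / len(history_terms) if history_terms else 0.0
    future_loss = sum(future_terms) / len(future_terms) if future_terms else 0.0
    return history_loss + future_loss


# Toy data with made-up values: 2 history steps + 1 future step, 2 objects, 1-dim slots.
target = [[[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]]
pred = [[[9.0], [9.0]], [[9.0], [2.0]], [[0.0], [3.0]]]

loss = joint_loss(pred, target, num_history=2, masked_objects={1})
```

In the toy data, predictions at visible history positions are deliberately wrong (9.0) yet do not affect the loss: only the hidden slot of object 1 at step 1 and the two future slots contribute.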

Theorems & Definitions (8)

  • Remark 1: Causal Interpretation of Object-Level Masking
  • Definition 1: Influence Neighborhood under Masked Completion
  • Theorem 1: Interaction Necessity under Masked History Completion
  • Corollary 1: Discovery of Intervention-Stable Influence Neighborhoods
  • Remark 2
  • Remark 3: Transfer of Bidirectional Training to Forward Prediction
  • Lemma 1: Latent Intervention via Object-Level Masking
  • Proof