Causal-JEPA: Learning World Models through Object-Level Latent Interventions
Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero
TL;DR
Causal-JEPA (C-JEPA) is an object-centric world model that uses object-level masking as a latent intervention, so that predictions must account for interactions between objects. Operating on a compact set of object slots with a joint-embedding predictor, it improves counterfactual reasoning in visual question answering by about 20% absolute over the same architecture without object-level masking. It also makes planning with model predictive control (MPC) substantially more efficient, using only about 1% of the latent input features required by patch-based world models while achieving comparable performance. The authors give a theoretical account of how object-level masking induces a causal inductive bias via intervention-stable influence neighborhoods, and they validate the approach on CLEVRER and Push-T with comprehensive ablations. Overall, C-JEPA combines JEPA-style latent prediction with object-level interventions to yield efficient, interaction-aware world models for reasoning and control.
Abstract
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
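To make the idea of object-level masking concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the (time, objects, features) layout, the choice to mask a slot over every time step in the window, and the use of a fixed mask token are all assumptions made for illustration. The only point it tries to show is that a whole object's latent state is hidden and the loss is computed on the hidden objects alone, so a predictor can only recover them from the remaining objects.

```python
import numpy as np


def mask_object_slots(slots, num_masked, mask_token, rng):
    """Hide whole objects by overwriting their slots with a mask token.

    slots:      array of shape (T, N, D) = (time steps, object slots, features).
    num_masked: number of distinct object slots to hide.
    mask_token: array of shape (D,), a hypothetical placeholder embedding.
    rng:        numpy.random.Generator used to pick the hidden objects.

    Assumption: a chosen object is hidden at every time step in the window.
    Returns the masked copy and the indices of the hidden objects.
    """
    _, num_objects, _ = slots.shape
    masked_ids = rng.choice(num_objects, size=num_masked, replace=False)
    masked = slots.copy()
    masked[:, masked_ids, :] = mask_token  # broadcasts over time and objects
    return masked, masked_ids


def masked_prediction_loss(pred, target, masked_ids):
    """Mean squared error restricted to the hidden objects' latent states."""
    diff = pred[:, masked_ids, :] - target[:, masked_ids, :]
    return float(np.mean(diff ** 2))


# Toy usage: 4 time steps, 5 object slots, 8-dimensional slot embeddings.
rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 5, 8))
mask_token = np.zeros(8)
masked_slots, hidden = mask_object_slots(slots, 2, mask_token, rng)

# A real predictor would map masked_slots to predicted slots; here the masked
# input stands in for the prediction, just to exercise the loss.
loss = masked_prediction_loss(masked_slots, slots, hidden)
```

Because the loss ignores visible objects, copying the input cannot reduce it, which is one way to read the abstract's claim that object-level masking prevents shortcut solutions.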
