Learning Robust Intervention Representations with Delta Embeddings
Panagiotis Alimisis, Christos Diou
TL;DR
The paper addresses robust causal understanding of interventions in high-dimensional visual scenes under distribution shifts by introducing Causal Delta Embeddings (CDE). CDE represents interventions as latent delta vectors $\delta_a = \phi(\tilde{x}) - \phi(x)$, enforcing independence, sparsity, and invariance, and optimizes a multi-objective loss $\mathcal{L}_{total} = \mathcal{L}_{CE} + \alpha_{contrast}\mathcal{L}_{contrast} + \alpha_{sparsity}\mathcal{L}_{sparsity}$. It presents two architectures (global and patch-wise) built on Vision Transformers, achieving state-of-the-art OOD generalization on the Causal Triplet benchmark and revealing anti-parallel relationships between opposing actions in the delta space without supervision. The approach has practical impact for robust action reasoning in robotics and real-world vision, with future work extending to video and multi-step interventions.
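As a rough illustration of the two formulas above, the following NumPy sketch computes delta embeddings and combines the three loss terms. It is not the paper's implementation. The linear toy encoder `phi`, the linear classifier `head`, the L1 form of the sparsity term, the supervised-contrastive form of the contrastive term, and the default weights are all assumptions made for this example, since the summary does not specify them.

```python
import numpy as np

def delta_embedding(phi, x, x_tilde):
    """delta_a = phi(x_tilde) - phi(x), computed row-wise for a batch."""
    return phi(x_tilde) - phi(x)

def cross_entropy(logits, labels):
    # Numerically stable log-softmax followed by the mean negative log-likelihood.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def sparsity_loss(delta):
    # Assumption: mean L1 norm of the delta vectors.
    return np.abs(delta).sum(axis=1).mean()

def contrastive_loss(delta, labels, temperature=0.1):
    # Assumption: a supervised contrastive loss on cosine similarities that
    # pulls together deltas sharing the same action label.
    z = delta / (np.linalg.norm(delta, axis=1, keepdims=True) + 1e-8)
    not_self = ~np.eye(len(labels), dtype=bool)
    # Mask self-similarity with a large negative value so it drops out of the softmax.
    sim = np.where(not_self, z @ z.T / temperature, -1e9)
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    has_pos = positives.any(axis=1)
    if not has_pos.any():
        return 0.0
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_sum[has_pos] / positives.sum(axis=1)[has_pos]).mean())

def total_loss(logits, delta, labels, alpha_contrast=1.0, alpha_sparsity=0.1):
    # L_total = L_CE + alpha_contrast * L_contrast + alpha_sparsity * L_sparsity
    return (cross_entropy(logits, labels)
            + alpha_contrast * contrastive_loss(delta, labels)
            + alpha_sparsity * sparsity_loss(delta))

# Toy data: four different "scenes", two opposing actions (labels 0 and 1).
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))            # hypothetical linear encoder weights
phi = lambda x: x @ W                  # toy stand-in for the ViT encoder
effect = rng.normal(size=6)            # input-space change caused by action 0
labels = np.array([0, 1, 0, 1])        # action 1 undoes action 0
signs = np.where(labels == 0, 1.0, -1.0)[:, None]
x = rng.normal(size=(4, 6))
x_tilde = x + signs * effect

delta = delta_embedding(phi, x, x_tilde)   # shape (4, 4)
head = rng.normal(size=(4, 2))             # hypothetical linear action classifier
loss = total_loss(delta @ head, delta, labels)

# Cosine similarity between the deltas of two opposing actions.
cos = delta[0] @ delta[1] / (np.linalg.norm(delta[0]) * np.linalg.norm(delta[1]))
print(f"total loss = {loss:.3f}, cos(delta_0, delta_1) = {cos:.3f}")
```

Because the toy encoder is linear, the deltas here are scene-invariant by construction and the two opposing actions come out exactly anti-parallel (cosine close to -1). With a learned nonlinear encoder these properties are not automatic, which is why the summary describes them as enforced through training or observed empirically.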
Abstract
Causal representation learning has attracted significant research interest in recent years as a means of improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature) have the property that only the variables corresponding to scene elements affected by the intervention or action change between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have addressed representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs without any additional supervision. Experiments on the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance on both synthetic and real-world benchmarks.
