Learning Robust Intervention Representations with Delta Embeddings

Panagiotis Alimisis, Christos Diou

TL;DR

The paper addresses robust causal understanding of interventions in high-dimensional visual scenes under distribution shifts by introducing Causal Delta Embeddings (CDE). CDE represents interventions as latent delta vectors $\delta_a = \phi(\tilde{x}) - \phi(x)$, enforcing independence, sparsity, and invariance, and optimizes a multi-objective loss $\mathcal{L}_{total} = \mathcal{L}_{CE} + \alpha_{contrast}\mathcal{L}_{contrast} + \alpha_{sparsity}\mathcal{L}_{sparsity}$. It presents two architectures (global and patch-wise) built on Vision Transformers, achieving state-of-the-art OOD generalization on the Causal Triplet benchmark and revealing anti-parallel relationships between opposing actions in the delta space without supervision. The approach has practical impact for robust action reasoning in robotics and real-world vision, with future work extending to video and multi-step interventions.
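
The delta vector and the three-term objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the summary gives only the overall form of $\mathcal{L}_{total}$, so the mean L1 penalty used for $\mathcal{L}_{sparsity}$, the supervised-contrastive-style term used for $\mathcal{L}_{contrast}$, the temperature, and the default weights are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def delta_embedding(phi_x, phi_x_tilde):
    """delta_a = phi(x_tilde) - phi(x), computed row-wise for a batch."""
    return phi_x_tilde - phi_x


def cross_entropy(logits, labels):
    """Softmax cross-entropy averaged over the batch (numerically stable)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def sparsity_loss(delta):
    """ASSUMPTION: mean L1 norm of the deltas; the source does not give the form."""
    return float(np.abs(delta).sum(axis=1).mean())


def contrastive_loss(delta, labels, temperature=0.1):
    """ASSUMPTION: supervised-contrastive-style loss on L2-normalised deltas.

    Deltas that share an action label are treated as positives for each other.
    """
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive in the batch
    if not valid.any():
        return 0.0
    z = delta / (np.linalg.norm(delta, axis=1, keepdims=True) + 1e-8)
    sim = np.where(not_self, z @ z.T / temperature, -np.inf)
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


def total_loss(logits, delta, labels, alpha_contrast=1.0, alpha_sparsity=0.1):
    """L_total = L_CE + alpha_contrast * L_contrast + alpha_sparsity * L_sparsity.

    The default weights are placeholders, not values from the paper.
    """
    labels = np.asarray(labels)
    return (
        cross_entropy(logits, labels)
        + alpha_contrast * contrastive_loss(delta, labels)
        + alpha_sparsity * sparsity_loss(delta)
    )


# Toy usage: random vectors stand in for encoder outputs phi(x) and phi(x_tilde).
rng = np.random.default_rng(0)
phi_before = rng.normal(size=(6, 8))
phi_after = rng.normal(size=(6, 8))
actions = np.array([0, 0, 1, 1, 2, 2])
deltas = delta_embedding(phi_before, phi_after)
action_logits = deltas @ rng.normal(size=(8, 3))  # stand-in linear action classifier
print(total_loss(action_logits, deltas, actions))
```

Under these assumptions the classifier sees only the delta, never the raw scene features, which is one simple way to encourage an intervention representation that does not depend on the scene.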

Abstract

Causal representation learning has attracted significant research interest in the past few years as a means of improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature) have the property that only variables corresponding to scene elements affected by the intervention/action change between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs, without any additional supervision. Experiments on the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

Paper Structure

This paper contains 41 sections, 11 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Visualizing Causal Delta Embeddings. Unlike a baseline model that produces entangled representations of action vectors (left), our model learns object-invariant action representations (right) that generalize well to out-of-distribution samples. The model is trained on intervention pairs like those shown at the bottom.
  • Figure 2: Causal graph for a pair of observations $(x, \tilde{x})$ before and after an action $a$, proposed by Liu et al. (2023). The data-generating process is described by a set of latent factors, including global scene-level factors $z_s$ and local object-level factors $z_n^k$, which are dependent due to confounders $c$. The action is assumed to influence only a few object-level causal factors $z_a$ in the scene, and the effect of that influence is captured by $\tilde{z}_a$. The red dashed line indicates the structural equation assumed by our CDE approach.
  • Figure 3: Model architecture. Model A (top) computes a global causal delta from CLS tokens. Model B (bottom) computes patch-wise deltas, aggregated to a causal delta. Both feed into a common action classifier.
  • Figure 4: Compositional Distribution Shift in the ProcThor dataset. Blue boxes indicate IID data, while red boxes indicate novel OOD action-object combinations.
  • Figure 5: Systematic Distribution Shift in the ProcThor dataset. Blue boxes indicate IID data, while red boxes indicate novel OOD objects that the model has not encountered during training.
  • ...and 5 more figures
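
The two variants named in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's architecture: it assumes each image has already been encoded into a token array of shape (1 + num_patches, dim) with the CLS token at index 0, and it uses mean pooling to aggregate patch-wise deltas, which the caption does not specify. Both function names are hypothetical.

```python
import numpy as np


def global_delta(tokens_before, tokens_after):
    """Model A sketch: difference of the two CLS tokens (assumed at index 0)."""
    return tokens_after[0] - tokens_before[0]


def patchwise_delta(tokens_before, tokens_after):
    """Model B sketch: per-patch deltas, then aggregation.

    ASSUMPTION: mean pooling over patches; the source only says the
    patch-wise deltas are "aggregated to a causal delta".
    """
    patch_deltas = tokens_after[1:] - tokens_before[1:]
    return patch_deltas.mean(axis=0)


# Toy usage: 1 CLS token + 4 patch tokens, embedding dimension 3.
before = np.zeros((5, 3))
after = np.ones((5, 3))
print(global_delta(before, after).shape, patchwise_delta(before, after).shape)
```

Either function returns a single vector of size dim, so both variants can feed the same downstream action classifier, as the caption describes.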

Theorems & Definitions (2)

  • Definition 1: Delta Embedding
  • Definition 2: Causal Delta Embedding