Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

TL;DR

Causal-Adapter provides a modular approach to faithful counterfactual image generation by injecting an explicit structural causal model into a frozen diffusion backbone through a trainable adapter. It introduces Prompt-Aligned Injection (PAI) and a Conditioned Token Contrastive (CTC) loss to align causal attributes with textual embeddings and disentangle attribute factors, enabling precise, identity-preserving edits across synthetic and real-world data. The method achieves state-of-the-art performance on Pendulum, CelebA, and ADNI, substantially improving intervention effectiveness and realism while reducing unintended changes, as demonstrated by comprehensive ablations. This work offers scalable, generalizable counterfactual editing that leverages abduction–action–prediction in diffusion models, enhancing applicability in critical domains such as medical imaging and biometric editing while addressing safety and reproducibility considerations.

Abstract

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.
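
The abduction, action, prediction recipe behind counterfactual generation can be illustrated on a toy structural causal model. The sketch below is not the paper's implementation: it assumes a hypothetical two-attribute graph (age causes baldness) with a made-up linear mechanism, and the names `W_AGE_TO_BALD`, `abduct`, `predict`, and `counterfactual` are invented for illustration. In Causal-Adapter the abducted exogenous term is a diffusion noise latent rather than a scalar, but the three steps play the same roles.

```python
# Toy counterfactual inference on a hypothetical SCM: age -> bald.
# Assumed (not from the paper) linear mechanism:
#     bald = W_AGE_TO_BALD * age + u_bald
W_AGE_TO_BALD = 0.5


def abduct(observed):
    """Abduction: recover the exogenous noise that explains the observation."""
    u_age = observed["age"]  # age is a root node, so its noise is its value
    u_bald = observed["bald"] - W_AGE_TO_BALD * observed["age"]
    return {"age": u_age, "bald": u_bald}


def predict(noise, interventions):
    """Action + prediction: apply do(...) and recompute descendants."""
    # An intervened variable ignores its mechanism and takes the forced value.
    age = interventions.get("age", noise["age"])
    if "bald" in interventions:
        bald = interventions["bald"]
    else:
        bald = W_AGE_TO_BALD * age + noise["bald"]
    return {"age": age, "bald": bald}


def counterfactual(observed, interventions):
    """Run abduction, then action and prediction with the same noise."""
    return predict(abduct(observed), interventions)


observed = {"age": 0.25, "bald": 0.5}

# Intervening on the parent propagates to the child: bald changes too.
print(counterfactual(observed, {"age": 0.75}))  # {'age': 0.75, 'bald': 0.75}

# Intervening on the child leaves the parent untouched.
print(counterfactual(observed, {"bald": 1.0}))  # {'age': 0.25, 'bald': 1.0}
```

Reusing the abducted noise is what keeps the counterfactual tied to the original sample: with no intervention, `counterfactual(observed, {})` reproduces the observation exactly, which is the scalar analogue of identity preservation.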

Paper Structure

This paper contains 45 sections, 9 equations, 29 figures, 8 tables, 4 algorithms.

Figures (29)

  • Figure 1: Non-causal editing modifies only the target attribute (e.g. age, gender); causal editing propagates changes to related attributes (e.g. beard, baldness) enforced by the causal graph.
  • Figure 2: A schematic comparison of counterfactual image generation methods based on: (a) VAE or GAN, which fail to achieve high-fidelity results. (b) Diffusion SCM and (c) Diffusion autoencoder, which are sensitive to spurious correlations. (d) T2I-based editing, which requires heavy prompt engineering. (e) Vanilla Causal-Adapter, which injects causal attributes into the image embedding. (f) Causal-Adapter with attribute regularization, which injects causal attributes into learnable textual embeddings with contrastive optimization. A detailed discussion is presented in the Related Works appendix.
  • Figure 3: Motivational study and preliminary counterfactual generation results between T2I methods and Causal-Adapter. (a) Fine-grained anatomical counterfactual editing of brain ventricular volume using inversion-based editing (NTI; Mokady et al., 2023), multi-concept prompt-learning editing (MCPL; Jin et al., 2024), and our approach. (b) Comparison of counterfactual editing results on human faces. (c) Averaged cross-attention maps from the base Causal-Adapter and the Causal-Adapter with regularizers. Full results and technical details are presented in the Full Motivational Study Results appendix.
  • Figure 4: Method overview. A counterfactual prompt and input image $x$ are fed into a pretrained text-to-image diffusion model with a learnable Causal-Adapter $\ddot{\epsilon}_\psi$. Causal mechanisms, modeled over a known causal graph and attributes $\{y_i\}$, are injected into token embeddings via Prompt-Aligned Injection (PAI) to align semantic and spatial features. The adapter $\ddot{\epsilon}_\psi$ operates alongside the frozen diffusion U-Net $\epsilon_\theta$ and is optimized with an MSE loss $\mathcal{L}_{\text{DM}}$ and a Conditioned Token Contrastive (CTC) loss $\mathcal{L}_{\text{CTC}}$ to enforce disentanglement. At inference, interventions on $y_i$ update token embeddings, and the counterfactual $\bar{x}$ is generated using the abducted exogenous noise $z^{\star}_{t}$. Optionally, Attention Guidance (AG) updates the cross-attention map of intervened tokens (e.g. age, beard, bald) to achieve localized editing and preserve the identity of non-intervened attributes (e.g. human, gender).
  • Figure 5: Pendulum counterfactuals with traversal editing along each attribute.
  • ...and 24 more figures