Doubly Abductive Counterfactual Inference for Text-based Image Editing

Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, Yu-Gang Jiang

TL;DR

This work reframes text-based image editing (TBIE) as a counterfactual inference task, which formalizes the requirement that an edit toward a prompt should change the image as little as possible. It introduces the Doubly Abductive Counterfactual (DAC) framework, which decouples image content (U) and semantic change (Delta) via Abduction-1 and Abduction-2, implemented with a UNet LoRA and a CLIP LoRA, respectively; inverting the latter (Delta' = -Delta) then produces the edit through DDIM sampling. Empirical results show that DAC achieves a superior trade-off between editability and fidelity across diverse edits, outperforming or matching state-of-the-art methods on qualitative and quantitative metrics, and a user study supports these findings. The approach advances TBIE by providing a formal, efficient, and versatile editing paradigm with potential extensions to faster diffusion and example-based editing.

Abstract

We study text-based image editing (TBIE) of a single image by counterfactual inference because it is an elegant formulation that precisely addresses the requirement that the edited image should retain the fidelity of the original one. Through the lens of this formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of single-image fine-tuning. To this end, we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA, whose abduction can encode all the image details. Second, we abduct another exogenous variable parameterized by a text encoder LoRA, which recovers the editability lost to the overfitted first abduction. Because the second abduction exclusively encodes the visual transition from post-edit to pre-edit, its inversion (subtracting the LoRA) effectively reverts pre-edit back to post-edit, thereby accomplishing the edit. Through extensive experiments, our DAC achieves a good trade-off between editability and fidelity. Thus, we can support a wide spectrum of user editing intents, including addition, removal, manipulation, replacement, style transfer, and facial change, which are extensively validated in both qualitative and quantitative evaluations. Code is available at https://github.com/xuesong39/DAC.
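The "subtracting the LoRA" step can be illustrated on a single linear layer. The sketch below is a toy example rather than the authors' implementation: the names `apply_lora`, `matmul`, `matvec`, and `rand_matrix` are hypothetical helpers, the matrices are random, and the low-rank update is assumed to take the usual LoRA form `B @ A` added to a frozen weight `W`. It shows that adding the update and subtracting it shift the layer's output in exactly opposite directions.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def apply_lora(W, A, B, sign=1.0):
    """Return W + sign * (B @ A); sign=-1.0 subtracts the low-rank update."""
    delta = matmul(B, A)
    return [[w + sign * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix, standing in for trained weights (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_out, d_in, rank = 4, 3, 2
W = rand_matrix(d_out, d_in, rng)   # frozen base weight
A = rand_matrix(rank, d_in, rng)    # LoRA down-projection
B = rand_matrix(d_out, rank, rng)   # LoRA up-projection
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

base = matvec(W, x)
out_added = matvec(apply_lora(W, A, B, sign=+1.0), x)
out_subtracted = matvec(apply_lora(W, A, B, sign=-1.0), x)

# Output shift caused by adding vs. subtracting the same low-rank update.
shift_added = [o - b for o, b in zip(out_added, base)]
shift_subtracted = [o - b for o, b in zip(out_subtracted, base)]
```

For one linear layer the two shifts are exact negatives of each other. In a full text encoder with nonlinearities this mirror relation holds only approximately, so the sketch conveys the intuition behind the inversion, not a guarantee about the edited image.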

Paper Structure

This paper contains 17 sections, 9 equations, 21 figures, and 2 tables.

Figures (21)

  • Figure 1: Illustration of the TBIE task. (a): source image $I$. (b) and (c): edited images according to the target prompt "a castle covered by snow". TBIE considers (b) to be better than (c).
  • Figure 2: Counterfactual inference framework for TBIE.
  • Figure 3: The editability of counterfactual $I' = G(P', U)$ decreases when the abductive iteration of $\arg\min_{U} \|G(P, U)- I\|$ increases.
  • Figure 4: The proposed Doubly Abductive Counterfactual inference framework (DAC).
  • Figure 5: Qualitative comparison of TBIE examples across the 6 editing types (only prompt $P'$ shown) between our DAC and three state-of-the-art methods with a similar design philosophy (Table \ref{tab:editing_related_work}). For fairness, examples are chosen based on their best visual quality from various random seeds. See Section \ref{sec:exp-1} for analysis and the Appendix for the example selection details.
  • ...and 16 more figures