SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning

Tsu-Jui Fu, Xin Eric Wang, Scott Grafton, Miguel Eckstein, William Yang Wang

TL;DR

This paper tackles data scarcity in iterative language-based image editing (ILBIE) by introducing Self-Supervised Counterfactual Reasoning (SSCR). SSCR enables the editor to anticipate outcomes under counterfactual instructions and uses Cross-Task Consistency (CTC) with an Iterative Explainer to provide explicit token-level supervision without ground-truth counterfactual images. An Iterative Editor, guided by instruction histories, is trained with GAN losses, while an Instruction Intervention pipeline generates diverse counterfactual instructions. On i-CLEVR and CoDraw, SSCR achieves state-of-the-art results in object identity and position and is data-efficient: with only 50% of the training data, it performs comparably to training on the full data. The approach is model-agnostic and can be integrated with different editors for real-world language-based image editing.

Abstract

Iterative Language-Based Image Editing (ILBIE) tasks follow iterative instructions to edit images step by step. Data scarcity is a significant issue for ILBIE because it is challenging to collect large-scale examples of images before and after instruction-based changes. However, humans still accomplish these editing tasks even when presented with an unfamiliar image-instruction pair. This ability results from counterfactual thinking, the ability to think about alternatives to events that have already happened. In this paper, we introduce a Self-Supervised Counterfactual Reasoning (SSCR) framework that incorporates counterfactual thinking to overcome data scarcity. SSCR allows the model to consider out-of-distribution instructions paired with previous images. With the help of cross-task consistency (CTC), we train on these counterfactual instructions in a self-supervised manner. Extensive results show that SSCR improves the correctness of ILBIE in terms of both object identity and position, establishing a new state of the art (SOTA) on two ILBIE datasets (i-CLEVR and CoDraw). Even with only 50% of the training data, SSCR achieves results comparable to using the complete data.

Paper Structure

This paper contains 25 sections, 16 equations, 8 figures, 5 tables, 2 algorithms.

Figures (8)

  • Figure 1: An example of the iterative language-based image editing (ILBIE) task. During each turn, the model edits the image from the previous turn based on the current instruction. After iterative editing, the desired image is obtained. Note that the generation is at the pixel level.
  • Figure 2: An overview of our self-supervised counterfactual reasoning (SSCR). The iterative editor modifies an image based on the current instruction and the editing history. Counterfactual reasoning allows the model to consider various counterfactual instructions, which can improve generalizability and mitigate data scarcity. Since there are no ground-truth images for these instructions, we propose cross-task consistency (CTC), which both provides an explicit training signal and allows training on the counterfactual instructions in a self-supervised manner.
  • Figure 3: The architecture of our iterative explainer. It takes the pair of previous and resulting images and the encoded instruction history as input, and reconstructs the editing instruction with an attention-based GRU decoder.
  • Figure 4: Result comparison among the baseline, the model with only cross-task consistency (CTC only), and the model with full self-supervised counterfactual reasoning (SSCR) under different ratios of training data. Note that the iterative explainer is also pre-trained using the same available data for each result.
  • Figure 5: The learning curves of the training losses provided by the discriminator ($L_G$) and by our iterative explainer ($L_E$) on i-CLEVR.
  • ...and 3 more figures
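
The token-level supervision that the captions of Figures 2 and 3 attribute to the iterative explainer can be illustrated with a minimal sketch. It assumes the explainer outputs one probability distribution over the vocabulary per decoding step and that the consistency loss is the mean negative log-likelihood of the instruction tokens; the exact form of $L_E$ is not given in this section, so the function name `token_level_consistency_loss`, its arguments, and the toy vocabulary are hypothetical.

```python
import math

def token_level_consistency_loss(token_probs, instruction_ids, eps=1e-12):
    """Mean negative log-likelihood of the instruction tokens.

    token_probs: list of per-step probability distributions (lists of floats),
        standing in for the output of a hypothetical explainer that looks at the
        previous image and the edited image.
    instruction_ids: token ids of the instruction the editor was asked to follow.
    Assumption: this is an illustrative loss, not the paper's exact L_E.
    """
    if not instruction_ids or len(token_probs) != len(instruction_ids):
        raise ValueError("need one non-empty distribution per instruction token")
    total = 0.0
    for probs, token_id in zip(token_probs, instruction_ids):
        # Clamp to eps so a zero probability does not produce log(0).
        total += -math.log(max(probs[token_id], eps))
    return total / len(instruction_ids)

# Toy vocabulary of 4 tokens; the instruction is the id sequence [2, 0].
confident = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
uniform = [[0.25] * 4, [0.25] * 4]
loss_confident = token_level_consistency_loss(confident, [2, 0])
loss_uniform = token_level_consistency_loss(uniform, [2, 0])
```

In this sketch, an edited image from which the explainer can recover every instruction token yields a loss near zero, while an uninformative edit yields a loss of about $\log 4$ for the four-token vocabulary. Because the target is the instruction itself, no ground-truth edited image is needed, which is the property the Figure 2 caption relies on for counterfactual instructions.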