Instilling Multi-round Thinking to Text-guided Image Generation

Lidong Zeng; Zhedong Zheng; Yinwei Wei; Tat-seng Chua

TL;DR

This work addresses a gap in text-guided image editing: single-round generation often misses fine-grained details, and the resulting errors compound across multi-round interactions. It introduces a self-supervised multi-round regularization that enforces order-invariant consistency within a diffusion-based framework, leveraging error amplification to improve local edit fidelity. The method integrates with existing models via the total loss $\mathcal{L}_{total} = \mathcal{L}_{single} + \mathcal{L}_{recon} + \lambda \mathcal{L}_{multi}$, with a dynamic schedule that shifts emphasis from multi-round to single-round generation during training. Experiments on FashionIQ and Fashion200k show improvements in FID and semantic alignment (CLIP scores and Recall@K) and demonstrate robustness to ill-formed text, indicating stronger generalization to iterative, real-world editing tasks.
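As a rough illustration of how the three loss terms and the dynamic schedule could be combined, the sketch below uses a linear decay of $\lambda$. The summary only says that emphasis shifts from multi-round to single-round generation during training; the linear form, the function names, and the default start and end values are assumptions made for this example, not the authors' settings.

```python
def multi_round_weight(step, total_steps, lam_start=1.0, lam_end=0.0):
    """Hypothetical schedule: decay lambda linearly from lam_start to lam_end.

    The source only states that emphasis moves from the multi-round term
    to the single-round terms over training; the linear shape is assumed.
    """
    if total_steps <= 1:
        return lam_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * frac


def total_loss(l_single, l_recon, l_multi, lam):
    """L_total = L_single + L_recon + lambda * L_multi (as in the TL;DR)."""
    return l_single + l_recon + lam * l_multi


if __name__ == "__main__":
    total_steps = 101
    for step in (0, 50, 100):
        lam = multi_round_weight(step, total_steps)
        # Placeholder loss values, for illustration only.
        print(step, lam, total_loss(1.0, 2.0, 4.0, lam))
```

With this assumed schedule the multi-round term dominates early in training and vanishes at the end, so late updates are driven by the single-round and reconstruction terms alone.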

Abstract

This paper studies the text-guided image editing task: modifying a reference image according to user-specified textual feedback so that it embodies specific attributes. Despite recent advances, a persistent challenge remains: single-round generation often overlooks crucial details, particularly for fine-grained changes such as shoes or sleeves. This issue compounds over multiple rounds of interaction, severely limiting customization quality. To address this challenge, we introduce a new self-supervised regularization, i.e., multi-round regularization, which is compatible with existing methods. Specifically, the multi-round regularization encourages the model to maintain consistency across different modification orders. It builds on the observation that the modification order generally should not affect the final result. Unlike traditional one-round generation, the mechanism underpinning the proposed method is error amplification: initially minor inaccuracies in capturing intricate details grow over successive rounds and thus become easier to penalize. Qualitative and quantitative experiments affirm that the proposed method achieves high-fidelity editing quality, especially for local modifications, in both single-round and multi-round generation, while also generalizing robustly to irregular text inputs. The effectiveness of our semantic alignment with textual feedback is further substantiated by retrieval improvements on FashionIQ and Fashion200k.
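The order-invariance observation can be made concrete with a toy sketch. Here an "edit" is any function that maps a latent vector and a text embedding to a new latent; `additive_edit` and `lossy_edit` are hypothetical stand-ins invented for this example (they are not the paper's diffusion model), and `order_gap` is an illustrative diagnostic rather than the authors' loss.

```python
def apply_rounds(edit, z, texts):
    """Apply the same edit function once per text, feeding each output forward."""
    for t in texts:
        z = edit(z, t)
    return z


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def order_gap(edit, z_ref, t1, t2):
    """Discrepancy between applying (t1, t2) and (t2, t1) to the same reference."""
    z12 = apply_rounds(edit, z_ref, [t1, t2])
    z21 = apply_rounds(edit, z_ref, [t2, t1])
    return mse(z12, z21)


def additive_edit(z, t):
    """Toy order-invariant editor: edits simply add up."""
    return [x + d for x, d in zip(z, t)]


def lossy_edit(z, t):
    """Toy order-sensitive editor: each round partly forgets the previous state."""
    return [0.5 * x + d for x, d in zip(z, t)]


if __name__ == "__main__":
    z_ref, t1, t2 = [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]
    print(order_gap(additive_edit, z_ref, t1, t2))
    print(order_gap(lossy_edit, z_ref, t1, t2))
```

The lossy editor shows how a small per-round error becomes a visible gap after two rounds. A regularizer that supervises only the final multi-round output against the target pushes both orders toward the same result and so penalizes that gap.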

Paper Structure (12 sections, 6 equations, 6 figures, 3 tables)

Introduction
Related work
Method
Multi-round Learning
Single-round Learning
Optimization
Experiment
Datasets and Evaluation Metrics
Implementation Details
Comparison with SOTA
Ablation Studies and Further Analysis
Conclusion

Figures (6)

Figure 1: (a) A typical use case of multi-round interactive editing. The learned model understands the text instruction and the semantic meaning of images, and crafts images based on previous user feedback. This real-world scenario often involves multi-round generation rather than single-round generation. (b) Common failure cases of a prevailing method, ControlNet, i.e., the long-sentence ignorance case and the multi-facet forgetting case. We observe significant visual differences between the generated images and the ground-truth targets. (c) A typical two-round inconsistency case. The final generated results are sensitive to the order of the text guidance.
Figure 2: A schematic overview of our framework. (a) Multi-round Generation: The multi-round regularization is achieved by a skip loss that supervises only the final output $\tilde{z}^y_0$ against the ground truth $z^y_0$. Starting from the encoding of the reference image $x$ into a latent embedding $z^x_0$, we run the complete denoising process twice to obtain $\tilde{z}^y_0$. The information of $x$ is traced by the blue line, while the pink line indicates the flow of text information. (b) Single-round Reconstruction: The objective is to reconstruct the target image $y$ by denoising the perturbed ground truth $y$, given the corresponding text condition $T$ and pose $P$. (c) A brief illustration of Pose-Conditioned Diffusion. Given the input $c$, $P$ and $z_\tau$, the diffusion model generates $\tilde{z}_0$ by iteratively denoising over the time step $\tau$. We adopt an extra LoRA layer, which is concatenated into each attention block of the U-Net decoder.
Figure 3: Comparison between Stable Diffusion (SD), ControlNet, the baseline, and our proposed method in single-round generation. The red dashed boxes indicate areas where the generated results do not match the corresponding text. The baseline shares the same structure and settings as our method except for the multi-round loss $\mathcal{L}_{multi}$. We observe that the baseline sometimes misses keywords, e.g., "ruffles" (the 3rd row), and over-modifies for "casual" (the 4th row), while the proposed method attends to such descriptive words.
Figure 4: Qualitative comparison between Stable Diffusion, ControlNet, our baseline model without the multi-round constraint, and our proposed model. Starting from the reference image, each generation is based on the previous result. We observe that our method generates reasonable results while better preserving the style of the reference image.
Figure 5: Visualization of generation under ill-formed text conditions, corresponding to Table \ref{tab:swap}. Compared with Stable Diffusion, ControlNet and our baseline, the proposed method is robust to different ill-formed texts, e.g., swapping the sentence order, rotating the word order, and masking words. The output is consistent with the corresponding text, e.g., "tighter" and "ruffled".
...and 1 more figure
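The three kinds of ill-formed text named in the Figure 5 caption (swapped sentence order, rotated word order, masked words) can be sketched as simple string perturbations. The exact protocol used in the paper is not given in this summary, so the function names, the sentence separator, the rotation offset, and the `[MASK]` token below are all assumptions for illustration.

```python
def swap_sentences(text, sep=". "):
    """Reverse the order of sentences split on an assumed separator."""
    parts = [p for p in text.split(sep) if p]
    return sep.join(reversed(parts))


def rotate_words(text, k=1):
    """Rotate the word order left by k positions (k is an assumed offset)."""
    words = text.split()
    if not words:
        return text
    k %= len(words)
    return " ".join(words[k:] + words[:k])


def mask_words(text, indices, token="[MASK]"):
    """Replace the words at the given positions with an assumed mask token."""
    words = text.split()
    return " ".join(token if i in indices else w for i, w in enumerate(words))


if __name__ == "__main__":
    feedback = "is tighter. has ruffled sleeves"
    print(swap_sentences(feedback))
    print(rotate_words("is tighter and ruffled", 1))
    print(mask_words("is tighter and ruffled", {1}))
```

Feeding such perturbed feedback to an editing model and comparing the output against the result for the clean text is one way to probe the robustness the caption describes.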
