Twin Co-Adaptive Dialogue for Progressive Image Generation

Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Hongyang He, Wenyu Zhu, Xinhang Yuan, Kuan Lu, Menghao Huo, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang

TL;DR

Twin-Co introduces a dual-path co-adaptive framework that interleaves explicit multi-turn dialogue with internal reflective optimization to progressively align text-to-image outputs with user intent. The Explicit Dialogue Pathway refines prompts through a GPT-4–based summarizer, while the Implicit Optimization Pathway uses D3PO and Attend-and-Excite alongside CLIP-guided ambiguity assessment to steer generation internally, with training anchored by 2,000 supervised image–text pairs from ImageReward. Across general and fashion-generation tasks, Twin-Co achieves stronger prompt–intent and image–intent alignment than the baselines (e.g., T2I CLIP $0.338$, I2I CLIP $0.812$, human voting $33.6\%$) and requires fewer user iterations. The results indicate that combining explicit human-in-the-loop feedback with internal optimization yields faster convergence to high-quality, user-aligned visuals, enabling more efficient interactive image synthesis across domains.

Abstract

Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.

Paper Structure

This paper contains 28 sections, 4 theorems, 95 equations, 14 figures, 3 tables.

Key Result

Theorem E.5

Let $s_t = \sigma(I_t, I^*)$ be the CLIP similarity between the generated image $I_t$ and the target $I^*$ at round $t$. Under the above assumptions: (1) the sequence $\{s_t\}$ is monotonically increasing in probability; (2) $\{s_t\}$ converges in probability to 1, i.e., $\lim_{t \to \infty} P(|s_t - 1| > \epsilon) = 0$ for every $\epsilon > 0$.
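
To make the quantity $s_t$ concrete, the following is a minimal sketch that assumes $\sigma$ is the cosine similarity between image embeddings, which is the usual way CLIP similarity is computed; the summary above does not spell this out. The function names and the two-dimensional toy vectors are hypothetical stand-ins for real CLIP embeddings, and no CLIP model is invoked.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def similarity_trace(
    round_embeddings: Sequence[Sequence[float]],
    target_embedding: Sequence[float],
) -> List[float]:
    """Return s_t for each round t, given one embedding per round."""
    return [cosine_similarity(e, target_embedding) for e in round_embeddings]


# Toy data (assumption): hand-picked vectors that rotate toward the target.
target = [1.0, 0.0]
rounds = [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [1.0, 0.0]]
trace = similarity_trace(rounds, target)

# On this toy data the similarity rises strictly from round to round.
is_increasing = all(a < b for a, b in zip(trace, trace[1:]))
print(trace, is_increasing)
```

The theorem makes a probabilistic claim about real dialogue rounds; the trace above only shows how the per-round similarity would be recorded and checked for monotonicity.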

Figures (14)

  • Figure 1: Comparison between conventional multi-turn image generation and our proposed Twin-Co framework. Traditional approaches rely solely on iterative prompt refinement and often require more dialogue rounds to align with user intent. In contrast, Twin-Co integrates both explicit user feedback and implicit reflection mechanisms.
  • Figure 2: Overview of the Twin-Co training framework. Top: In the Explicit Dialogue Pathway (Round 1), the summarizer generates a prompt from dialogue history, which is used by the generative model to produce candidate images. The user provides feedback indicating preference. Middle: The Implicit Optimization Pathway leverages preference pairs for D3PO-based reward optimization and uses a captioner to extract semantic concepts from generated images. An ambiguity score is computed to determine whether clarification is needed, and Attend-and-Excite (A&E) is applied to reactivate under-attended prompt tokens. Bottom: In Round 2, the summarizer incorporates the user's clarification (e.g., "riding a bike at sunset") into the prompt, guiding the model toward more accurate image generation. This Twin-Co process is iteratively repeated over multi-turn dialogues to progressively align with user intent.
  • Figure 3: Comparison of cherry blossom tea images generated across four dialogue rounds by various models.
  • Figure 4: t-SNE visualization of prompt embeddings across three dialogue rounds.
  • Figure 5: Heatmap showing user perception of intent capture across dialogue rounds. The intensity peaks around the third round.
  • ...and 9 more figures
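
The round structure described in the Figure 2 caption can be sketched as a small control loop. This is an illustrative outline only: `summarize`, `generate`, `ambiguity_score`, and `ask_user` are hypothetical callables standing in for the summarizer, the generative model, the ambiguity assessment, and the user; the threshold value and the convention that a higher score means more ambiguity are assumptions. The D3PO reward optimization and the Attend-and-Excite step are deliberately left out.

```python
from typing import Callable, List


def run_dialogue(
    first_message: str,
    summarize: Callable[[List[str]], str],
    generate: Callable[[str], str],
    ambiguity_score: Callable[[str, str], float],
    ask_user: Callable[[str], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    max_rounds: int = 4,
) -> List[str]:
    """Return the prompt used in each round of the dialogue."""
    history = [first_message]
    prompts: List[str] = []
    for _ in range(max_rounds):
        prompt = summarize(history)      # prompt built from dialogue history
        prompts.append(prompt)
        image = generate(prompt)         # candidate image for this round
        if ambiguity_score(prompt, image) <= threshold:
            break                        # clear enough, no clarification needed
        history.append(ask_user(image))  # fold the clarification into history
    return prompts


# Stub components (assumptions) so the loop can run without any model.
demo_prompts = run_dialogue(
    "a person",
    summarize=lambda history: "; ".join(history),
    generate=lambda prompt: f"image({prompt})",
    ambiguity_score=lambda prompt, image: 0.0 if "sunset" in prompt else 1.0,
    ask_user=lambda image: "riding a bike at sunset",
)
print(demo_prompts)
```

With these stubs the first round is judged ambiguous, the clarification is appended, and the second round stops the loop.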

Theorems & Definitions (9)

  • Definition E.1
  • Theorem E.5: Monotonicity and Convergence in Probability
  • proof
  • Lemma E.6: Lipschitz Property of Sigmoid Function
  • proof
  • Theorem E.9: Generalization Bound for DPO Preference Optimization
  • proof
  • Theorem E.13: Effectiveness of Gradient Correction in Attend-and-Excite
  • proof
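
The statement of Lemma E.6 is not reproduced in this summary. As background, a standard fact is that the logistic sigmoid $g(x) = 1/(1+e^{-x})$ is $\tfrac{1}{4}$-Lipschitz, since $g'(x) = g(x)(1-g(x)) \le \tfrac{1}{4}$. Whether the lemma uses exactly this constant is an assumption. The sketch below checks the bound numerically on a grid; the helper names and the grid are illustrative choices.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def max_slope(points: Sequence[float]) -> float:
    """Largest |g(a) - g(b)| / |a - b| over all pairs of distinct points."""
    worst = 0.0
    for i, a in enumerate(points):
        for b in points[i + 1:]:
            slope = abs(sigmoid(a) - sigmoid(b)) / abs(a - b)
            worst = max(worst, slope)
    return worst


# Grid from -8.0 to 8.0 in steps of 0.1 (an arbitrary illustrative range).
grid = [i / 10.0 for i in range(-80, 81)]
worst_slope = max_slope(grid)

# The empirical slope stays below the Lipschitz constant 1/4.
print(worst_slope, worst_slope <= 0.25)
```

A numerical check like this does not replace the proof; it only confirms that no pair on the grid violates the bound.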