Twin Co-Adaptive Dialogue for Progressive Image Generation
Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Hongyang He, Wenyu Zhu, Xinhang Yuan, Kuan Lu, Menghao Huo, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang
TL;DR
Twin-Co introduces a dual-path co-adaptive framework that interleaves explicit multi-turn dialogue with internal reflective optimization to progressively align text-to-image outputs with user intent. The Explicit Dialogue Pathway refines prompts through a GPT-4–based summarizer, while the Implicit Optimization Pathway uses D3PO and Attend-and-Excite, together with CLIP-guided ambiguity assessment, to steer generation internally; training is anchored by 2,000 supervised image–text pairs from ImageReward. Across general and fashion-generation tasks, Twin-Co achieves better prompt–intent and image–intent alignment than the baselines (e.g., T2I CLIP $0.338$, I2I CLIP $0.812$, human voting $33.6\%$) while requiring fewer user iterations. The work demonstrates that combining explicit human-in-the-loop feedback with internal optimization yields faster convergence to high-quality, user-aligned visuals, enabling more intuitive and efficient interactive image synthesis across domains.
Abstract
Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.
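The iterative workflow described above (generate a base image, collect feedback, refine, regenerate) can be illustrated with a minimal sketch. This is not the Twin-Co implementation: the names `DialogueState`, `refine_prompt`, and `co_adaptive_loop` are hypothetical, the image generator and the feedback source are passed in as plain callables, and prompt refinement is reduced to simple string concatenation purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class DialogueState:
    """Hypothetical container for the current prompt and the feedback so far."""
    prompt: str
    history: List[str] = field(default_factory=list)


def refine_prompt(state: DialogueState, feedback: str) -> str:
    """Fold one piece of user feedback into the prompt.

    Assumption: a real system would use a learned or LLM-based refiner;
    here the feedback is simply appended to keep the sketch runnable.
    """
    state.history.append(feedback)
    return f"{state.prompt}, {feedback}"


def co_adaptive_loop(
    prompt: str,
    generate: Callable[[str], str],
    get_feedback: Callable[[str], Optional[str]],
    max_turns: int = 5,
) -> Tuple[str, DialogueState]:
    """Generate a base image, then refine it turn by turn.

    `generate` maps a prompt to an image (a string placeholder here).
    `get_feedback` returns the user's next remark, or None once satisfied.
    """
    state = DialogueState(prompt)
    image = generate(state.prompt)          # base image from the initial prompt
    for _ in range(max_turns):
        feedback = get_feedback(image)
        if feedback is None:                # user is satisfied: stop early
            break
        state.prompt = refine_prompt(state, feedback)
        image = generate(state.prompt)      # regenerate from the refined prompt
    return image, state
```

In this sketch the loop stops either when the user has no further feedback or when `max_turns` is reached, so the number of entries in `state.history` is the number of refinement turns the user needed.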
