Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System

Yuheng Feng; Yangfan He; Yinghui Xia; Tianyu Shi; Jun Wang; Jinsong Yang

TL;DR

The paper tackles prompt ambiguity in text-to-image generation by introducing Reflective Human-Machine Co-adaptation (RHM-CAS), a dialogue-enabled agent that combines external verbal reflection with internal preference optimization. Externally, the system generates images from prompts formed by a Summarizer, analyzes the outputs with a vision-language evaluator, and asks targeted questions to resolve ambiguities. One round is formalized as $P_t = M_S(w_t, h)$, $I_t = M_G(P_t)$, $C_t = M_E(I_t)$, $r_t = M_{inf}(C_t, P_t)$, and $q_{t+1} = M_A(C_t, r_t)$. Internally, it applies Direct Preference Optimization (DPO) via D3PO, together with Attend-and-Excite (A&E), to learn from user feedback and to address content the generator neglects. The loss is $\mathcal{L}(\theta) = -\mathbb{E}\left[\log \rho\left(\beta \log \frac{\pi_\theta(a^w|s^w)}{\pi_{ref}(a^w|s^w)} - \beta \log \frac{\pi_\theta(a^l|s^l)}{\pi_{ref}(a^l|s^l)}\right)\right]$, where the superscripts $w$ and $l$ mark the preferred and dispreferred samples and $\beta$ controls how far the policy $\pi_\theta$ may deviate from the reference policy $\pi_{ref}$. The framework is validated on two tasks, general image generation and fashion product creation, where it shows improved alignment to target visuals and higher user satisfaction. The results illustrate the potential of combining external dialogue with internal optimization for user-centric image synthesis, particularly for non-expert users and personalized design workflows.
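The five per-round equations in the TL;DR can be read as a simple loop. The following is a minimal sketch of that control flow only, not the authors' implementation. The class name `ReflectionModels`, the function `run_dialogue`, the toy stand-in models, and the reading of $h$ as the list of earlier user inputs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ReflectionModels:
    """Hypothetical container for the five components named in the TL;DR."""
    summarizer: Callable[[str, List[str]], str]  # M_S(w_t, h)     -> P_t
    generator: Callable[[str], str]              # M_G(P_t)        -> I_t
    evaluator: Callable[[str], str]              # M_E(I_t)        -> C_t
    inferrer: Callable[[str, str], str]          # M_inf(C_t, P_t) -> r_t
    asker: Callable[[str, str], str]             # M_A(C_t, r_t)   -> q_{t+1}


def run_dialogue(models: ReflectionModels,
                 first_input: str,
                 answer_fn: Callable[[str], str],
                 rounds: int) -> List[Tuple[str, str, str]]:
    """Run `rounds` reflection rounds; return (P_t, I_t, q_{t+1}) per round."""
    history: List[str] = []   # assumption: h holds the earlier user inputs
    w_t = first_input
    trace = []
    for _ in range(rounds):
        p_t = models.summarizer(w_t, history)  # P_t = M_S(w_t, h)
        i_t = models.generator(p_t)            # I_t = M_G(P_t)
        c_t = models.evaluator(i_t)            # C_t = M_E(I_t)
        r_t = models.inferrer(c_t, p_t)        # r_t = M_inf(C_t, P_t)
        q_next = models.asker(c_t, r_t)        # q_{t+1} = M_A(C_t, r_t)
        trace.append((p_t, i_t, q_next))
        history.append(w_t)
        w_t = answer_fn(q_next)                # the user's reply becomes w_{t+1}
    return trace


# Toy stand-ins (strings only) so the loop can be exercised without any model.
toy = ReflectionModels(
    summarizer=lambda w, h: ", ".join(h + [w]),
    generator=lambda p: f"<image of {p}>",
    evaluator=lambda i: f"caption for {i}",
    inferrer=lambda c, p: "consistent" if p in c else "mismatch",
    asker=lambda c, r: f"The result looks {r}. What should change?",
)

trace = run_dialogue(toy, "a parrot", lambda q: "make it red", rounds=2)
for prompt, image, question in trace:
    print(prompt, "|", image, "|", question)
```

In a real system each callable would wrap a language model, a diffusion model, or a vision-language model; the sketch only fixes the order in which their outputs feed one another.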

Abstract

Today's image generation systems can produce realistic, high-quality images. However, user prompts often contain ambiguities, which makes it difficult for these systems to interpret users' underlying intentions. Consequently, the system must interact with users over multiple rounds to understand their intent. The unpredictable cost of using, or learning to use, image generation models through repeated feedback hinders their widespread adoption and keeps them from reaching their full performance potential, especially for non-expert users. In this research, we aim to make our image generation system more user-friendly. To this end, we propose a reflective human-machine co-adaptation strategy named RHM-CAS. Externally, the Agent engages in meaningful language interactions with users to reflect on and refine the generated images. Internally, the Agent optimizes its policy based on user preferences so that the final outcomes closely align with them. Experiments on different tasks demonstrate the effectiveness of the proposed method.
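The internal policy optimization uses the DPO-style loss stated in the TL;DR. The sketch below evaluates that loss for a single preference pair, assuming that $\rho$ is the logistic function and that the log-probabilities under the policy and the reference are already available as plain floats. The function names and the example numbers are hypothetical, and nothing here reproduces the D3PO training code.

```python
import math
from typing import Iterable, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(logp_w: float, ref_logp_w: float,
                    logp_l: float, ref_logp_l: float,
                    beta: float) -> float:
    """-log rho(beta * log-ratio(preferred) - beta * log-ratio(dispreferred)).

    logp_w, logp_l:         log pi_theta(a|s) for preferred / dispreferred sample
    ref_logp_w, ref_logp_l: log pi_ref(a|s) for the same samples
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def batch_loss(pairs: Iterable[Tuple[float, float, float, float]],
               beta: float) -> float:
    """Monte Carlo estimate of the expectation: the mean loss over pairs."""
    losses = [preference_loss(*p, beta=beta) for p in pairs]
    return sum(losses) / len(losses)


# When the policy equals the reference, the margin is 0 and the loss is log 2.
baseline = preference_loss(-1.0, -1.0, -2.0, -2.0, beta=0.1)

# Raising the preferred sample's log-probability lowers the loss.
improved = preference_loss(-0.5, -1.0, -2.0, -2.0, beta=0.1)

print(baseline, improved)
```

A larger `beta` scales the margin and so penalizes a given disagreement with the preference more sharply, which is one way to see its role in controlling deviation from the reference policy.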

Paper Structure (24 sections, 6 equations, 11 figures, 3 tables, 3 algorithms)

Introduction
Related work
Proposed method
External Reflection via Verbal Reflection
Internal Reflection via Direct Preference Optimization
Experiment
Task 1 General Image Generation
Setting
Data Collection
Baseline setup
Result Analysis
Task 2 Fashion Product Creation
Setting
Result Analysis
Conclusion
...and 9 more sections

Figures (11)

Figure 1: Proposed framework of Enhanced Text-to-Image Reflexion Agent. The Generation Model can learn user preferences by Direct Preference Optimization.
Figure 2: A comparative display of four rounds of image generation based on specific prompts, including cherry blossom tea, a parrot, a teenage girl, and an Asian temple across different rounds.
Figure 3: Human Voting for Statement: Multi-turn dialogues can approximate the user's potential intents.
Figure 4: This image showcases a diverse collection of fashion models and outfits, segmented by user preferences or data. Each section highlights different styles of attire, including elegant dresses and professional to casual jackets, modeled by individuals of different ethnic backgrounds.
Figure 5: Screenshot of the Q&A software annotation interface.
...and 6 more figures
