Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, Zi Wang

TL;DR

This work tackles the under-specification problem in text-to-image generation by introducing proactive T2I agents that maintain a belief graph of user intent, ask targeted clarification questions, and present uncertainty in an editable visual form. The authors develop three modular agent prototypes and an automatic evaluation pipeline using simulated ground-truth intents across COCO-Captions, ImageInWords, and DesignBench, achieving at least a 2× improvement in VQAScore over single-turn baselines. Empirical results from automatic metrics and human studies show that such agents can significantly reduce user iterations, produce more accurate images, and provide valuable transparency through belief graphs. The findings suggest that proactive information gathering and interpretable belief representations can make generative AI more controllable, safer, and accessible to diverse users. The work also opens avenues for future work on end-to-end generation from belief graphs and fine-tuning multimodal models on interactive trajectories.

Abstract

User prompts for generative AI models are often underspecified, leading to a misalignment between the user intent and models' understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents, one with a ground truth intent (an image) while the other tries to ask as few questions as possible to align with the ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014) and DesignBench, a benchmark we curated with strong artistic and design elements. Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach. Code and DesignBench can be found at https://github.com/google-deepmind/proactive_t2i_agents.

Paper Structure

This paper contains 49 sections, 18 figures, 5 tables, 2 algorithms.

Figures (18)

  • Figure 1: Our proactive T2I agent clarifies a user prompt with questions, incorporates user feedback, and expresses its uncertainty and understanding as an editable belief graph.
  • Figure 2: (a) Each column displays the output of an agent after 15 turns; the rightmost column shows the target image, which belongs to DesignBench. (b) A visualization of the multi-turn evaluation setup used in the experiments. These are real generated outputs and simulated user outputs at turns 3, 10, and 15.
  • Figure 3: ImageInWords results, including (a) T2T, (b) I2I, (c) T2I, and (d) NLL scores. Agents outperform standard T2I setups, and their performance tends to improve for up to 10 turns.
  • Figure 4: Human study results on the dialogues between agents and simulated users. (a) Issues in each agent's questions, as determined by human raters. (b) Ratings of how well the final generated image corresponds to the user prompt and dialogue.
  • Figure 5: Real multi-turn dialogs generated by the Ag1, Ag2, and Ag3 agents on an image from DesignBench. The figure also shows the image each agent generated after its 5-turn dialog.
  • ...and 13 more figures