Taming Text-to-Image Synthesis for Novices: User-centric Prompt Generation via Multi-turn Guidance

Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, Boxing Chen, Osamu Yoshie

TL;DR

DialPrompt tackles novice users' difficulty in text-to-image prompt writing by employing a multi-turn dialogue that guides preference elicitation across 15 essential prompt dimensions. It constructs MTGPD, a dataset of over 500 multi-turn dialogues, and trains via multi-turn supervised fine-tuning with a masking strategy to learn step-by-step guidance. Experimental results show DialPrompt yields superior user-centricity scores and competitive image fidelity and aesthetics, with strong transfer across multiple TIS models and baselines. The work advances user-centric AI for TIS by enabling interpretable, interactive prompt synthesis and provides an open dataset for future research.

Abstract

The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models are sensitive to textual prompts, posing a challenge for novice users who may not be familiar with TIS prompt writing. Existing solutions alleviate this via automatic prompt expansion or generation from a user query. However, this single-turn approach suffers from limited user-centricity in terms of result interpretability and user interactivity. Thus, we propose DialPrompt, a dialogue-based TIS prompt generation model that emphasizes user experience for novice users. DialPrompt is designed to follow a multi-turn workflow, where in each round of dialogue the model guides the user to express their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt improves user-centricity by allowing users to perceive and control the creation process of TIS prompts. Experiments indicate that DialPrompt significantly improves the user-centricity score compared with existing approaches while maintaining competitive quality of synthesized images. In our user evaluation, DialPrompt is highly rated by 19 human reviewers (especially novices).
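
The TL;DR mentions multi-turn supervised fine-tuning with a masking strategy but does not spell it out here. The sketch below shows one common way such masking is done, assuming that only the model's (assistant) turns are supervised while user turns are masked out of the loss. This is an illustrative assumption, not the authors' confirmed recipe: the dialogue content, the `toy_tokenize` helper, and `build_masked_example` are all hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # label value that cross-entropy losses conventionally skip


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical whitespace tokenizer that grows the vocabulary on the fly."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def build_masked_example(
    dialogue: List[Tuple[str, str]], vocab: Dict[str, int]
) -> Tuple[List[int], List[int]]:
    """Flatten a multi-turn dialogue into token ids and loss labels.

    Assumption: tokens from "assistant" turns keep their ids as labels,
    while tokens from every other role are replaced by IGNORE_INDEX so
    they provide context but contribute nothing to the training loss.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for role, text in dialogue:
        ids = toy_tokenize(text, vocab)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels


# Made-up dialogue: a guidance question, the user's preference, the final prompt.
dialogue = [
    ("user", "a cat on a sofa"),
    ("assistant", "Which art style do you prefer ?"),
    ("user", "watercolor"),
    ("assistant", "watercolor painting of a cat on a sofa"),
]
vocab: Dict[str, int] = {}
input_ids, labels = build_masked_example(dialogue, vocab)
```

Under this assumption the model is trained both to ask the guiding question and to write the final prompt, conditioned on everything the user said earlier in the dialogue.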

Paper Structure (32 sections, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: Two user cases of TIS prompt generation with (a) single-turn style and (b) multi-turn guidance style.
  • Figure 2: Occurrence distribution of 15 extracted dimensions in 5k advanced TIS prompts.
  • Figure 3: Illustration on the dataset construction, training and inference of DialPrompt.
  • ...and 5 more figures