Capability-aware Prompt Reformulation Learning for Text-to-Image Generation

Jingtao Zhan; Qingyao Ai; Yiqun Liu; Jia Chen; Shaoping Ma

TL;DR

This work addresses the challenge of crafting high-quality prompts for text-to-image generation by learning from user interaction logs. It introduces CAPR, a capability-aware reformulation framework with two components: a Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). Together they let prompts be rewritten according to a specified user capability and a controllable quality target. Trained on real-world reformulation data, with Bayesian optimization used to tune the capability features, CAPR outperforms baselines on both seen and unseen text-to-image systems. The approach aims to make high-quality artistic generation more accessible by adapting to diverse user skill levels and offering configurable guidance at inference time.

Abstract

Text-to-image generation systems have emerged as revolutionary tools in the realm of artistic creation, offering unprecedented ease in transforming textual prompts into visual art. However, the efficacy of these systems is intricately linked to the quality of user-provided prompts, which often poses a challenge to users unfamiliar with prompt crafting. This paper addresses this challenge by leveraging user reformulation data from interaction logs to develop an automatic prompt reformulation model. Our in-depth analysis of these logs reveals that user prompt reformulation is heavily dependent on the individual user's capability, resulting in significant variance in the quality of reformulation pairs. To effectively use this data for training, we introduce the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively integrates user capability into the reformulation process through two key components: the Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). CRM reformulates prompts according to a specified user capability, as represented by CCF. The CCF, in turn, offers the flexibility to tune and guide the CRM's behavior. This enables CAPR to effectively learn diverse reformulation strategies across various user capacities and to simulate high-capability user reformulation during inference. Extensive experiments on standard text-to-image generation benchmarks showcase CAPR's superior performance over existing baselines and its remarkable robustness on unseen systems. Furthermore, comprehensive analyses validate the effectiveness of different components. CAPR can facilitate user-friendly interaction with text-to-image systems and make advanced artistic creation more achievable for a broader range of users.
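The conditioning idea in the abstract (CRM rewrites a prompt given a capability described by CCF) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the two features (`quality`, `length_ratio`), the tag-style serialization, and the function names are hypothetical and are not taken from the paper, which does not list its feature set in this summary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CapabilityFeatures:
    """Hypothetical stand-in for CCF; the real feature set is not given here."""
    quality: float       # assumed: quality score of the reformulated prompt's images
    length_ratio: float  # assumed: word count of reformulation / word count of initial prompt


def compute_features(initial: str, reformulated: str, quality: float) -> CapabilityFeatures:
    """Derive the (assumed) capability features from one logged reformulation pair."""
    n_initial = max(len(initial.split()), 1)  # guard against empty prompts
    ratio = len(reformulated.split()) / n_initial
    return CapabilityFeatures(quality=quality, length_ratio=ratio)


def serialize(initial: str, feats: CapabilityFeatures) -> str:
    """Prefix the initial prompt with the features so a seq2seq model can condition on them."""
    return (f"<quality={feats.quality:.2f}> "
            f"<length_ratio={feats.length_ratio:.2f}> {initial}")


def build_training_example(initial: str, reformulated: str, quality: float) -> Tuple[str, str]:
    """Training: features come from the observed pair; the target is the user's reformulation."""
    feats = compute_features(initial, reformulated, quality)
    return serialize(initial, feats), reformulated


def build_inference_input(initial: str, target: CapabilityFeatures) -> str:
    """Inference: features are chosen freely, e.g. to simulate a high-capability user."""
    return serialize(initial, target)
```

Under these assumptions, training sees features computed from each logged pair, while inference supplies whatever feature values are desired, which is what makes the features configurable.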

Paper Structure (35 sections, 5 equations, 6 figures, 4 tables)

Introduction
Related Work
Text-to-Image Generation
Query Reformulation
Text-to-Image Prompts
Problem Formulation
Analysis of Prompt Reformulation
Reformulation: Prompt vs. Query
Investigation of Prompt Session Log
Dataset
Analysis Methodology
Empirical Findings
Methodology
Model Architecture
Conditional Reformulation Model (CRM)
...and 20 more sections

Figures (6)

Figure 1: Comparing prompt reformulation with query reformulation in terms of three key factors: ① the initial input, ② the user's understanding of the system's mechanics, and ③ the system's previous output. The latter two can hardly help users reformulate better prompts, indicating that prompt reformulation is a more challenging task for users.
Figure 2: Comparison of generation quality between initial and reformulated prompts within a session, evaluated by the ImageReward (xu2023imagereward) and Aesthetic (aesthetic_predictor) scoring models. Results reveal limited quality improvement through users' reformulation, suggesting that prompt quality largely depends on the user's initial capabilities. Session contexts such as generation feedback usually offer limited assistance.
Figure 3: Architecture of Capability-aware Prompt Reformulation (CAPR). It consists of two components: the Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). Given a certain user capability indicated by CCF, CRM reformulates prompts accordingly.
Figure 4: Training process of Capability-aware Prompt Reformulation (CAPR). Configurable Capability Features (CCF) is computed based on the training pairs, and Conditional Reformulation Model (CRM) is trained to predict the reformulated prompt given the initial prompt and CCF.
Figure 5: Configuration of Configurable Capability Features (CCF). The Conditional Reformulation Model (CRM) has been trained and is frozen. Within CCF, the overall quality metric is set to the highest, and other features of CCF are tuned to maximize the generation quality. The tuning process is accelerated with Bayesian optimization.
...and 1 more figure
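The configuration step described in the Figure 5 caption (CRM frozen, overall quality fixed at its highest value, remaining features tuned to maximize generation quality) can be sketched as a black-box search. This sketch deliberately swaps in plain random search for the Bayesian optimization the caption mentions, to stay dependency-free. The names `tune_features`, `score_fn`, and `toy_score` are hypothetical; in a real setup `score_fn` would presumably reformulate prompts with the frozen CRM, generate images, and score them, none of which is implemented here.

```python
import random
from typing import Callable, Dict, Tuple


def tune_features(
    score_fn: Callable[[Dict[str, float]], float],
    search_space: Dict[str, Tuple[float, float]],
    fixed: Dict[str, float],
    n_trials: int = 50,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Random search over the tunable features; `fixed` features are never changed."""
    rng = random.Random(seed)
    best_cfg: Dict[str, float] = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        cfg = dict(fixed)  # e.g. overall quality pinned to its highest value
        for name, (low, high) in search_space.items():
            cfg[name] = rng.uniform(low, high)
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_score(cfg: Dict[str, float]) -> float:
    """Toy objective (an assumption, not a real image scorer): peaks at length_ratio = 3."""
    return cfg["quality"] - (cfg["length_ratio"] - 3.0) ** 2


if __name__ == "__main__":
    cfg, score = tune_features(
        toy_score,
        search_space={"length_ratio": (1.0, 6.0)},
        fixed={"quality": 1.0},
        n_trials=200,
    )
    print(cfg, score)
```

A Bayesian optimizer would replace the uniform sampling with proposals informed by earlier trials, which matters when each evaluation involves image generation and is therefore expensive.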
