Improving Text-to-Image Consistency via Automatic Prompt Optimization

Oscar Mañas; Pietro Astolfi; Melissa Hall; Candace Ross; Jack Urbanek; Adina Williams; Aishwarya Agrawal; Adriana Romero-Soriano; Michal Drozdzal

TL;DR

This paper tackles the problem of prompt-image mismatch in text-to-image generation by introducing OPT2I, a training-free, optimization-by-prompting framework that uses an LLM to iteratively rewrite user prompts to maximize a prompt-image consistency score. OPT2I operates without fine-tuning the T2I models, leveraging in-context learning and a history of prompt-score pairs to refine prompts and improve consistency across multiple seeds. The approach demonstrates substantial gains on MSCOCO and PartiPrompts datasets (up to 12.2% and 24.9% DSG/dCS improvements, respectively) while preserving or enhancing image quality metrics like FID and recall, and it shows robustness to different LLMs, T2I models, and scoring metrics. The work highlights the potential of LLM-driven, inference-time prompt optimization as a practical path toward more reliable and controllable T2I systems, while acknowledging limitations in scorer reliability and computational cost.

Abstract

Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
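The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function name `optimize_prompt`, its parameters (`num_iters`, `prompts_per_iter`, `num_seeds`, `history_size`), and the choice to show the LLM only the top-scoring history entries in ascending order are all assumptions made for this sketch. The `toy_*` functions are hypothetical stand-ins for a real T2I model, consistency scorer, and LLM, included only so the sketch runs end to end.

```python
from typing import Callable, List, Tuple

PromptScore = Tuple[str, float]


def optimize_prompt(
    user_prompt: str,
    propose: Callable[[str, List[PromptScore], int], List[str]],
    generate: Callable[[str, int], object],
    score: Callable[[str, object], float],
    num_iters: int = 3,
    prompts_per_iter: int = 2,
    num_seeds: int = 4,
    history_size: int = 5,
) -> PromptScore:
    """Return the (prompt, score) pair with the highest average consistency."""

    def evaluate(prompt: str) -> float:
        # Generate one image per seed and score each image against the
        # ORIGINAL user prompt, not against the revised prompt.
        images = [generate(prompt, seed) for seed in range(num_seeds)]
        return sum(score(user_prompt, image) for image in images) / num_seeds

    # The history starts with the user prompt itself.
    history: List[PromptScore] = [(user_prompt, evaluate(user_prompt))]
    for _ in range(num_iters):
        # Assumption: pass the best `history_size` pairs, worst first.
        top = sorted(history, key=lambda pair: pair[1])[-history_size:]
        for candidate in propose(user_prompt, top, prompts_per_iter):
            history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda pair: pair[1])


# --- Hypothetical toy components, for illustration only ---
STOPWORDS = {"a", "of", "on", "the"}


def toy_generate(prompt: str, seed: int) -> set:
    # Toy "T2I model": only the first 4 or 5 words (depending on the seed)
    # make it into the "image", which is just a set of words.
    visible = 4 + seed % 2
    return set(prompt.lower().split()[:visible])


def toy_score(user_prompt: str, image: set) -> float:
    # Toy "scorer": fraction of the user prompt's content words in the image.
    wanted = [w for w in user_prompt.lower().split() if w not in STOPWORDS]
    return sum(w in image for w in wanted) / len(wanted)


def toy_propose(user_prompt: str, top_history: List[PromptScore], n: int) -> List[str]:
    # Toy "LLM": revise the best prompt so far (last entry, ascending order).
    words = top_history[-1][0].split()
    stripped = " ".join(w for w in words if w.lower() not in STOPWORDS)
    rotated = " ".join(words[-1:] + words[:-1])
    return [stripped, rotated][:n]


user_prompt = "a photo of two red apples on a wooden table"
best_prompt, best_score = optimize_prompt(
    user_prompt, toy_propose, toy_generate, toy_score
)
print(best_prompt, best_score)
```

In a real setup, `generate` would call a T2I model, `score` would compute a consistency metric such as DSG or dCS, and `propose` would build a meta-prompt from the task description and the prompt-score history and send it to an LLM. Because the history always contains the user prompt, the returned score can never be lower than the initial one.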

Paper Structure (24 sections, 2 equations, 12 figures, 10 tables)

Introduction
OPT2I: Optimization by prompting for T2I
Problem formulation
Meta-prompt design
Optimization objective
Exploration-exploitation trade-off
Experiments
Experimental setting
Main results
Trade-offs with image quality and diversity
Ablations
Post-hoc image selection
Related work
Conclusions
Additional method details
...and 9 more sections

Figures (12)

Figure 1: Overview of our backpropagation-free text-to-image optimization by prompting approach that rewrites user prompts with the goal of improving prompt-image consistency. Our framework is composed of a text-to-image generative model (T2I), a large language model (LLM) and a consistency objective (Scorer). The LLM iteratively leverages a history of prompt-score pairs to suggest revised prompts. In the depicted example, our system improves the consistency score by over 30% in terms of Davidsonian Scene Graph score.
Figure 2: Our prompt optimization framework, OPT2I, composed of (1) a text-to-image (T2I) generative model that generates images from text prompts, (2) a consistency metric that evaluates the fidelity between the generated images and the user prompt, and (3) a large language model (LLM) that leverages task description and a history of prompt-score tuples to provide revised prompts. At the beginning, the revised prompt is initialized with the user prompt.
Figure 3: OPT2I curves with different consistency objectives (dCS vs. DSG), LLMs, and T2I models. Each plot tracks either the max or the mean relative improvement in consistency across revised prompts per iteration.
Figure 4: Selected qualitative results for prompts from MSCOCO (a-b) and P2 (c-d) datasets, using DSG as the consistency metric. For each setup, we display four rows (from the top): initial prompt #1, optimized prompt #1, initial prompt #2, and optimized prompt #2. Each column corresponds to a different T2I model random seed. We report the average consistency score across seeds in parentheses.
Figure 5: Cumulative max relative dCS as a function of #revised prompts = #iterations $\cdot$ #prompts/iter.
...and 7 more figures
