FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

Liyao Jiang; Negar Hassanpour; Mohammad Salameh; Mohan Sai Singamsetti; Fengyu Sun; Wei Lu; Di Niu

TL;DR

FRAP tackles prompt-image faithfulness in diffusion-based text-to-image generation by adaptively adjusting per-token prompt weights during the reverse diffusion process. It defines a unified objective on the cross-attention (CA) maps, L_t = L_presence,t - λ L_binding,t, where L_presence encourages every object to appear and L_binding encourages correct object-modifier binding; the per-token weights are updated by gradient steps after each time-step. FRAP uses spaCy to extract the object and modifier tokens and avoids latent-code optimization. It achieves higher prompt-image alignment and image authenticity (CLIP-IQA-Real) on complex prompts, with lower latency than prior latent-code methods. FRAP can also be combined with LLM-based prompt rewriting to recover the alignment that rewriting degrades, and it extends to SDXL, offering a practical inference-time approach to faithful and realistic T2I generation.
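
To make the update rule above concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the cross-attention model (a per-pixel softmax over tokens of weight times a fixed logit), the presence term (one minus the peak attention of each object token), the binding term (a mean product-overlap score, so that subtracting it rewards overlap and matches the sign in the formula above), the finite-difference gradient, the clipping bounds, and all function names. In a real diffusion model the CA maps would be recomputed at every denoising step; here the logits are frozen.

```python
import math

def attention_maps(weights, logits):
    """Toy CA maps (assumption): per pixel, softmax over tokens of weight * logit."""
    n_tokens, n_pixels = len(logits), len(logits[0])
    maps = [[0.0] * n_pixels for _ in range(n_tokens)]
    for p in range(n_pixels):
        exps = [math.exp(weights[k] * logits[k][p]) for k in range(n_tokens)]
        total = sum(exps)
        for k in range(n_tokens):
            maps[k][p] = exps[k] / total
    return maps

def unified_loss(weights, logits, objects, pairs, lam=1.0):
    """L = L_presence - lam * L_binding, with illustrative stand-in terms."""
    maps = attention_maps(weights, logits)
    # Presence (assumption): an object is "present" if its map has a strong peak.
    presence = sum(1.0 - max(maps[o]) for o in objects) / len(objects)
    # Binding (assumption): mean product overlap of modifier and object maps.
    binding = 0.0
    if pairs:
        for m, o in pairs:
            binding += sum(a * b for a, b in zip(maps[m], maps[o])) / len(maps[m])
        binding /= len(pairs)
    return presence - lam * binding

def adapt_weights(weights, logits, objects, pairs, lam=1.0, lr=0.1,
                  steps=1, bounds=(0.5, 2.0), eps=1e-4):
    """Gradient steps on the per-token weights; the latent code is never touched."""
    w = list(weights)
    for _ in range(steps):
        grad = []
        for k in range(len(w)):
            up = w[:k] + [w[k] + eps] + w[k + 1:]
            down = w[:k] + [w[k] - eps] + w[k + 1:]
            # Central finite difference stands in for autograd in this sketch.
            diff = (unified_loss(up, logits, objects, pairs, lam)
                    - unified_loss(down, logits, objects, pairs, lam))
            grad.append(diff / (2 * eps))
        # Descent step, clipped to a fixed range (the range is an assumption).
        w = [min(bounds[1], max(bounds[0], wk - lr * g)) for wk, g in zip(w, grad)]
    return w

# Hypothetical 3-token prompt over 4 "pixels": red (modifier), car, dog (objects).
logits = [
    [2.0, 1.5, 0.2, 0.1],  # red
    [2.2, 1.6, 0.3, 0.1],  # car
    [0.1, 0.2, 0.9, 0.6],  # dog
]
objects = [1, 2]   # indices of object tokens
pairs = [(0, 1)]   # (modifier, object) index pairs
start = [1.0, 1.0, 1.0]
adapted = adapt_weights(start, logits, objects, pairs, steps=20)
print("weights:", [round(x, 3) for x in adapted])
print("loss before:", unified_loss(start, logits, objects, pairs))
print("loss after: ", unified_loss(adapted, logits, objects, pairs))
```

Because only the weight vector changes, the sketch mirrors the design choice described above: the optimization acts on the prompt conditioning rather than on the latent code.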

Abstract

Text-to-image (T2I) diffusion models have demonstrated impressive capabilities in generating high-quality images given a text prompt. However, ensuring prompt-image alignment, i.e., generating images that faithfully reflect the prompt's semantics, remains a considerable challenge. Recent works attempt to improve faithfulness by optimizing the latent code, which could push the latent code out-of-distribution and thus produce unrealistic images. In this paper, we propose FRAP, a simple yet effective approach that adaptively adjusts the per-token prompt weights to improve the prompt-image alignment and authenticity of the generated images. We design an online algorithm that adaptively updates each token's weight coefficient by minimizing a unified objective function that encourages object presence and the binding of object-modifier pairs. Through extensive evaluations, we show that FRAP generates images with significantly higher prompt-image alignment on prompts from complex datasets, while having a lower average latency than recent latent code optimization methods, e.g., 4 seconds faster than D&B on the COCO-Subject dataset. Furthermore, through visual comparisons and evaluation of the CLIP-IQA-Real metric, we show that FRAP not only improves prompt-image alignment but also generates more authentic images with realistic appearances. We also explore combining FRAP with a prompt-rewriting LLM to recover the prompt-image alignment that rewriting degrades, and we observe improvements in both prompt-image alignment and image quality. We release the code at the following link: https://github.com/LiyaoJiang1998/FRAP/.

Paper Structure (51 sections, 10 equations, 14 figures, 15 tables, 1 algorithm)

Introduction
Preliminaries
Stable Diffusion (SD)
UNet
Cross-Attention (CA)
Text Encoder
Prompt Weighting
Related works
Method
Unified Objective
Object Presence Loss
Object-Modifier Binding Loss
Adaptive Prompt Weighting
Experiments
Prompt-Image Alignment, Overall Image Quality, and Image Authenticity
...and 36 more sections

Figures (14)

Figure 1: Method overview. We perform adaptive updates to the per-token prompt weight coefficients to improve faithfulness. The updates are guided by our unified loss function, defined on the CA maps, which strengthens object presence and encourages object-modifier binding. FRAP is capable of correctly generating every object along with accurate object-modifier bindings.
Figure 2: Qualitative comparison on the complex Color-Object-Scene dataset of SD, the training-based LLM prompt-rewriting method Promptist (Hao et al., 2023), FRAP applied to the Promptist-rewritten prompt, and FRAP. For each prompt, we show images generated by all four methods (using the same set of seeds). The object tokens are highlighted in green and modifier tokens are highlighted in blue.
Figure 3: Qualitative comparison. Stable Diffusion (Rombach et al., 2022) often exhibits both (i) catastrophic neglect and (ii) incorrect attribute binding. A&E (Chefer et al., 2023) and D&B (Li et al., 2023) directly modify the latent code to improve faithfulness. However, the latent code could go out-of-distribution after several iterative updates, resulting in lower-quality and less realistic images despite improved semantics. Our FRAP method can generate images that are faithful to the prompt while maintaining a realistic appearance and aesthetics.
Figure 4: Qualitative comparison on prompts from different datasets (more in the appendix). For each prompt, we show images generated by all four methods (using the same set of seeds). The object tokens are highlighted in green and modifier tokens are highlighted in blue.
Figure 5: Human evaluation interface. See the human evaluation section for detailed results.
...and 9 more figures
