PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Linqing Wang, Ximing Xing, Yiji Cheng, Zhiyuan Zhao, Donghao Li, Tiankai Hang, Jiale Tao, Qixun Wang, Ruihuang Li, Comi Chen, Xin Li, Mingrui Wu, Xinchi Deng, Shuyang Gu, Chunyu Wang, Qinglin Lu

TL;DR

PromptEnhancer tackles the prompt alignment gap in text-to-image generation by introducing a universal, model-agnostic prompt rewriter that uses Chain-of-Thought reasoning. The rewriter is trained in two stages: supervised fine-tuning (SFT), followed by policy alignment guided by the AlignEvaluator, a reward model built around 24 key points. This training lets it generate prompts that better elicit faithful images from frozen T2I models. The approach shows broad improvements in image-text alignment on a challenging benchmark (T2I-KeyPoints-Align) and introduces a large-scale SFT and RL data pipeline, with a separate RL prompt set intended to support robust generalization. Together, these contributions enable accurate, detailed, and stylistically diverse image synthesis while providing a new human-aligned evaluation resource for the community.

Abstract

Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.

Paper Structure

This paper contains 21 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: PromptEnhancer enables high-fidelity and stylistically diverse image generation from user prompts. Using HunyuanImage 2.1 as the base T2I model, our method demonstrates its versatility across various domains, including photorealism, digital art, abstract geometry, and multilingual text-in-image generation. The examples showcase how minimal user inputs are transformed into rich, detailed prompts that yield high-quality visual outputs, bridging the gap between user intent and model execution.
  • Figure 2: An overview of the training framework for PromptEnhancer. Our framework trains a universal Rewriter to enhance a pretrained Text-to-Image (T2I) model without altering its weights. This is achieved through a two-stage process guided by a specialized reward model. Stage 1: SFT for Rewriter Initialization (Sec. \ref{sec:stage1_sft4cot}). The CoT Rewriter is first initialized via SFT on (user prompt, reprompt) pairs. This stage teaches the model to generate structured, chain-of-thought style responses using a standard next-token prediction loss, establishing a strong foundation for refinement. Stage 2: Policy Alignment with GRPO (Sec. \ref{sec:stage2_grpo}). The initialized rewriter is further refined using GRPO. The rewriter generates multiple reprompt candidates, which a frozen T2I model uses to create images. The pretrained AlignEvaluator then assesses each (image, user prompt) pair and provides a scalar reward. This reward signal optimizes the rewriter's policy, steering it toward prompts that maximize the alignment between the image and the user's intent. The AlignEvaluator (Sec. \ref{sec:alignevaluator}). Central to our framework is the AlignEvaluator, a pretrained reward model. It is trained on a large-scale dataset annotated against a taxonomy of 24 fine-grained key points (T2I-KeyPoints, Tab. \ref{tab:keypoints}). This enables it to provide a robust and nuanced reward signal, which is crucial for the policy alignment stage.
  • Figure 3: Overview of the construction and filtering pipeline for the PromptEnhancer training data. The process involves user prompt simulation, Gemini-based generation, human-in-the-loop selection, and automated filtering to ensure high quality.
  • Figure 4: Distribution of Categories in the Dataset. The chart on the left shows the primary categories, while the chart on the right provides a detailed breakdown into 20 sub-categories.
  • Figure 5: Distribution of evaluation dimensions in our dataset. (a) The detailed percentage of each of the 24 fine-grained KeyPoints, sorted in descending order. (b) The aggregated percentage for each of the six main Super-Categories, calculated by summing the percentages of their constituent KeyPoints. In both charts, colors represent the Super-Category, visually linking the detailed points to their broader classification.
  • ...and 6 more figures
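
The Stage 2 loop described in the Figure 2 caption (several reprompt candidates per user prompt, one scalar reward per candidate, rewards compared within the group) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `group_relative_advantages` and `score_candidates`, the `generate_image` and `align_reward` callables, and the `eps` constant are all hypothetical, and the mean/standard-deviation normalization is the form commonly used for GRPO, which this summary does not spell out.

```python
from statistics import mean, pstdev
from typing import Any, Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of candidates.

    Assumed form (common in GRPO): A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_candidates(
    user_prompt: str,
    candidates: List[str],
    generate_image: Callable[[str], Any],       # hypothetical stand-in for the frozen T2I model
    align_reward: Callable[[Any, str], float],  # hypothetical stand-in for the AlignEvaluator
) -> Tuple[List[float], List[float]]:
    """Score each reprompt candidate against the ORIGINAL user prompt."""
    rewards = []
    for reprompt in candidates:
        image = generate_image(reprompt)                    # image comes from the rewritten prompt
        rewards.append(align_reward(image, user_prompt))    # reward is judged on (image, user prompt)
    return rewards, group_relative_advantages(rewards)
```

Note that the reward is computed against the user's original prompt rather than the reprompt, so a candidate that produces a polished image but drifts from the user's intent would not be favored. Candidates with a positive advantage would be reinforced by the policy update, and those with a negative advantage discouraged; the update step itself is omitted here.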