RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

Mingrui Wu, Lu Wang, Pu Zhao, Fangkai Yang, Jianjin Zhang, Jianfeng Liu, Yuefeng Zhan, Weihao Han, Hao Sun, Jiayi Ji, Xiaoshuai Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang, Rongrong Ji

TL;DR

This work tackles the gap between concise prompts and faithful image synthesis in text-to-image generation by introducing RePrompt, a reinforcement-learning-based reprompting framework that injects explicit reasoning into prompt construction. By decoupling prompt generation from image synthesis and training a prompting policy with a multi-faceted reward (visual realism, semantic alignment, and structured reasoning format), RePrompt achieves state-of-the-art compositional grounding across GenEval and T2I-CompBench while remaining backbone-agnostic. The approach yields substantial gains in spatial reasoning and attribute binding, with significantly lower latency than optimization-heavy baselines. The results demonstrate that structured, reasoning-guided prompts can robustly improve downstream visual fidelity without additional human annotations or retraining of the T2I model.

Abstract

Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-CompBench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.

Paper Structure

This paper contains 42 sections, 1 theorem, 14 equations, 11 figures, 7 tables.

Key Result

Theorem A.1

Figures (11)

  • Figure 1: Given the user prompt "a photo of a couch below a vase", existing models like DALL-E 3 generate rich language descriptions but often produce unrealistic or physically implausible compositions. In contrast, our RePrompt performs explicit chain-of-thought reasoning to resolve spatial relations, resulting in enhanced prompts that guide text-to-image models towards realistic and semantically aligned generations.
  • Figure 2: Overview of the proposed RePrompt. For each input prompt, RePrompt generates multiple reasoning trace and enhanced prompt pairs. The reasoning trace guides the model to produce more detailed, image-grounded prompts. These are used to synthesize candidate images via a T2I model, which are then scored by a reward model. Feedback is used to update RePrompt via GRPO.
  • Figure 3: The Visual-Reasoning Reward.
  • Figure 4: Impact of our method across different base T2I models on the GenEval benchmark. Our method consistently improves the compositional understanding across all base models.
  • Figure 5: Qualitative results on compositional prompts. Compared to vanilla T2I models, our RePrompt improves spatial layout and object relations by generating enhanced prompts with explicit reasoning, leading to more faithful compositions.
  • ...and 6 more figures
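
The Figure 2 caption says that several candidates are sampled per input prompt, scored by a reward model, and used to update RePrompt via GRPO. As a rough illustration of the group-relative step that GRPO is generally known for, the sketch below normalizes each candidate's reward against the other candidates for the same prompt. The function name, the example reward values, the use of the population standard deviation, and the eps constant are all assumptions made for illustration; none of them are taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (all candidates for one input prompt).

    Hypothetical helper: each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices, not the paper's settings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up reward-model scores for four candidate prompts of one user prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal: all advantages are 0.
flat = group_relative_advantages([0.7, 0.7, 0.7])
```

Candidates scoring above the group mean receive a positive advantage and those below receive a negative one, so the update signal depends only on how a candidate compares with its siblings rather than on the absolute reward scale.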

Theorems & Definitions (2)

  • Theorem A.1: Variance Reduction
  • proof