RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards

Junyan Ye, Leiqi Zhu, Yuncheng Guo, Dongzhi Jiang, Zilong Huang, Yifan Zhang, Zhiyuan Yan, Haohuan Fu, Conghui He, Weijia Li

TL;DR

RealGen tackles the photorealism gap in text-to-image generation by integrating an LLM-based prompt optimizer with a diffusion-based generator and guiding both components with detector-based rewards. The authors introduce a multi-objective detector reward combining semantic-level and feature-level detectors and train the system via GRPO in a two-stage post-training process that optimizes the LLM and diffusion components in turn. They also propose RealBench, an automated benchmark with Detector-Scoring and Arena-Scoring to assess realism without human annotations. Experiments show RealGen outperforms strong general and photorealistic baselines on realism, detail, and aesthetics, with strong generalization to held-out detectors and real-image comparisons.

Abstract

With the continuous advancement of image generation technology, advanced models such as GPT-Image-1 and Qwen-Image have achieved remarkable text-to-image consistency and world knowledge. However, these models still fall short in photorealistic image generation. Even on simple T2I tasks, they tend to produce "fake" images with distinct AI artifacts, often characterized by "overly smooth skin" and "oily facial sheens". To recapture the original goal of "indistinguishable-from-reality" generation, we propose RealGen, a photorealistic text-to-image framework. RealGen integrates an LLM component for prompt optimization and a diffusion model for realistic image generation. Inspired by adversarial generation, RealGen introduces a "Detector Reward" mechanism, which quantifies artifacts and assesses realism using both semantic-level and feature-level synthetic image detectors. We leverage this reward signal with the GRPO algorithm to optimize the entire generation pipeline, significantly enhancing image realism and detail. Furthermore, we propose RealBench, an automated evaluation benchmark employing Detector-Scoring and Arena-Scoring. It enables human-free photorealism assessment, yielding results that are more accurate and aligned with real user experience. Experiments demonstrate that RealGen significantly outperforms general models like GPT-Image-1 and Qwen-Image, as well as specialized photorealistic models like FLUX-Krea, in terms of realism, detail, and aesthetics. The code is available at https://github.com/yejy53/RealGen.
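To make the reward-then-GRPO idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each detector returns a probability that an image is synthetic and that a text-image alignment score lies in [0, 1]. The function names, the weighted-average combination, and the default weights are all hypothetical; the paper's actual reward formula is not given in this summary. The second function shows the group-relative advantage normalization commonly used in GRPO, where each sample's reward is standardized against the other samples drawn for the same prompt.

```python
from statistics import mean, pstdev


def detector_reward(p_fake_semantic, p_fake_feature, alignment,
                    weights=(1.0, 1.0, 1.0)):
    """Combine three signals into one scalar reward in [0, 1].

    p_fake_semantic / p_fake_feature: assumed detector outputs, i.e. the
    probability that the image is synthetic (lower is better, so 1 - p
    is used). alignment: assumed text-image alignment score in [0, 1].
    The weighted average and the weights are illustrative assumptions.
    """
    w_sem, w_feat, w_align = weights
    total = w_sem + w_feat + w_align
    return (w_sem * (1.0 - p_fake_semantic)
            + w_feat * (1.0 - p_fake_feature)
            + w_align * alignment) / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for a prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    samples that fool the detectors better than their siblings get a
    positive advantage. Population std is used here as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four hypothetical images sampled for the same prompt.
group = [
    detector_reward(0.90, 0.80, 0.70),  # obvious artifacts
    detector_reward(0.60, 0.70, 0.75),
    detector_reward(0.30, 0.40, 0.70),
    detector_reward(0.10, 0.20, 0.72),  # most "real-looking"
]
advantages = group_relative_advantages(group)
```

Normalizing within the group means only the relative ranking of samples matters, which is why a separate learned value function is not needed in this style of training.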

Paper Structure

This paper contains 18 sections, 6 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Images generated by our proposed RealGen. RealGen achieves superior photorealism and enhanced details, outperforming both powerful T2I models, such as Qwen-Image, and specialized photorealistic models, like FLUX-Krea.
  • Figure 2: From Synthetic Artifacts to Photorealism. In contrast to the common "fake-feel" AI artifacts of previous T2I methods, our proposed RealGen achieves enhanced photorealism by progressively evading semantic-level and feature-level detectors.
  • Figure 3: Overview of the RealGen Method. (a) The architecture of RealGen, consisting of an LLM component and a Diffusion component. (b) Our detector-based reward model, which evaluates images based on visible artifacts, feature-level artifacts, and text-image alignment. (c) The two-stage post-training process guided by this reward model, which respectively optimizes the LLM and Diffusion components.
  • Figure 4: Overview of the RealBench. The left shows the categorical data composition. The right details its evaluation protocol.
  • Figure 5: Qualitative comparison of different methods on RealBench.
  • ...and 4 more figures