Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Seungwook Kim, Minsu Cho

TL;DR

ARC (Adaptive Rewarding by self-Confidence) is a post-training framework that replaces external reward supervision with an internal self-confidence signal. The signal measures how accurately the model recovers injected noise under self-denoising probes, which enables fully unsupervised optimization without additional datasets, annotators, or reward models.

Abstract

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training text-to-image generative models is a promising path to better alignment with human preferences, improved factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains over the baseline in compositional generation, text rendering, and text-image alignment. We also find that integrating ARC with external rewards yields complementary improvements and alleviates reward hacking.
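To make the self-denoising probe concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a rectified-flow style interpolation $z_t = (1-t)\,z_0 + t\,\epsilon$ and takes the negative mean squared noise-recovery error as the scalar self-confidence; the abstract specifies neither choice. The names `renoise`, `self_confidence`, and `make_toy_denoiser` are hypothetical, and the toy denoiser only stands in for a real text-to-image model.

```python
import random


def renoise(z0, eps, t):
    """Interpolate a clean latent toward noise: z_t = (1 - t) * z0 + t * eps.

    Assumption: rectified-flow style interpolation with t in (0, 1].
    """
    return [(1.0 - t) * a + t * b for a, b in zip(z0, eps)]


def self_confidence(predict_noise, z0, timesteps, num_probes, rng):
    """Score a latent by how well the model recovers the injected noise.

    For every timestep, draw `num_probes` Gaussian noise probes, re-noise the
    latent, ask the model for its noise estimate, and accumulate the mean
    squared error. The negative average error is returned, so a higher value
    means higher self-confidence (an assumed mapping to a scalar reward).
    """
    total, count = 0.0, 0
    for t in timesteps:
        for _ in range(num_probes):
            eps = [rng.gauss(0.0, 1.0) for _ in z0]
            z_t = renoise(z0, eps, t)
            eps_hat = predict_noise(z_t, t)
            total += sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(z0)
            count += 1
    return -total / count


def make_toy_denoiser(believed_z0):
    """Hypothetical stand-in for the model.

    It inverts the interpolation as if the clean latent were `believed_z0`,
    so noise recovery is exact only for latents that match that belief.
    """
    def predict_noise(z_t, t):
        return [(zt - (1.0 - t) * m) / t for zt, m in zip(z_t, believed_z0)]
    return predict_noise
```

With this toy model, a latent equal to `believed_z0` scores close to zero (the maximum), while any other latent scores lower. That mirrors the idea of reinforcing generations the model itself can denoise well.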

Paper Structure (32 sections, 14 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Quantitative results of ARC. We evaluate ARC post-training on SD3.5 base models [esser2024scaling], using GenEval [ghosh2023geneval], Text Rendering, human preference reward models [kirstain2023pick, wu2023human, xu2023imagereward, wang2025unified], and image quality metrics. ARC post-training yields consistent gains across text-to-image generative models and quantitative metrics.
  • Figure 2: Overview of ARC. Given a text prompt $c$, we generate $G$ different latents. Without decoding, we re-noise the latents using $K$ noise probes across $t\in\mathcal{T}\subset[0,1]$. For each generated latent $z_0^{(i)}$, we define the text-to-image generative model's self-confidence in that latent as its ability to denoise the re-noised latent. We use this self-confidence as a scalar internal reward to post-train the text-to-image generative model with GRPO [shao2024deepseekmathgrpo, liu2025flowgrpo]. We omit the KL term in this figure for better readability.
  • Figure 4: Effect of ARC post-training on SD3.5-M after post-training on PickScore [kirstain2023pick] using FlowGRPO [liu2025flowgrpo]. ARC complements external rewards, showing the best compositional generation and visual appeal on GenEval [ghosh2023geneval]. Post-training on external rewards yields high visual appeal but sacrifices compositionality, as shown above (Column 3: generates a yellow motorcycle instead / generates an unwanted human).
  • Figure 5: Qualitative results of ARC applied to SD3.5 [esser2024scaling] on DrawBench [saharia2022photorealistic], GenEval [ghosh2023geneval], and OCR [cui2025paddleocr]. Applying ARC yields consistent improvements over the baseline SD3.5.
  • Figure 6: Rationale of ARC. Distributions of the denoising-based self-confidence under three inference settings: $10$ steps (no CFG), $10$ steps (CFG), and $20$ steps (CFG). The distribution shifts monotonically rightward (higher self-confidence) in the same order that visual quality improves, indicating that the ability to recover injected noise is predictive of sample quality even when the scorer is the same model. This alignment underpins ARC’s use of self-confidence as an intrinsic reward.
  • ...and 3 more figures
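
As a rough illustration of the last step in the Figure 2 caption, the sketch below turns the $G$ scalar self-confidence rewards of one prompt into group-relative advantages, the normalization commonly used in GRPO. It is an assumption-laden sketch, not the paper's code. The function name `group_relative_advantages` is hypothetical, the population standard deviation is one possible choice, and the KL term and the policy update are left out.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt group to zero mean and unit scale.

    Each advantage is (r_i - mean) / (std + eps), where mean and std are taken
    over the G samples generated for the same prompt. `eps` guards against
    division by zero when every reward in the group is identical.
    Assumption: population standard deviation; implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four latents for one prompt, scored by a self-confidence reward
# (higher means the model recovered the injected noise more accurately).
advantages = group_relative_advantages([-0.9, -0.4, -0.1, -0.6])
```

Samples with above-average self-confidence receive positive advantages and are reinforced, while below-average samples are pushed down. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.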