
IRIS: Intrinsic Reward Image Synthesis

Yihang Chen, Yuanhao Ban, Yunqi Hong, Cho-Jui Hsieh

TL;DR

This paper tackles the scarcity of human preference data for training autoregressive text-to-image (T2I) models by introducing IRIS, a framework that uses intrinsic self-uncertainty signals as rewards. Contrary to findings in text-only domains, maximizing self-certainty harms image diversity; rewarding negative self-certainty (NSC) instead encourages richer, more human-aligned images. IRIS uses GRPO and semantic Chains of Thought (CoTs) to guide image synthesis, matching or exceeding externally rewarded baselines on the GenEval, T2I-CompBench, and WISE benchmarks, particularly for smaller base models. The work also shows that intrinsic rewards can foster general T2I abilities and long-form reasoning, with forward-KL-based NSC outperforming entropy-based formulations. Overall, IRIS presents a scalable, architecture-agnostic approach to RL for T2I that reduces reliance on labeled data and external verifiers while improving reasoning and visual quality.
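
GRPO, which the TL;DR names as the RL algorithm, scores each sampled output relative to the other samples drawn for the same prompt. A minimal sketch of that group-relative normalization follows, assuming a scalar intrinsic reward is already available for each sample. The function name, the `eps` stabilizer, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group standard deviation. `eps` (an assumption) avoids
    division by zero when every sample receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Example: three images sampled for one prompt, with made-up rewards.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples that score above the group mean receive positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and contributes no learning signal.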

Abstract

Despite the success of Reinforcement Learning from Human Feedback (RLHF) in language reasoning, its application to autoregressive Text-to-Image (T2I) generation is often constrained by the limited availability of human preference data. This paper explores how an autoregressive T2I model can learn from internal signals without relying on external rewards or labeled data. Contrary to recent findings in text generation, we show that maximizing self-uncertainty, rather than self-certainty, improves image generation. We observe that this is because autoregressive T2I models with low uncertainty tend to generate simple and uniform images, which are less aligned with human preferences. Based on these observations, we propose IRIS (Intrinsic Reward Image Synthesis), the first framework to improve autoregressive T2I models with reinforcement learning using only an intrinsic reward. Empirical results demonstrate that applying IRIS to autoregressive T2I models achieves performance that is competitive with or superior to external rewards.

Paper Structure

This paper contains 38 sections, 8 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: We perform reinforcement learning (RL) fine-tuning on Janus-Pro using three reward schemes: (1) external reward (pretrained image reward models, etc.), (2) self-certainty reward, and (3) negative self-certainty reward. The self-certainty is computed as the negative cross-entropy between the model's output distribution and a uniform distribution, where higher self-certainty indicates greater model self-confidence. The figure presents results across three tasks: (i) single-object generation, (ii) spatial generation, and (iii) two-object generation. We observe that increased self-confidence typically results in more uniform and less visually diverse images, while lower self-confidence tends to generate images with richer visual features that are more preferred by humans. Please refer to the appendix subsection on visualized results for more examples.
  • Figure 2: Self-certainty on image tokens in Janus-Pro-1B (orange line, right $y$-axis) and on text tokens in Qwen2.5-1.5B-Instruct (blue line, left $y$-axis).
  • Figure 3: Main results of Janus-Pro-1B on GenEval, T2I-CompBench, and WISE.
  • Figure 4: Visualization of semantic CoTs. Left: training without semantic CoTs. Right: training with semantic CoTs.
  • Figure 5: Ablation study: minimizing image self-certainty outperforms maximizing it.
  • ...and 9 more figures
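
Figure 1 describes self-certainty as a comparison between the model's per-token output distribution and a uniform distribution, and the TL;DR mentions a forward-KL formulation. The sketch below assumes one common form of that idea: self-certainty as the mean per-token KL(U ‖ p), with U uniform over the vocabulary, and the negative self-certainty (NSC) reward as its negation. The paper's exact equation is not reproduced in this summary, so the formula and all function names here are assumptions for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_self_certainty(logits):
    """Assumed form: KL(U || p) with U uniform over the vocabulary.

    Equals 0 when p is uniform (maximal uncertainty) and grows as p
    becomes more peaked (more confident).
    """
    log_probs = log_softmax(logits)
    vocab = len(log_probs)
    return -math.log(vocab) - sum(log_probs) / vocab


def negative_self_certainty_reward(sequence_logits):
    """Hypothetical intrinsic reward: negated mean self-certainty
    over all generated token positions."""
    scores = [token_self_certainty(step) for step in sequence_logits]
    return -sum(scores) / len(scores)


# Toy example with a 4-token vocabulary and 3 generated positions.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
peaked_logits = [[10.0, 0.0, 0.0, 0.0]] * 3
reward_uniform = negative_self_certainty_reward(uniform_logits)
reward_peaked = negative_self_certainty_reward(peaked_logits)
```

Under this assumed form, the confident (peaked) sequence receives a lower reward than the uncertain (uniform) one, which matches the direction the abstract describes: the reward favors higher self-uncertainty.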