RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection

Tianyu Wang, Zhiyuan Ma, Qian Wang, Xinyi Zhang, Xinwei Long, Bowen Zhou

TL;DR

RL-RIG is a Reinforcement Learning framework for Reflection-based Image Generation that outperforms existing state-of-the-art open-source models by up to 11% on controllable and precise spatial reasoning in image generation.

Abstract

Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma: they fail to accurately capture fine-grained spatial relationships from the prompt and to generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark Chain of Thought reasoning ability in image generation. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO, which trains the VLM Actor to produce edit prompts and the Image Editor to produce better image quality under a given prompt. Unlike traditional approaches that produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, using Scene Graph IoU and a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on the LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
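The Generate-Reflect-Edit paradigm described in the abstract can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the function names (`generate`, `check`, `propose_edit`, `edit`), the `max_rounds` limit, and the toy stand-ins that treat an "image" as a set of satisfied relations are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Step:
    """One reflection round: what was wrong and which edit was requested."""
    unmet: List[str]
    edit_prompt: str


def generate_reflect_edit(
    prompt: List[str],
    generate: Callable[[List[str]], Any],                    # stands in for the Diffuser
    check: Callable[[Any, List[str]], List[str]],            # stands in for the VLM Checker
    propose_edit: Callable[[Any, List[str], List[str]], str],  # stands in for the VLM Actor
    edit: Callable[[Any, str], Any],                         # stands in for the Image Editor
    max_rounds: int = 3,                                     # hypothetical budget
) -> Tuple[Any, List[Step]]:
    image = generate(prompt)
    history: List[Step] = []
    for _ in range(max_rounds):
        unmet = check(image, prompt)       # Reflect: which relations are violated?
        if not unmet:
            break                          # every part of the prompt is satisfied
        edit_prompt = propose_edit(image, prompt, unmet)
        history.append(Step(unmet, edit_prompt))
        image = edit(image, edit_prompt)   # Edit: move to a new trajectory
    return image, history


# Toy stand-ins: an "image" is just the set of relations it satisfies.
def toy_generate(prompt):
    return frozenset(prompt[:1])           # the first draft gets only one relation right

def toy_check(image, prompt):
    return [r for r in prompt if r not in image]

def toy_propose(image, prompt, unmet):
    return unmet[0]                        # ask to fix the first violated relation

def toy_edit(image, edit_prompt):
    return image | {edit_prompt}


relations = ["car left of tree", "bike in front of car", "sky above tree"]
final_image, steps = generate_reflect_edit(
    relations, toy_generate, toy_check, toy_propose, toy_edit
)
```

In this toy run the loop needs two edit rounds before the checker reports no unmet relations; a real system would replace each stand-in with a model call.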

Paper Structure (27 sections, 17 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Comparison of the images generated by Stable Diffusion 3.5 Large, Flux 1.0, and RL-RIG (with Flux as the base model) against the ground truth image, given a sophisticated prompt with spatial relationships. Our method better captures the relations highlighted in blue and compensates for the deficiencies in Flux's results. Note that even the ground truth image does not comply with all of the relationships given in the annotation.
  • Figure 2: Overview of RL-RIG. The generation phase follows a Generate-Reflect-Edit paradigm; the training phase aims at shifting trajectories, assigning greater probabilities to positive trajectories while discouraging negative trajectories during sampling. Here, we suppose the Image Editor is composed of an inverse diffuser and a diffuser.
  • Figure 3: The Generate-Reflect-Edit framework, explained in a trajectory view. In each generation process, one of the possible trajectories is selected according to the random seed. The VLM Checker then reflects and checks whether all parts of the prompt are satisfied. If not, the VLM Actor provides an edit prompt and passes it to the Image Editor. Given the edit prompt, the Image Editor performs inversion and reversion to explore a new possible trajectory. Dashed branches denote low-advantage trajectories pruned by GRPO. The 'Inversion' is actually performed in the Edit stage.
  • Figure 4: Illustration of two-phase training. For each phase, a batch of responses is sampled, and the group advantage is calculated by GRPO.
  • Figure 5: A successful trial of image generation by RL-RIG, with the input prompt (id=523378) and the reasoning process. After reflection, the Actor successfully guides the Image Editor to add a two-wheeled vehicle in front of the black car.
  • ...and 1 more figure
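The group advantage mentioned in the Figure 4 caption can be illustrated with the standard GRPO normalization, in which each sampled response's reward is compared against the mean and standard deviation of its own group. The sketch below assumes that standard form; the paper's Reflection-GRPO may differ in its reward definition and other details, and the function name, the `eps` term, and the use of the population standard deviation are choices made for this example.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: (r_i - mean(r)) / (std(r) + eps).

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` (an assumption of this sketch) avoids division by zero when every
    response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled trajectories for one prompt: two judged good, two judged bad.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive a positive advantage and are made more likely during training, while below-average responses receive a negative advantage, which matches the "positive versus negative trajectories" description in the Figure 2 caption.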