Enhancing Spatial Understanding in Image Generation via Reward Modeling

Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou

TL;DR

This work introduces a method that strengthens the spatial understanding of current image generation models, together with a reward model that evaluates the accuracy of spatial relationships in text-to-image generation. The reward model surpasses leading proprietary models on spatial evaluation.

Abstract

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we develop SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, which surpasses even leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.

Paper Structure (31 sections, 6 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Failure of Reward Models on Spatial Understanding. Existing reward models [ma2025hpsv3, kirstain2023pickscore, xu2023imagereward, lin2024vqascore] often assign higher reward values to spatially incorrect images than to spatially correct ones, thereby exposing their limited spatial reasoning capabilities.
  • Figure 2: Limitations of GenEval [ghosh2023geneval] as the reward model. (a) GenEval-based RL training fails to generalize to long prompts involving complex spatial relationships across multiple objects. (b) The rule-based GenEval rewards, which rely on object detectors, often produce incorrect evaluations under visual challenges like occlusion, while modern VLMs can accurately infer the correct response.
  • Figure 3: Overview of our SpatialReward-Dataset.
  • Figure 4: GRPO training pipeline for enhancing spatial understanding. We first sample a group of images from the policy model and use our specialized SpatialScore to rate their spatial accuracy. After ranking the images by these scores, we select the top-$k$ most accurate and bottom-$k$ least accurate examples and convert their scores into advantage signals. The policy model is then updated via policy gradient optimization to directly reward correct spatial layouts and penalize errors, thereby enhancing the base model’s spatial understanding.
  • Figure 5: Advantage bias. For easy prompts with many high-reward samples, some high-quality samples often obtain negative advantages due to the high group mean.
  • ...and 6 more figures
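
The ranking step in Figure 4 and the advantage bias in Figure 5 can be illustrated with a minimal sketch. It assumes the standard GRPO normalization, in which each reward is centered on the group mean and divided by the group standard deviation. The captions do not say whether the normalization runs over the full group or only over the selected top-$k$ and bottom-$k$ samples, so the sketch normalizes over the full group. The function names and reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (reward - group mean) / (group std + eps).

    Assumption: normalization over the full sampled group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_top_bottom_k(rewards, k):
    """Return indices of the k highest- and k lowest-scoring samples."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    top = order[-k:][::-1]   # best first
    bottom = order[:k]       # worst first
    return top, bottom


# Hypothetical reward-model scores for one "easy" prompt:
# every sample is spatially accurate, so the group mean is high.
rewards = [0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.90]

adv = group_advantages(rewards)
top, bottom = select_top_bottom_k(rewards, k=2)

# Samples whose reward is high in absolute terms but below the group mean
# still receive a negative advantage (the bias described in Figure 5).
penalized = [i for i, a in enumerate(adv) if a < 0]
```

In this toy group the mean is about 0.944, so the samples scoring 0.94 and below land in `penalized` even though all of them would count as good generations. That is the situation the Figure 5 caption describes.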