SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

TL;DR

SpatialReward tackles the perception gap in online RL for image editing by introducing explicit spatial reasoning that anchors evaluation to predicted edit regions. Through the Think-with-Boxes mechanism, a two-stream SC/PQ evaluation, and a spatial-prior data pipeline, the authors achieve state-of-the-art alignment on reward benchmarks and substantial improvements in RL performance for image editing. The SpatialReward-260k dataset and MER-Bench provide strong resources to push forward cross-image verification and multi-region reasoning. Empirical results show consistent gains across MMRB2, EditReward-Bench, and OmniGen2-based RL, demonstrating that spatial grounding yields more reliable, efficient, and human-aligned editing feedback. The work highlights the practical impact of structured regional reasoning for scalable, aligned image-editing systems.

Abstract

Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, which surpasses the leading discriminative model and doubles the gain obtained with GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

Paper Structure (51 sections, 1 equation, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Visualizing the Cross-Image Attention Gap. (a) Input Pair: An editing instruction ("Change the fabric to silk") is executed, but with subtle inconsistencies. (b) Baseline (Attention Collapse): Due to source neglect, the baseline fails to attend to the reference image, leading to a blind judgment that incorrectly approves the edit. (c) SpatialReward (Cross-Verification): By anchoring reasoning to explicit spatial regions (red boxes), our model restores cross-image attention, enabling grounded verification that correctly detects the style deviation.
  • Figure 2: Overview of SpatialReward and Comparison with Baseline. (Left) The baseline (EditScore) lacks spatial guidance, leading to Attention Collapse and hallucinatory judgments; specifically, it overlooks the removal of the doctor's mask and the alteration of the patient's pose. (Right) Our SpatialReward employs a Think-with-Boxes mechanism: it first predicts bounding boxes (Edit Region) and injects them as interleaved tokens to anchor the subsequent reasoning. This enforces cross-verification (visualized by rectified attention maps), enabling precise detection of fine-grained inconsistencies (e.g., missing mask, altered pose) and ensuring aligned scoring.
  • Figure 3: Illustration of the Spatial-Prior-Guided Data Pipeline. We construct a highly structured dataset by leveraging spatial priors. This involves spatial grounding via Qwen-3-VL, expert routing for reasoning annotations (using Gemini and GPT series), and a strict alignment verification process.
  • Figure 4: Online RL Training Dynamics on OmniGen2. (a) Reward progression of SpatialReward, providing a steady and dense optimization signal. (b) VIEScore improvement across 1,000 steps. Our Geometric Mean strategy maintains continuous progress and achieves a higher performance peak compared to the Bucket Principle and EditReward.
  • Figure 5: Qualitative Comparison of Online RL Optimization. While EditReward (the strongest discriminative baseline) achieves competitive benchmark scores, its lack of explicit consistency modeling leads to severe content drift during RL optimization, where the policy over-modifies unprompted regions. In contrast, SpatialReward explicitly models both instruction following and source consistency, ensuring balanced optimization that preserves the original context while faithfully executing edits.
  • ...and 10 more figures
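
The Figure 4 caption contrasts a Geometric Mean strategy with the Bucket Principle for turning the two evaluation streams (SC and PQ) into a single RL reward. The exact formulas are not given in this excerpt, so the sketch below is only an illustration under stated assumptions: the Bucket Principle is read as taking the minimum of the two scores, the Geometric Mean as the square root of their product, and the function names and the example scores are hypothetical. It shows one plausible reason a geometric mean could give a denser signal: it still moves when the stronger score improves, whereas a minimum stays flat.

```python
import math


def geometric_mean_reward(sc: float, pq: float) -> float:
    """Combine two non-negative stream scores with a geometric mean.

    Assumption: sc and pq are the two stream scores on a shared
    non-negative scale; the paper's actual aggregation may differ.
    """
    if sc < 0 or pq < 0:
        raise ValueError("scores must be non-negative")
    return math.sqrt(sc * pq)


def bucket_reward(sc: float, pq: float) -> float:
    """Assumed reading of the Bucket Principle: the weakest score decides."""
    return min(sc, pq)


if __name__ == "__main__":
    # Hypothetical scores: only the stronger stream (pq) improves.
    before = (4.0, 9.0)
    after = (4.0, 10.0)

    for name, fn in [("geometric mean", geometric_mean_reward),
                     ("bucket (min)", bucket_reward)]:
        print(f"{name}: {fn(*before):.3f} -> {fn(*after):.3f}")
```

Under these assumptions the minimum reports no change between the two hypothetical edits, while the geometric mean rises slightly, so the policy still receives feedback for the improvement.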