EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng Liu

TL;DR

This work tackles the bottleneck of applying online reinforcement learning to image editing by introducing EditReward-Bench, a comprehensive benchmark for reward-model evaluation, and EditScore, a family of high-fidelity, open-source reward models. EditScore demonstrates strong open-source performance, scalable inference-time self-ensembling, and the ability to surpass larger proprietary baselines on benchmark tasks. The authors validate EditScore as a reliable learning signal that enables stable RL training for editing models like OmniGen2, achieving substantial performance gains across multiple editing benchmarks. The study emphasizes reward fidelity and variance as key drivers of RL success and provides public tools to foster future RL-enabled editing research. The practical impact is a viable, open pathway to improve instruction-guided image editing via RL with robust, scalable reward signals.

Abstract

Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.

Paper Structure

This paper contains 44 sections, 21 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of the annotation process. Annotators are presented with five candidate output images and are asked to rank them according to three evaluation dimensions. The final ranking is determined through consensus among multiple annotators. For example, $1|234|5$ indicates that the first image is preferred over images 2, 3 and 4, which are in turn preferred over image 5.
  • Figure 2: Self-ensembling offers a superior efficiency-performance trade-off compared to simply scaling model parameters. The colored solid lines show the performance scaling of our 7B, 32B, and 72B models as the number of ensemble passes (K) increases. The gray dashed line connects the single-pass (K=1) performance of these models, serving as a baseline for scaling model size alone. The results clearly indicate that scaling the number of forward passes yields a significantly higher accuracy gain per unit of computational cost.
  • Figure 3: EditScore as a superior reward signal for image editing. (a) Using EditScore to select the best sample among multiple outputs effectively improves VIEScore, with OmniGen2 showing the largest gain. (b) Incorporating EditScore into RL training yields stable and significant performance improvements, even surpassing the much larger Qwen2.5-VL-72B. (c) RL training benefits from self-ensembling, which enhances the evaluation accuracy of EditScore across diverse settings.
  • Figure 4: Input images' categories.
  • Figure 5: Qualitative results on image editing task.
  • ...and 1 more figure
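
The ranking notation described in the Figure 1 caption can be unpacked into pairwise preferences. The following is a minimal sketch, not the authors' annotation tooling. It assumes that each digit names one candidate image (reasonable with five candidates), that `|` separates preference tiers from best to worst, and that images within the same tier are treated as ties. The function names `parse_ranking` and `pairwise_preferences` are hypothetical.

```python
from itertools import combinations


def parse_ranking(ranking: str) -> list[list[int]]:
    """Split a ranking such as '1|234|5' into tiers, best tier first.

    Assumption: every character between the '|' separators is a
    single-digit image index.
    """
    tiers = [[int(ch) for ch in tier] for tier in ranking.split("|")]
    if any(not tier for tier in tiers):
        raise ValueError(f"empty tier in ranking: {ranking!r}")
    return tiers


def pairwise_preferences(ranking: str) -> list[tuple[int, int]]:
    """Return (preferred, less_preferred) pairs implied by the ranking.

    Images in the same tier are treated as ties and produce no pair.
    """
    tiers = parse_ranking(ranking)
    pairs = []
    # combinations() keeps the original order, so `better` always
    # precedes `worse` in the ranking string.
    for better, worse in combinations(tiers, 2):
        pairs.extend((w, l) for w in better for l in worse)
    return pairs


print(pairwise_preferences("1|234|5"))
```

For the caption's example `1|234|5`, this yields seven pairs: image 1 over each of images 2 to 5, and each of images 2, 3, and 4 over image 5.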