EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, Wenhu Chen

TL;DR

EditReward addresses the bottleneck of reward signals in open-source instruction-guided image editing by creating EditReward-Data, a large-scale, expert-annotated preference dataset, and training a VLM-based reward model with multi-dimensional uncertainty-aware ranking. It introduces EditReward-Bench for robust evaluation and demonstrates state-of-the-art human alignment on GenAI-Bench, AURORA-Bench, and ImagenHub, as well as practical gains in data curation by filtering noisy datasets to improve downstream editors. The work provides open resources (dataset, model, benchmark) and shows potential for reinforcement learning-based post-training and test-time scaling of editing models. Its methodological innovations—multi-dimensional uncertainty modeling, tie-disentanglement, and structured multi-way preferences—offer a scalable path to higher-quality instruction-following image edits.

Abstract

Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models like GPT-Image-1, Seedream, and Google-Nano-Banana have shown highly promising progress. However, open-source models are still lagging. The main bottleneck is the lack of a reliable reward model to scale up high-quality synthetic training data. To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset of over 200K preference pairs, meticulously annotated by trained experts following a rigorous protocol. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our new EditReward-Bench, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset from the existing noisy ShareGPT-4o-Image dataset. We train Step1X-Edit on the selected subset, which shows significant improvement over training on the full set. This demonstrates EditReward's ability to serve as a reward model to scale up high-quality training data for image editing. Its strong alignment also suggests potential for advanced applications like reinforcement learning-based post-training and test-time scaling of image editing models. EditReward and its training dataset will be released to help the community build higher-quality image editing training datasets.
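
The data-curation step described in the abstract (scoring a noisy dataset with a reward model and keeping only the best samples) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout, the `score_fn` callable standing in for the reward model, and the `keep_fraction` parameter are all hypothetical, and the abstract does not state what selection rule or subset size was actually used.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical record layout: any dict describing one editing sample
# (e.g. instruction, source image path, edited image path).
Sample = Dict[str, str]


def select_top_fraction(
    samples: Sequence[Sample],
    score_fn: Callable[[Sample], float],
    keep_fraction: float,
) -> List[Sample]:
    """Keep the highest-scoring fraction of samples, preserving dataset order.

    score_fn is a stand-in for a reward model; higher means better.
    keep_fraction is an assumed knob, not a value taken from the paper.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not samples:
        return []

    # Score every sample once and remember its original position.
    scored = [(score_fn(sample), index) for index, sample in enumerate(samples)]

    # Sort by descending score; ties are broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    # Always keep at least one sample when the dataset is non-empty.
    keep_count = max(1, int(len(samples) * keep_fraction))

    # Restore the original dataset order for the retained samples.
    kept_indices = sorted(index for _, index in scored[:keep_count])
    return [samples[index] for index in kept_indices]


if __name__ == "__main__":
    toy_samples = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]
    toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}
    subset = select_top_fraction(toy_samples, lambda s: toy_scores[s["id"]], 0.5)
    print([s["id"] for s in subset])
```

In practice `score_fn` would wrap a call to the trained reward model on the (instruction, source image, edited image) triple; a fixed score threshold could be used in place of a fraction.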

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: An overview of our framework, illustrating the construction of the EditReward-Data and the subsequent training of our reward model, EditReward. Top: The data pipeline, where we generate a diverse candidate pool from multiple state-of-the-art models and collect multi-dimensional human preference annotations. Bottom: The model pipeline, where EditReward is optimized on EditReward-Data using our proposed Multi-Dimensional Uncertainty-Aware Ranking Loss for training, followed by its use in inference.
  • Figure 2: Statistics of our EditReward-Data and EditReward-Bench.
  • Figure 3: Representative examples of our reward model aligning with human judgments.
  • Figure 4: Annotation Interface
  • Figure 5: Training loss curves and validation-set accuracy with and without Disentangling Ties via Dimensional Preference.