Multi-Reward as Condition for Instruction-based Image Editing

Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu

TL;DR

This work tackles noisy supervision in instruction-based image editing by introducing RewardEdit-20K, a multi-perspective reward dataset, and Real-Edit, a real-image evaluation benchmark. It presents a Multi-Reward Condition (MRC) framework that encodes reward scores and text feedback as embeddings and injects them into both the latent noise and the U-Net to guide editing. The authors show that multi-reward conditioning improves instruction following, detail preservation, and generation quality on both InsPix2Pix and SmartEdit, achieving state-of-the-art performance in GPT-4o-based and human evaluations. The approach offers a practical way to improve editing quality without requiring perfect ground-truth edited images, with implications for robust, user-driven content modification.

Abstract

High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text-to-image generative models (e.g., Stable Diffusion, DALL-E), which are not trained for image editing. Accordingly, these datasets suffer from inaccurate instruction following, poor detail preservation, and generation artifacts. In this paper, we propose to address the training data quality issue with multi-perspective reward data instead of refining the ground-truth image quality. 1) We first design a quantitative metric system based on a best-in-class LVLM (Large Vision Language Model), i.e., GPT-4o in our case, to evaluate the generation quality from three perspectives, namely, instruction following, detail preserving, and generation quality. For each perspective, we collect a quantitative score in the range $0\sim 5$ and descriptive text feedback on the specific failure points in the ground-truth edited images, resulting in a high-quality editing reward dataset, i.e., RewardEdit-20K. 2) We further propose a novel training framework that seamlessly integrates the metric output, regarded as multi-reward, into editing models so that they can learn from the imperfect training triplets. During training, the reward scores and text descriptions are encoded as embeddings and fed into both the latent space and the U-Net of the editing models as auxiliary conditions. 3) We also build a challenging evaluation benchmark with real-world images/photos and diverse editing instructions, named Real-Edit. Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines, i.e., InsPix2Pix and SmartEdit. Code is released at https://github.com/bytedance/Multi-Reward-Editing.
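
To make the shape of the reward data concrete, the following is a minimal sketch of what one reward-annotated triplet could look like. The class names, field names, file names, and example values are hypothetical and are not taken from the released dataset; only the three perspectives and the $0\sim 5$ score range come from the abstract.

```python
from dataclasses import dataclass

# The three evaluation perspectives named in the abstract.
PERSPECTIVES = ("instruction_following", "detail_preserving", "generation_quality")


@dataclass(frozen=True)
class PerspectiveReward:
    """Score plus text feedback for one perspective (hypothetical layout)."""
    score: float   # quantitative score in the range 0..5
    feedback: str  # description of specific failure points

    def __post_init__(self):
        if not 0 <= self.score <= 5:
            raise ValueError(f"score must be in [0, 5], got {self.score}")


@dataclass(frozen=True)
class RewardRecord:
    """One editing triplet with its multi-perspective rewards (hypothetical layout)."""
    instruction: str
    original_image: str  # assumed to be a file path
    edited_image: str    # assumed to be a file path
    rewards: dict        # perspective name -> PerspectiveReward

    def __post_init__(self):
        missing = set(PERSPECTIVES) - set(self.rewards)
        if missing:
            raise ValueError(f"missing perspectives: {sorted(missing)}")

    def score_vector(self):
        """Return the scores in the fixed PERSPECTIVES order."""
        return [self.rewards[p].score for p in PERSPECTIVES]


# Made-up example values, for illustration only.
record = RewardRecord(
    instruction="make the sky purple",
    original_image="original.png",
    edited_image="edited.png",
    rewards={
        "instruction_following": PerspectiveReward(4, "Sky is purple, but the clouds also changed."),
        "detail_preserving": PerspectiveReward(3, "Building textures were altered."),
        "generation_quality": PerspectiveReward(5, "No visible artifacts."),
    },
)
print(record.score_vector())
```

Keeping the score and the text feedback together per perspective mirrors the abstract's description, in which both are later encoded as embeddings and used as auxiliary conditions.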

Paper Structure

This paper contains 37 sections, 6 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Existing image editing datasets and our method. Best viewed with zoom-in.
  • Figure 2: Generation process of reward data. Given the editing triplets, reward data was generated using GPT-4o by setting prompts from different perspectives.
  • Figure 3: Distribution of reward score.
  • Figure 4: Word cloud of reward text.
  • Figure 5: The overall framework of our approach. The original image $x$ is first encoded into an image condition by the VAE encoder. This image condition $c_I$ is then concatenated with latent noise $Z_t$ to serve as the query for the reward encoder, with the reward condition $c_R$ as the key/value. The resulting latent noise, containing reward information, is used as the input for the U-Net module. Meanwhile, the instruction is encoded into a text condition $c_T$ by the text encoder, which is fed into each block of the U-Net. To further enhance reward guidance, we incorporate the reward condition after each block. Finally, the U-Net's output is decoded by the VAE decoder into the edited image $y$.
  • ...and 11 more figures
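
The reward-encoder step described in the Figure 5 caption (query built from the latent noise $Z_t$ concatenated with the image condition $c_I$, key/value taken from the reward condition $c_R$) can be illustrated with a toy single-head cross-attention. This is a dependency-free sketch under several assumptions that the caption does not state: the tensor sizes, the random projection matrices, the single attention head, and the residual addition back onto the latent noise are all hypothetical. It also omits the second injection path, in which the reward condition is incorporated after each U-Net block.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def reward_cross_attention(z_t, c_i, c_r, w_q, w_k, w_v):
    """Attend from (latent noise + image condition) queries to reward tokens."""
    # Query input: per-position concatenation of latent noise and image condition.
    query_in = [z_row + c_row for z_row, c_row in zip(z_t, c_i)]
    q = matmul(query_in, w_q)  # (positions, d)
    k = matmul(c_r, w_k)       # (reward tokens, d)
    v = matmul(c_r, w_v)       # (reward tokens, latent channels)
    scale = math.sqrt(len(q[0]))
    # Scaled dot-product scores, then softmax over the reward tokens.
    attn = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]) for qi in q]
    update = matmul(attn, v)   # (positions, latent channels)
    # Assumption: the reward information is added residually to the latent noise.
    z_out = [[z + u for z, u in zip(z_row, u_row)] for z_row, u_row in zip(z_t, update)]
    return z_out, attn


rng = random.Random(0)
z_t = random_matrix(4, 4, rng)  # 4 spatial positions, 4 latent channels
c_i = random_matrix(4, 4, rng)  # image condition with the same layout
c_r = random_matrix(3, 6, rng)  # 3 reward tokens (e.g., one per perspective), dim 6
w_q = random_matrix(8, 8, rng)  # projects the 8-dim concatenated query input
w_k = random_matrix(6, 8, rng)
w_v = random_matrix(6, 4, rng)

z_out, attn = reward_cross_attention(z_t, c_i, c_r, w_q, w_k, w_v)
print(len(z_out), len(z_out[0]), len(attn[0]))
```

Each row of `attn` is a distribution over the reward tokens, so every spatial position mixes reward information in its own proportions before the result is passed on as the U-Net input.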