CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi, Lei Zhang

TL;DR

A region-based regularizer is proposed to overcome the spatial-agnostic nature of the rewards: it preserves non-edited regions for high-reward samples while encouraging editing effects for low-reward samples.

Abstract

Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high-quality samples are curated as the training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer that aims to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench and introduce pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only editing scores competitive with state-of-the-art models, but also significantly better content consistency, as measured by PSNR/SSIM metrics and human subjective ratings.

Paper Structure (36 sections, 29 equations, 22 figures, 6 tables, 1 algorithm)

Figures (22)

  • Figure 1: Performance comparison on GEdit-Bench and ImgEdit-Bench. The PSNR values are calculated on the non-edit region. The results illustrate a clear conflict between editing quality (MLLM Score) and content consistency (PSNR) of existing editing models. Notably, with our CoCoEdit, both the editing quality and content consistency can be improved.
  • Figure 2: Visual examples of state-of-the-art image editing models. We see that models with strong editing capabilities may fail to preserve non-edited objects (e.g., the pillow in the first row), and existing post-training methods can further degrade the consistency of original content. In contrast, our CoCoEdit can improve both editing effects and content consistency.
  • Figure 3: Data construction pipeline of our CoCoEdit-40K. In the blue part, we annotate the editing masks, augment them and refine the instruction. In the green part, we filter the augmented samples using Qwen2.5-VL based on the alignment quality between instructions and inputs, and the accuracy of masks.
  • Figure 4: Statistics of CoCoEdit-40K. We only consider local editing types for CoCoEdit training, while the finetuned models show generalization capability to global editing types.
  • Figure 5: Illustration of our CoCoEdit framework, which consists of three stages in each iteration. (1) Sample Collection. Given the input image and instruction, we collect multiple samples for reward calculation. (2) Reward Collection. In addition to the MLLM reward, we propose a pixel-level similarity reward that computes PSNR and SSIM in the non-edit region to capture subtle differences that the MLLM overlooks. (3) Policy Optimization. The current policy model $v_{\theta}$ is trained with the negative-aware loss terms $\mathcal{L}^+,\mathcal{L}^-$ and our region-based regularizers $L_{ner}^+, L_{er}^-$, which are weighted by the reward signal $r$. Here $L_{ner}^+$ constrains the similarity in the non-edit region and $L_{er}^-$ promotes the editing effects in the edit region.
  • ...and 17 more figures
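
The captions above refer to PSNR computed on the non-edit region only. As a rough illustration of that idea (not the authors' implementation), the sketch below computes PSNR over the pixels outside an edit mask. The function name `masked_psnr`, the boolean mask convention (True marks the intended edit region), and the 0 to 255 value range are assumptions made for this example; the paper's exact formulation may differ.

```python
import numpy as np


def masked_psnr(source, edited, edit_mask, max_value=255.0):
    """PSNR between source and edited images over the non-edit region only.

    source, edited: arrays of shape (H, W, C) with values in [0, max_value].
    edit_mask: boolean array of shape (H, W), True where editing is intended
    (an assumed convention for this sketch).
    """
    source = np.asarray(source, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    keep = ~np.asarray(edit_mask, dtype=bool)  # pixels that should stay unchanged
    if not keep.any():
        raise ValueError("edit mask covers the whole image; no non-edit region")
    # Boolean indexing on the first two axes yields (num_kept_pixels, C).
    mse = np.mean((source[keep] - edited[keep]) ** 2)
    if mse == 0.0:
        return float("inf")  # non-edit region is pixel-identical
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy data: a 4x4 RGB image whose top-left 2x2 block is the intended edit region.
src = np.full((4, 4, 3), 100.0)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True

clean_edit = src.copy()
clean_edit[:2, :2] = 200.0   # changes only inside the mask

leaky_edit = clean_edit.copy()
leaky_edit[3, 3] += 10.0     # also alters one pixel outside the mask

print(masked_psnr(src, clean_edit, mask))  # infinite: non-edit region untouched
print(masked_psnr(src, leaky_edit, mask))  # finite: the leaked change is penalized
```

Because the edit region is excluded, a large intended change does not lower the score; only changes that leak into the non-edit region do. This is the property that lets such a metric complement an MLLM score that judges the edit as a whole.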