InstructRL4Pix: Training Diffusion for Image Editing by Reinforcement Learning

Tiancheng Li, Jinxiu Liu, Huajun Chen, Qi Liu

TL;DR

InstructRL4Pix is a diffusion-model editing framework guided by reinforcement learning that uses attention-map alignment to localize edits to the target objects described in a natural-language instruction. It combines an Attention Map Loss with a CLIP Loss and optimizes them with proximal policy optimization (PPO) within a multi-step MDP formulation, aiming for precise, instruction-aligned edits that preserve most of the original image content. The approach is evaluated on object insertion, removal, replacement, and transformation tasks using the MagicBrush-pretraining data, and reports state-of-the-art or competitive results in pixel-level fidelity and perceptual quality, with ablations showing the benefit of the joint reward design. The RL-guided framework reduces reliance on external GPT-3/prompt-based data generation, enabling unsupervised optimization of editing goals and offering a scalable path toward robust vision-language editing.

Abstract

Instruction-based image editing has made great progress in using natural human language to manipulate the visual content of images. However, existing models are limited by the quality of their datasets and cannot accurately localize editing regions in images with complex object relationships. In this paper, we propose the Reinforcement Learning Guided Image Editing Method (InstructRL4Pix), which trains a diffusion model to generate images guided by the attention maps of the target object. Our method uses the distance between attention maps as the reward function, and maximizes the output of this reward model by fine-tuning the diffusion model with proximal policy optimization (PPO). We evaluate our model on object insertion, removal, replacement, and transformation. Experimental results show that InstructRL4Pix breaks through the limitations of traditional datasets, using unsupervised learning to optimize editing goals and achieve accurate image editing based on natural human commands.
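
The reward design described above can be illustrated with a minimal sketch. It is not the authors' implementation: the mean squared distance, the `clip_weight` parameter, the way a CLIP similarity score enters the reward, and all function names are assumptions made for illustration. The last function is the standard PPO clipped surrogate for a single sample.

```python
import math


def attention_map_reward(pred_map, target_map):
    """Negative mean squared distance between two equally sized 2D maps.

    Assumption: a smaller distance should give a larger reward, so the
    distance is negated. The source does not name the distance metric.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred_map, target_map):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return -total / count


def combined_reward(pred_map, target_map, clip_similarity, clip_weight=0.5):
    """Joint reward: attention-map term plus a weighted CLIP similarity.

    `clip_similarity` stands in for a precomputed image-text score;
    `clip_weight` is a hypothetical hyperparameter.
    """
    return attention_map_reward(pred_map, target_map) + clip_weight * clip_similarity


def ppo_clipped_objective(new_log_prob, old_log_prob, advantage, clip_range=0.2):
    """Standard PPO clipped surrogate objective for one sample."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)
```

In this sketch, an edit whose attention map matches the target exactly gets an attention reward of 0, the maximum possible value. The clipping in the PPO objective limits how far a single update can move the policy away from the one that generated the samples.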

Paper Structure (34 sections, 20 equations, 4 figures, 4 tables)

This paper contains 34 sections, 20 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We introduce the Reinforcement Learning Guided Image Editing Method (InstructRL4Pix), which optimizes instruction-based image editing in an unsupervised manner for various editing tasks. The bottom is the edit instruction, the middle is the input image, and the top is the output image after InstructRL4Pix editing.
  • Figure 2: Overview of the Reinforcement Learning Guided Image Editing Method (InstructRL4Pix), which trains a diffusion model to generate images guided by the attention maps of the target object. InstructRL4Pix breaks through the limitations of traditional datasets, using unsupervised learning to optimize editing goals and achieve accurate image editing based on natural human commands.
  • Figure 3: Sample progressions for the same cue and random seeds during training. The attention maps of the samples tend to localize more faithfully to the correct editing region.
  • Figure 4: