ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui

TL;DR

This work proposes ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents.

Abstract

With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.

Paper Structure

This paper contains 31 sections, 5 equations, 8 figures, 4 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Overview of ImageEdit-R1: ① The decomposition agent analyzes the user instruction and input image to extract a structured representation of the desired edits, including editing actions, subjects, and goals. ② The sequencing agent arranges these components into an ordered list of sub-requests, enabling interpretable and modular execution. ③ The editing agent, built on a diffusion model, performs the actual image edits by sequentially applying the sub-requests.
  • Figure 2: Representative examples demonstrating the performance of ImageEdit-R1 compared to baselines on complex editing tasks.
  • Figure 3: Training rewards of the decomposition agent on Qwen2.5-VL-7B-Instruct.
  • Figure 4: Human vs. VLM-based evaluation. Top rows show average scores; bottom row shows their correlation.
  • Figure 5: Comparison of multi-turn and single-turn strategies in the image editing process.
  • ...and 3 more figures
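
The three-stage flow described in the Figure 1 caption (decompose, sequence, then edit sequentially) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `SubRequest`, `ACTION_PRIORITY`, `sequence`, `run_pipeline`, and the toy stubs are all hypothetical names. In the paper the decomposition agent is a vision-language model and the editing agent is built on a diffusion model; here the decomposition agent is replaced by a hard-coded stub, the sequencing agent by a fixed priority rule, and the "image" by a plain list that logs the edits applied to it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SubRequest:
    """One structured edit: an action applied to a subject toward a goal."""
    action: str
    subject: str
    goal: str


# Hypothetical ordering rule (an assumption, not from the paper):
# removals first, then additions, then global restyling.
ACTION_PRIORITY: Dict[str, int] = {"remove": 0, "add": 1, "restyle": 2}

Image = List[str]  # toy stand-in: the "image" is just a log of applied edits


def sequence(sub_requests: List[SubRequest]) -> List[SubRequest]:
    """Stand-in for the sequencing agent: order sub-requests by priority.

    Unknown actions sort last; sorted() is stable, so ties keep input order.
    """
    fallback = len(ACTION_PRIORITY)
    return sorted(sub_requests, key=lambda r: ACTION_PRIORITY.get(r.action, fallback))


def run_pipeline(
    image: Image,
    instruction: str,
    decompose: Callable[[Image, str], List[SubRequest]],
    apply_edit: Callable[[Image, SubRequest], Image],
) -> Tuple[Image, List[SubRequest]]:
    """Decompose the instruction, order the sub-requests, apply them one by one."""
    ordered = sequence(decompose(image, instruction))
    for request in ordered:
        image = apply_edit(image, request)
    return image, ordered


def toy_decompose(image: Image, instruction: str) -> List[SubRequest]:
    """Hard-coded stub standing in for a VLM-based decomposition agent."""
    return [
        SubRequest("restyle", "scene", "warm sunset lighting"),
        SubRequest("remove", "parked car", "empty street"),
        SubRequest("add", "bicycle", "bicycle leaning on the wall"),
    ]


def toy_apply_edit(image: Image, request: SubRequest) -> Image:
    """Stub standing in for a diffusion-based editing agent: record the edit."""
    return image + [f"{request.action}:{request.subject}"]


if __name__ == "__main__":
    edited, plan = run_pipeline(
        [], "Clear the street, add a bike, make it sunset",
        toy_decompose, toy_apply_edit,
    )
    for step in plan:
        print(step)
    print(edited)
```

Passing the agents in as callables keeps the control flow separate from the models behind each stage, so a real decomposition model or editor could be swapped in without touching `run_pipeline`.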