Table of Contents
Fetching ...

IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection

Fei Shen, Chengyu Xie, Lihong Wang, Zhanyi Zhang, Xin Jiang, Xiaoyu Du, Jinhui Tang

Abstract

Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Due to a lack of context-awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, ultimately resulting in severe structural distortion of the generated images. For that, we propose \textbf{IMAGAgent}, a multi-turn image editing agent framework based on a "plan-execute-reflect" closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint-aware planning module that leverages a vision-language model (VLM) to precisely decompose complex natural language instructions into a series of executable sub-tasks, governed by target singularity, semantic atomicity, and visual perceptibility. Then, the tool-chain orchestration module dynamically constructs execution paths based on the current image, the current sub-task, and the historical context, enabling adaptive scheduling and collaborative operation among heterogeneous operation models covering image retrieval, segmentation, detection, and editing. Finally, we devise a multi-expert collaborative reflection mechanism where a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, simultaneously triggering fine-grained self-correction and recording feedback outcomes to optimize future decisions. Extensive experiments on our constructed \textbf{MTEditBench} and the MagicBrush dataset demonstrate that IMAGAgent achieves performance significantly superior to existing methods in terms of instruction consistency, editing precision, and overall quality. The code is available at https://github.com/hackermmzz/IMAGAgent.git.

IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection

Abstract

Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Due to a lack of context-awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, ultimately resulting in severe structural distortion of the generated images. For that, we propose \textbf{IMAGAgent}, a multi-turn image editing agent framework based on a "plan-execute-reflect" closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint-aware planning module that leverages a vision-language model (VLM) to precisely decompose complex natural language instructions into a series of executable sub-tasks, governed by target singularity, semantic atomicity, and visual perceptibility. Then, the tool-chain orchestration module dynamically constructs execution paths based on the current image, the current sub-task, and the historical context, enabling adaptive scheduling and collaborative operation among heterogeneous operation models covering image retrieval, segmentation, detection, and editing. Finally, we devise a multi-expert collaborative reflection mechanism where a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, simultaneously triggering fine-grained self-correction and recording feedback outcomes to optimize future decisions. Extensive experiments on our constructed \textbf{MTEditBench} and the MagicBrush dataset demonstrate that IMAGAgent achieves performance significantly superior to existing methods in terms of instruction consistency, editing precision, and overall quality. The code is available at https://github.com/hackermmzz/IMAGAgent.git.

Paper Structure

This paper contains 17 sections, 10 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: (a) Existing methods suffer from error accumulation due to isolated execution. (b) IMAGAgent ensures structural and intentional consistency through closed-loop feedback.
  • Figure 2: IMAGAgent Framework. We reformulate multi-turn image editing into a closed-loop process with three modules: (1) constraint-aware planning for VLM-based instruction decomposition; (2) tool-chain orchestration for dynamic scheduling of vision tools; and (3) multi-expert collaborative reflection for fine-grained critique and self-correction.
  • Figure 3: Statistics of MTEditBench. The figure displays the distribution of editing operation types and sequence lengths.
  • Figure 4: Qualitative results on MTEditBench. Comparisons show IMAGAgent achieves top visual quality.
  • Figure 5: Qualitative results on MagicBrush. IMAGAgent outperforms SOTA in complex spatial and attribute constraints.
  • ...and 8 more figures