RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia

TL;DR

IV-Complexity captures the challenge of performing precise edits in cluttered scenes under complex instructions. RePlan introduces a region-grounded, plan–execute framework that couples a vision–language planner with a diffusion editor. The editor applies edits through a training-free attention-region injection mechanism, and the planner is strengthened with GRPO reinforcement learning on ~1k instruction-only examples. The paper also proposes IV-Edit, a benchmark designed to stress fine-grained grounding and knowledge-driven edits. Across IV-Complexity settings, RePlan achieves higher regional precision and fidelity than baselines trained on far larger datasets, and it performs multi-region edits in a single pass rather than through iterative inpainting.

Abstract

Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
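The abstract mentions GRPO-based reinforcement learning for the planner. As a minimal sketch of the group-relative advantage step that GRPO is generally built on, the snippet below normalizes rewards within one group of sampled outputs. The reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are assumptions for illustration; the reward design actually used for RePlan is not described in this section.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a sample is
    scored relative to its siblings rather than against a learned value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled for a single instruction.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Plans scoring above the group mean receive a positive advantage and those below receive a negative one, which is what lets the method learn from instruction-only examples without a separate critic.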

Paper Structure

This paper contains 19 sections, 8 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Overview of our RePlan framework. The bottom part of the figure shows the overall architecture. Given an input image and a text instruction, the VLM analyzes them via chain-of-thought reasoning and produces region-aligned guidance, where each guidance item consists of a region bounding box and its editing hint. Each hint is further encoded by a text encoder into a feature token, while image patch tokens are obtained by VAE encoding and grouped according to the region bounding boxes. A group-specific attention mechanism, detailed in Figure 3, is proposed to allow MMDiT to generate the final edited image. The top part of the figure presents an editing example.
  • Figure 2: Example of the VLM output format.
  • Figure 3: Attention rule visualization. We use different highlight colors to indicate different rules, which correspond to Hint isolation, Region constraint, Background constraint and Image–latent full interaction.
  • Figure 4: Overview of our IV-Edit Benchmark. (a) and (b) show the distributions of referring types and task types across the dataset, respectively. IV-Edit is explicitly designed to reflect the IV-Complexity challenge, where user instructions require aligning fine-grained language with rich and diverse visual contexts. (c) presents visual examples spanning a wide range of real-world scenarios and fine-grained instruction intents, including spatial, structural, and reasoning-intensive edits. Each instruction is decomposed into a referring expression and a task type, reflecting the need for both grounded understanding and visual transformation.
  • Figure 5: Editing results comparison. We use Flux.1 Kontext dev as the backbone of RePlan. Notably, GPT-4o enforces fixed aspect ratios, leading to unavoidable cropping for non-standard images.
  • ...and 10 more figures
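
The captions for Figures 1 and 3 describe grouping image patch tokens by region bounding box and restricting which tokens may attend to which hints. The sketch below is one possible reading of that idea, not the paper's implementation: the patch size, the centre-in-box assignment rule, and the helper names `group_patches` and `hint_to_patch_mask` are all hypothetical, and only the region and background constraints are modelled (each region's patches see their own hint, background patches see no hint).

```python
def group_patches(bboxes, image_w, image_h, patch=16):
    """Assign every patch to a group id: 0 = background, i = i-th bounding box.

    A patch joins the first box that contains its centre (an assumed rule).
    Boxes are (x0, y0, x1, y1) in pixels; patches are listed in row-major order.
    """
    cols, rows = image_w // patch, image_h // patch
    groups = []
    for r in range(rows):
        for c in range(cols):
            cx, cy = (c + 0.5) * patch, (r + 0.5) * patch
            gid = 0  # background unless some box contains the centre
            for i, (x0, y0, x1, y1) in enumerate(bboxes, start=1):
                if x0 <= cx < x1 and y0 <= cy < y1:
                    gid = i
                    break
            groups.append(gid)
    return groups


def hint_to_patch_mask(groups, num_regions):
    """mask[h][p] is True when patch p may attend to the hint of region h + 1.

    Background patches (group 0) are False in every row, so they see no hint.
    """
    return [[g == h for g in groups] for h in range(1, num_regions + 1)]


# Hypothetical 64x64 image with two regions, giving a 4x4 grid of patches.
bboxes = [(0, 0, 32, 32), (32, 32, 64, 64)]
groups = group_patches(bboxes, image_w=64, image_h=64)
mask = hint_to_patch_mask(groups, num_regions=len(bboxes))
```

In a real editor such a boolean mask would be combined with the remaining rules named in Figure 3 (hint isolation and full image–latent interaction) before being passed to the attention layers; those parts are left out here because their exact definitions are not given in this section.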