MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng, Hang Hua, Jiebo Luo

TL;DR

MIRA is a lightweight multimodal reasoning agent that reframes instruction-guided image editing as an iterative perception–reasoning–action loop. Trained on a new 150K multimodal tool-use dataset (MIRA-Editing) with a two-stage SFT + GRPO pipeline, MIRA predicts atomic edits step by step and uses visual feedback from each intermediate image to decide the next edit, delegating execution to open-source editors. Across multiple editing backbones it improves both semantic consistency and perceptual quality, reaching performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana. Because the loop is closed, the agent can correct failed intermediate edits, and a dynamic termination mechanism lets it stop once the instruction is satisfied. The result is a plug-and-play framework for complex editing instructions that works with existing open-source editors.

Abstract

Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
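
The loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `iterative_edit`, `toy_agent`, `toy_editor`, the `STOP` string, and the `max_steps` cap are all hypothetical, and the "image" is a plain tuple of applied edits standing in for real pixels. The agent signature (original image, current image, full instruction) is an assumption about what the agent observes at each step.

```python
from typing import Callable, List, Tuple

# Hypothetical termination token; the real agent's stop signal may differ.
STOP = "<Stop>"

Image = Tuple[str, ...]  # toy stand-in: an "image" is the tuple of edits applied so far
Agent = Callable[[Image, Image, str], str]   # (original, current, instruction) -> atomic edit
Editor = Callable[[Image, str], Image]       # (current, atomic edit) -> edited image


def iterative_edit(image: Image, instruction: str, agent: Agent,
                   editor: Editor, max_steps: int = 8) -> Tuple[Image, List[str]]:
    """Run a perception-reasoning-action loop until the agent emits STOP."""
    current = image
    history: List[str] = []
    for _ in range(max_steps):  # safety cap (an assumption, not from the paper)
        # Perception + reasoning: the agent inspects the current visual state.
        atomic = agent(image, current, instruction)
        if atomic == STOP:
            break
        # Action: an external editor executes the single atomic edit.
        current = editor(current, atomic)
        history.append(atomic)
    return current, history


def toy_agent(original: Image, current: Image, instruction: str) -> str:
    """Toy policy: emit the first sub-instruction not yet visible in the image."""
    pending = [p.strip() for p in instruction.split(" and ")
               if p.strip() not in current]
    return pending[0] if pending else STOP


def toy_editor(image: Image, atomic: str) -> Image:
    """Toy editor: 'applies' an edit by recording it in the image state."""
    return image + (atomic,)


result, steps = iterative_edit(
    (), "make the floor wooden and paint the stove black", toy_agent, toy_editor
)
print(steps)
```

Because the agent re-reads the current image at every step, an edit the editor failed to apply would still appear as pending on the next iteration, which is the intuition behind closed-loop error mitigation.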

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Qualitative comparison of MIRA against leading proprietary and open-source image editing models on complex instructions. The rightmost column illustrates MIRA's unique iterative reasoning and editing process, displaying the intermediate visual results after each step of its perception-reasoning-action loop.
  • Figure 2: Workflow of our multimodal reasoning and editing agent MIRA. Given an input image and a complex natural-language instruction, MIRA engages in an iterative perception–reasoning–action loop. At each step, the agent analyzes the current visual state and textual context to generate an atomic edit instruction, which is executed by an external image-editing model. The updated image is fed back into the agent to guide the next step. This loop continues until the full instruction is satisfied, yielding the final edited result.
  • Figure 3: Three types of editing samples in MIRA-Editing.
  • Figure 4: Overview of the MIRA Training Pipeline. The training pipeline comprises two stages: (1) Supervised Fine-Tuning and (2) Reinforcement Learning. Stage 1 fine-tunes Qwen2.5-VL-7B-Instruct on paired samples of the input image, the previously edited image, and the complex instruction to initialize the policy model. Stage 2 applies GRPO to further refine the policy, using a composite reward function that couples an image editing model with an editing reward model to score edit quality and provide optimization signals.
  • Figure 5: Qualitative Case Study for MIRA's Error Mitigation Capability. Atomic 1: Replace the floor to wooden floor; Atomic 2: Change the color of the white cabinet to wooden brown; Atomic 3: Change the color of the white stove to black; Atomic 4: Change the color of the wooden refrigerator to white; Atomic 5: Change the color of the white stove to black; Atomic 6: 〈Stop〉.
  • ...and 2 more figures
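
The Figure 4 caption names GRPO as the Stage 2 algorithm. As a rough illustration of the group-relative idea in its commonly described form, the sketch below normalizes the rewards of several sampled outputs for the same input against the group mean and standard deviation. The function name `group_relative_advantages`, the `eps` constant, and the example reward values are assumptions for illustration; the paper's actual composite reward and its weighting are not reproduced here.

```python
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its sampling group: (r - mean) / (std + eps)."""
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    # eps keeps the division finite when every sample received the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for four candidate atomic edits sampled for one state.
example_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(example_rewards)
print(advantages)
```

Candidates scored above the group average receive positive advantages and are reinforced, while those below average are discouraged, so no separate value network is needed to provide a baseline.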