
CoEditor++: Instruction-based Visual Editing via Cognitive Reasoning

Minheng Ni, Yutao Fan, Zhengyuan Yang, Yeli Shen, Yuxiang Wei, Yaowen Zhang, Lijuan Wang, Lei Zhang, Wangmeng Zuo

TL;DR

CoEditor++ is a cognitively structured, training-free framework that decomposes editing into "what to edit" and "how to edit" through two cognitive stages with a reflective self-selection mechanism, enabling robust, fine-grained, and interpretable editing.

Abstract

Recent advances in large multimodal models (LMMs) have enabled instruction-based image editing, allowing users to modify visual content via natural language descriptions. However, existing approaches often struggle with high-level semantic reasoning and visual consistency, particularly under ambiguous or complex instructions. To address these challenges, we propose CoEditor++, a cognitively structured, training-free framework that decomposes editing into "what to edit" and "how to edit" through two cognitive stages with a reflective self-selection mechanism, enabling robust, fine-grained, and interpretable editing. Built entirely from open-sourced components, CoEditor++ requires no additional training or fine-tuning, ensuring transparency and cross-domain applicability. We evaluate CoEditor++ on SmartEdit, a widely used benchmark for general editing, and AltBear, a privacy and compliance-oriented benchmark. Experimental results show that CoEditor++ achieves state-of-the-art performance in both general editing and responsible editing tasks compared with open-sourced models that require training on specialized editing datasets, while maintaining significantly higher visual consistency. When compared with closed-source models such as Nano Banana Pro or GPT-4o, CoEditor++ achieves comparable instruction following while still substantially outperforming them in visual consistency. Extensive ablation studies confirm that the effectiveness of CoEditor++ stems from its structured cognitive design rather than from any specific model component. Our findings point to the potential of cognitive-centric instruction-based image editing.

Paper Structure (44 sections, 10 equations, 9 figures, 6 tables)


Figures (9)

  • Figure 1: Example of high-level semantic reasoning and fine-grained visual consistency. Although the model followed the instructions, it failed to accurately identify the content that needed to be modified and made unnecessary changes to the image background. These problems become increasingly pronounced during continuous editing.
  • Figure 2: Overview of CoEditor++, which comprises two cognitive stages and a reflective self-selection step that picks the best intermediate result. In the localization cognitive process (LCP), a large multimodal model (LMM) jointly processes the input image $x$ and user instruction $c$ to generate a set of localization prompts $\mathbf{P}_{\mathrm{loc}}$, each describing a candidate region. These prompts are mapped to localization candidates $\mathbf{M}$ via a segmentation model. In the modification cognitive process (MCP), the LMM formulates a set of modification prompts $\mathbf{P}_{\mathrm{mdf}}$ based on the selected mask $m^{*}$; these prompts guide an inpainting model to synthesize modification candidates $\mathbf{Y}$, from which the final edited result $y^{*}$ is chosen via reflective self-selection. This framework explicitly decouples “what to edit” and “how to edit”, enabling robust, interpretable, and contextually aligned visual editing.
  • Figure 3: Qualitative comparison with state-of-the-art methods. CoEditor++ consistently produces more precise and realistic edits that adhere to the user's intent, while preserving the background. In contrast, other methods often suffer from semantic misunderstandings (e.g., MagicBrush), unintended modifications to irrelevant areas (e.g., SmartEdit), or unrealistic artifacts (e.g., InstructPix2Pix).
  • Figure 4: Qualitative results of CoEditor++ and its ablated variants in instruction-based image editing. "NR" denotes a variant without reasoning, where region localization and modification are directly fused via segmentation and inpainting. "NR w/ GT Mask" represents the upper bound, using ground-truth localization masks with direct inpainting. As shown, both variants fail to achieve robust, user-intent-aligned, and visually consistent edits. NR produces excessive collateral changes, while "NR w/ GT Mask" still lacks semantic control. In contrast, CoEditor++ demonstrates precise localization, controlled modification, and minimal changes to irrelevant regions, closely mirroring human editing trajectories. These results reinforce our central insight: instruction-based image editing is fundamentally a reasoning-centric task, and performance improvements stem from structured cognitive coordination rather than any individual component or stronger visual models.
  • Figure 5: Robustness in continuous, multi-step editing. CoEditor++ maintains visual coherence and semantic fidelity through multiple rounds of editing. Even after several iterations, prior edits are preserved without introducing cumulative artifacts, demonstrating minimal error propagation and strong robustness in iterative editing scenarios.
  • ...and 4 more figures
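
The two-stage flow described in the Figure 2 caption can be sketched as a small pipeline over pluggable components. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CoEditorSketch`, the helper `reflective_select`, and all six callables are hypothetical placeholders for the LMM, segmentation model, and inpainting model. In particular, the caption does not say how reflective self-selection ranks candidates, so the sketch assumes a scalar score per candidate and keeps the highest-scoring one.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def reflective_select(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the highest-scoring candidate (assumed selection rule)."""
    if not candidates:
        raise ValueError("no candidates to select from")
    return max(candidates, key=score)


@dataclass
class CoEditorSketch:
    # All fields are hypothetical stand-ins for real models.
    propose_loc_prompts: Callable[[Any, str], List[str]]       # LMM: (x, c) -> P_loc
    segment: Callable[[Any, str], Any]                         # (x, prompt) -> mask
    score_mask: Callable[[Any, str, Any], float]               # (x, c, mask) -> score
    propose_mdf_prompts: Callable[[Any, str, Any], List[str]]  # LMM: (x, c, m*) -> P_mdf
    inpaint: Callable[[Any, Any, str], Any]                    # (x, m*, prompt) -> y
    score_edit: Callable[[Any, str, Any], float]               # (x, c, y) -> score

    def edit(self, x: Any, c: str) -> Tuple[Any, Any]:
        # Localization cognitive process (LCP): decide "what to edit".
        p_loc = self.propose_loc_prompts(x, c)
        masks = [self.segment(x, p) for p in p_loc]
        m_star = reflective_select(masks, lambda m: self.score_mask(x, c, m))

        # Modification cognitive process (MCP): decide "how to edit".
        p_mdf = self.propose_mdf_prompts(x, c, m_star)
        outputs = [self.inpaint(x, m_star, p) for p in p_mdf]
        y_star = reflective_select(outputs, lambda y: self.score_edit(x, c, y))
        return m_star, y_star
```

Because each stage only sees callables, the LMM, segmenter, or inpainter can be swapped without touching the control flow, which matches the caption's point that localization and modification are explicitly decoupled.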