WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark

Wang Lin, Feng Wang, Majun Zhang, Wentao Hu, Tao Jin, Zhou Zhao, Fei Wu, Jingyuan Chen, Alan Yuille, Sucheng Ren

TL;DR

The paper tackles implicit, world-knowledge-driven image editing by introducing WorldEdit, a dataset of 11k high-quality edits and WorldEdit-Test for causal reasoning evaluation. It adopts a two-stage Bagel training pipeline with structured CoT paraphrases and reinforcement learning guided by a composite reward that enforces reasoning, visual fidelity, and causal grounding ($R = R_{\text{reason}} + R_{\text{fidelity}} + R_{\text{causal}}$). Results show WorldEdit enables open-source models to achieve competitive performance with top systems on knowledge plausibility and instruction following, narrowing gaps with GPT-4o. This work provides a foundation for knowledge-aware image editing and offers a scalable benchmark for evaluating and improving world-knowledge integration in multimodal models.
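As a rough illustration of the composite reward named in the summary above, the sketch below sums three per-sample scores exactly as the formula is written. The class and function names are hypothetical, and how each term is actually scored is not specified here, so the inputs are treated as already-computed floats. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardTerms:
    """Hypothetical container for the three reward components.

    How each score is produced (for example, by a judge model) is an
    assumption left outside this sketch.
    """
    reason: float    # score for the quality of the reasoning
    fidelity: float  # score for the visual fidelity of the edit
    causal: float    # score for causal grounding of the edit


def composite_reward(terms: RewardTerms) -> float:
    """Unweighted sum of the three components, as the formula is written."""
    return terms.reason + terms.fidelity + terms.causal


if __name__ == "__main__":
    # Illustrative values only; they are not results from the paper.
    example = RewardTerms(reason=0.5, fidelity=0.25, causal=0.25)
    print(composite_reward(example))
```

Whether the terms are weighted or normalized before summation is not stated in the summary; the plain sum here mirrors the formula only.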

Abstract

Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce WorldEdit, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide WorldEdit-Test for evaluating existing models' performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework for fine-tuning models like Bagel, integrating a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.

Paper Structure (22 sections, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Unlike traditional image editing (left), which adopts a uniform editing strategy for different editing objects, world editing (right) needs to take into account the nature of the editing objects in the real world and produce editing results that conform to causal logic.
  • Figure 2: The automated construction pipeline of the WorldEdit dataset. Open-world images are filtered and screened along three dimensions: (1) causal consistency of implicit instructions, (2) richness of the expected visual transformations, and (3) quality of the synthesized edited images.
  • Figure 3: Statistics of the WorldEdit dataset. (left) Distribution of word counts in paraphrased instruction, along with a word cloud of frequently edited objects. (right) Distribution of 10 editing instruction categories.
  • Figure 4: Qualitative comparison across different causal categories. The figure shows representative examples from ten causal reasoning tasks. Each row corresponds to a causal scenario, with the source image on the left followed by results from different models. Our method generates outputs that are both visually plausible and causally coherent, whereas baselines often produce irrelevant or stylistic edits, failing to reflect the causal logic of the instruction.
  • Figure 5: Qualitative results on WorldEdit-Test with paraphrased instructions. Text alone often fails to capture fine-grained causal details (e.g., scattering pattern of collapsed building blocks), and models vary in their ability to interpret such prompts. Our model, fine-tuned with WorldEdit, generates the most faithful and visually coherent images, underscoring the importance of high-quality world knowledge-driven data.
  • ...and 11 more figures