Generative Visual Chain-of-Thought for Image Editing

Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He, Yu Xu, Chunyu Wang, Bing Li, Zheng Chang, Kongming Liang, Qinglin Lu, Zhanyu Ma

TL;DR

This work proposes Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit, and demonstrates that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit.

Abstract

Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes the visual tokens generated during the reasoning and editing phases in an end-to-end manner. This joint optimization fosters the emergence of innate spatial reasoning ability and enables more effective use of visual-domain cues. The main challenge in training GVCoT is the scarcity of large-scale editing data with precise edit-region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before the final edit, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope GVCoT will inspire future research toward interpretable and precise image editing.

Paper Structure

This paper contains 11 sections, 3 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Generative Visual Chain-of-Thought (GVCoT). A comparison of three reasoning paradigms: (a) Text CoT, which reasons purely within the text space; (b) Visual CoT (with Tools), which leverages external tools to highlight target regions; and (c) Our GVCoT, which performs native visual reasoning via a generative diffusion process within a unified space.
  • Figure 2: Comparing spatial cue representations for image editing on ImgEdit [ye2025imgedit]. We study two ways of injecting spatial information: (1) the text modality, which provides bounding box coordinates, and (2) the visual modality, which provides a binary mask. Providing spatial information in the visual modality yields a greater improvement in both instruction adherence and background preservation.
  • Figure 3: Supervised Fine-Tuning of our GVCoT training recipe. Stage 1: Multi-Task Visual Manipulation, where the model's generation expert is trained in a multi-task setup to inject the new masking skill. Stage 2: Visual Reason-aided Editing, where the entire model is trained to generate a faithful and interpretable visual reasoning image and then an edited image within a single sequence.
  • Figure 4: GVCoT-Edit-Instruct Data Pipeline. Left: We design a scalable multi-stage data pipeline to curate high-quality samples with faithful editing region annotations, i.e., bounding boxes and masks. Right: The distribution of GVCoT-Edit-Instruct spanning 19 tasks.
  • Figure 5: Illustration of SREdit-Bench. Left: We provide challenging scenarios featuring complex scenes and fine-grained referring expressions. Right: (a) We quantify scene complexity by counting editable objects and regions. Results show that SREdit-Bench concentrates on more sophisticated scenes than ImgEdit [ye2025imgedit] and GEdit-Bench [liu2025step1x]. (b) Referral type distribution. (c) Edit task distribution.
  • ...and 2 more figures
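
The two spatial cue representations compared in the Figure 2 caption (bounding box coordinates as text versus a binary mask as an image) can be illustrated with a minimal sketch. This is not the authors' code: the coordinate convention, the `<box>` text format, and all function names below are assumptions chosen for illustration, and a rectangular mask stands in for the object-shaped masks a real pipeline would likely use.

```python
from typing import List, Tuple

# Assumed convention (not from the paper): (x_min, y_min, x_max, y_max),
# in pixels, with the max coordinates exclusive.
BBox = Tuple[int, int, int, int]


def bbox_to_text(bbox: BBox) -> str:
    """Text-modality cue: serialize the box into a prompt fragment.

    The "<box>...</box>" format is hypothetical.
    """
    x0, y0, x1, y1 = bbox
    return f"<box>({x0},{y0}),({x1},{y1})</box>"


def bbox_to_mask(bbox: BBox, height: int, width: int) -> List[List[int]]:
    """Visual-modality cue: rasterize the box into a binary mask.

    Pixels inside the box are 1 (region to edit); all others are 0.
    """
    x0, y0, x1, y1 = bbox
    # Clip the box to the image bounds so out-of-range boxes stay valid.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_area_fraction(mask: List[List[int]]) -> float:
    """Fraction of pixels marked as editable; 0.0 for an empty mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


if __name__ == "__main__":
    box = (1, 1, 3, 2)
    print(bbox_to_text(box))
    for row in bbox_to_mask(box, height=3, width=4):
        print(row)
```

The text form is compact but leaves the model to map numbers onto pixels, whereas the mask is already aligned with the image grid, which is one plausible reading of why the caption reports larger gains for the visual modality.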