InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo

TL;DR

This work proposes InterCoG, a text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. It also introduces two auxiliary training modules, multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment, which enforce spatial localization accuracy and reasoning interpretability, respectively.

Abstract

Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, fine-grained editing in complex multi-entity scenes remains a significant challenge, particularly when targets are not visually salient and locating them requires spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details, explicitly deducing the location and identity of the edited target. It then conducts visual grounding by highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules, multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment, to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
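The three stages named in the abstract (textual position reasoning, visual grounding with a box and mask, and a rewritten editing description) can be pictured as one record per edit. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `GroundingChain`, `box_from_mask`, and `highlight` are hypothetical, the image is a toy grayscale grid stored as nested lists, and the box is derived from a given mask, whereas InterCoG generates both boxes and masks itself.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical convention: (x0, y0, x1, y1), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]
Grid = List[List[int]]


@dataclass
class GroundingChain:
    """Hypothetical container for one interleaved chain-of-grounding trace."""
    instruction: str     # original user edit request
    text_reasoning: str  # stage 1: position reasoning expressed purely in text
    box: Box             # stage 2: bounding box around the deduced target
    mask: Grid           # stage 2: binary mask (1 = target pixel), image-sized
    rewritten_edit: str  # stage 3: edit description stating the intended outcome


def box_from_mask(mask: Grid) -> Box:
    """Return the tightest box that encloses every 1-pixel of the mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        raise ValueError("mask contains no target pixels")
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def highlight(image: Grid, chain: GroundingChain,
              box_value: int = 9, mask_value: int = 5) -> Grid:
    """Return a copy of the image with the box outline and mask drawn on it."""
    out = [row[:] for row in image]  # leave the input image untouched
    x0, y0, x1, y1 = chain.box
    # Draw the box outline first.
    for x in range(x0, x1):
        out[y0][x] = box_value
        out[y1 - 1][x] = box_value
    for y in range(y0, y1):
        out[y][x0] = box_value
        out[y][x1 - 1] = box_value
    # Paint the mask on top, so target pixels stay visible on the outline.
    for y, row in enumerate(chain.mask):
        for x, on in enumerate(row):
            if on:
                out[y][x] = mask_value
    return out
```

In this toy version the highlighted grid stands in for the visually grounded image that sits between the text reasoning and the rewritten description. Keeping the three stages in one record makes it easy to check that the text, the box, and the mask all refer to the same target.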

Paper Structure (17 sections, 5 equations, 20 figures, 7 tables)

This paper contains 17 sections, 5 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: InterCoG, a novel framework that achieves spatially precise image editing in complex scenes via interleaved chain-of-grounding reasoning. InterCoG first conducts position reasoning (textual grounding), highlights the bounding boxes and masks on the image (visual grounding), and then rewrites the editing description to produce the final result. InterCoG achieves superior editing performance compared to state-of-the-art methods. Our results are more precise, making the model highly effective for realistic applications.
  • Figure 2: GroundEdit-45K Dataset Construction Pipeline and Statistics. Left: The pipeline consists of three steps: (1) target selection and visual grounding generation; (2) instruction and text reasoning generation; and (3) context-preserving local target editing. Right: Our dataset contains 45K samples covering 8 categories of local editing types.
  • Figure 3: Overview of the Proposed InterCoG Framework. Left: Our framework performs text–vision interleaved chain-of-grounding reasoning to interpret and locate user-intended targets and formulate editing descriptions, ultimately producing precise and semantically aligned editing results. Right: Illustration of the proposed text–vision and vision–vision reasoning alignment schemes, designed to enforce coherent multimodal grounding reasoning.
  • Figure 4: Qualitative comparisons on our proposed GroundEdit-Bench. Our method consistently delivers highly precise and spatially accurate edits, particularly in multi-entity and fine-grained reasoning scenarios. Best viewed on screen.
  • Figure 5: Visualization of the interleaved localization chain-of-thought reasoning process. InterCoG first interprets user-intended referential targets via textual reasoning and then highlights the object by visualizing bounding boxes and masks in pixel space. By interleaving these multimodal grounding cues, InterCoG is able to precisely locate the editing regions and achieve spatially exact modifications. The editing results of competing methods can be found in the appendix. Best viewed on screen.
  • ...and 15 more figures