Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

Zhentao Zou, Zhengrong Yue, Kunpeng Du, Binlei Bao, Hanting Li, Haizhen Xie, Guozheng Xu, Yue Zhou, Yali Wang, Jie Hu, Xue Jiang, Xinghao Chen

TL;DR

This work tackles natural-language-driven image editing in complex scenes by introducing MURE, which interleaves textual and visual reasoning to decompose edits into sub-tasks. A Multimodal Deep Confidence (MMDC) mechanism guards against hallucinations by exploring multiple visual reasoning paths and pruning low-quality branches using a reward model, yielding more reliable intermediate steps and final outputs. The authors release the CoT-Edit-14K dataset and demonstrate that MURE achieves strong performance across the MagicBrush, Emu, and SmartEdit benchmarks, outperforming or matching state-of-the-art methods. Overall, the approach advances image editing by integrating explicit visual cues into the CoT process and enforcing path-level quality control, paving the way for more precise and trustworthy multimodal editing systems.

Abstract

Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the visual cues needed to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning in which each textual description is followed by a corresponding visual cue, such as a positional mask that defines the intended edit region or a representation of the new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce the Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.
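
The step-wise branch pruning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Step`, `Chain`, `mmdc_search`, `propose`, and `reward` are hypothetical, visual cues are represented by plain strings rather than images, and keeping only the single top-scoring branch at each step is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    text: str    # textual rationale for this sub-task
    visual: str  # stand-in for a visual cue (mask or new-content image)


@dataclass
class Chain:
    steps: List[Step] = field(default_factory=list)


def mmdc_search(
    rationales: List[str],
    propose: Callable[[Chain, str, int], str],
    reward: Callable[[Chain, str, str], float],
    num_candidates: int = 4,
) -> Chain:
    """Build an interleaved text-image chain one step at a time.

    At each step, several candidate visual cues are proposed, each is
    scored by a reward function, and only the best one is kept
    (an assumed pruning rule, chosen here for simplicity).
    """
    chain = Chain()
    for text in rationales:
        candidates = [propose(chain, text, k) for k in range(num_candidates)]
        scored = [(reward(chain, text, c), c) for c in candidates]
        _, best = max(scored, key=lambda pair: pair[0])
        chain.steps.append(Step(text=text, visual=best))
    return chain


# Toy stand-ins for the generator and the reward model (hypothetical).
def toy_propose(chain: Chain, text: str, k: int) -> str:
    return f"{text}#cand{k}"


def toy_reward(chain: Chain, text: str, visual: str) -> float:
    # Pretend candidate 2 is always the highest-confidence branch.
    return 1.0 if visual.endswith("cand2") else 0.1


result = mmdc_search(["locate region", "draw new content"], toy_propose, toy_reward)
for step in result.steps:
    print(step.text, "->", step.visual)
```

In a real system, `propose` would sample a visual cue from the multimodal model and `reward` would query the reward model for a confidence score. A wider beam could retain several branches per step instead of one.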

Paper Structure

This paper contains 12 sections, 7 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Visualization of our interleaved text-visual reasoning process and a comparative result. Given the prompt "swap the tv for a lizard", the MURE model correctly performs multi-step reasoning to remove the lizard and its reflection in the mirror, generating a final edited image that maintains physical consistency. In contrast, baseline approaches fail to handle this complex editing task, leading to erroneous results.
  • Figure 2: Overview of the MURE framework. Left: Our framework leverages an interleaved text-image CoT to achieve high-fidelity image editing. Right: The Multimodal Deep Confidence (MMDC) reasoning explores a tree of visual reasoning paths at each step. It prunes low-quality branches based on a deep confidence score from a reward model, ensuring a superior trajectory toward the final edited image.
  • Figure 3: MURE Dataset Construction Process. Top: The visual annotation pipeline constructs explicit visual cues, including positional masks that define intended edited regions and valid representations of new content. Bottom: The textual annotation pipeline generates detailed textual descriptions based on the annotated CoT images from the top pipeline. The specific example illustrates the reasoning for an Obj. Repl. task. For a detailed overview of the dataset, refer to Figure \ref{dataset}.
  • Figure 4: L1 Score vs. Output Token Cost on the MagicBrush test set.
  • Figure 5: Visual comparison results. Additional examples are provided in Figures \ref{casev} and \ref{cased}.
  • ...and 15 more figures