DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

Chi Zhang, Haibo Qiu, Qiming Zhang, Zhixiong Zeng, Lin Ma, Jing Zhang

TL;DR

The paper addresses the reliability of multimodal reasoning in vision-language models by moving beyond text-only chains of thought to an image-interactive paradigm. It introduces DeepSketcher, a two-part suite: a code-rendered dataset with interleaved image–text reasoning, and a self-contained model that reasons over and edits visual embeddings directly, removing the need for external tools. Trained with a three-phase strategy on data gathered by an agentic collection pipeline, DeepSketcher reports strong gains across multiple multimodal benchmarks, especially on geometry, counting, and logic-related tasks, with ablation studies used to check robustness. The work reduces grounding noise and enables flexible, tool-free visual reasoning, offering a practical, open-source path toward more reliable and transparent multimodal reasoning in constrained domains.

Abstract

The "thinking with images" paradigm represents a pivotal shift in the reasoning of Vision Language Models (VLMs), moving from text-dominant chain-of-thought to image-interactive reasoning. By invoking visual tools or generating intermediate visual representations, VLMs can iteratively attend to fine-grained regions, enabling deeper image understanding and more faithful multimodal reasoning. As an emerging paradigm, however, it still leaves substantial room for exploration in data construction accuracy, structural design, and broader application scenarios, which offer rich opportunities for advancing multimodal reasoning. To further advance this line of work, we present DeepSketcher, a comprehensive suite comprising both an image-text interleaved dataset and a self-contained model. The dataset contains 31k chain-of-thought (CoT) reasoning trajectories with diverse tool calls and resulting edited images, covering a wide range of data types and manipulation instructions with high annotation accuracy. Building on this resource, we design a model that performs interleaved image-text reasoning and natively generates "visual thoughts" by operating directly in the visual embedding space, rather than invoking external tools and repeatedly re-encoding generated images. This design enables tool-free and more flexible "thinking with images". Extensive experiments on multimodal reasoning benchmarks demonstrate strong performance, validating both the utility of the dataset and the effectiveness of the model design.

Paper Structure

This paper contains 37 sections, 8 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: In code space (right), edits are specified through rendering code, offering precision and reproducibility. In contrast, grounding-based manipulation (bounding box predicted by GPT-5 [openai2025gpt5]) and generation-based manipulation (image generated by Nano-Banana [nanobanana]) often yield noisy results, underscoring their limitations in stability and controllability.
  • Figure 2: Disciplinary coverage of our dataset.
  • Figure 3: Wordcloud of visual manipulations.
  • Figure 4: Architecture of the proposed DeepSketcher model. A query $Q$ and initial image $I_0$ are encoded into the vision–language model, producing reasoning tokens $R_t$ and edit instructions $Act_t$. The Embedding Editor manipulates visual embeddings directly, supervised by code-rendered ground-truth edits, and inserts updated embeddings back into the VLM context. This process yields interleaved reasoning and visual manipulation traces, ultimately producing the final answer.
  • Figure 5: Difference map visualizations. Each example shows the input image (left), the programmatic rendering (middle; available only in Indicator-500), and the difference map between the embedding editor output and the original visual embedding (right).
  • ...and 6 more figures
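
The interleaved loop described in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustration of the control flow (reason, emit an edit instruction, edit the visual embedding, feed it back into the context), not the authors' implementation. Every name here (`Step`, `run_interleaved`, `toy_policy`, `toy_editor`, `max_steps`) is hypothetical, and the toy policy and editor stand in for the learned VLM and Embedding Editor, which the paper trains with supervision from code-rendered edits.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumption: a visual embedding is a list of token vectors (one row per token).
Embedding = List[List[float]]
Context = List[Tuple[str, object]]  # ("text", str) or ("image", Embedding)


@dataclass
class Step:
    reasoning: str                 # plays the role of R_t
    action: Optional[str]          # plays the role of Act_t; None means "answer now"
    answer: Optional[str] = None


def run_interleaved(
    query: str,
    image_emb: Embedding,
    policy: Callable[[Context], Step],
    editor: Callable[[Embedding, str], Embedding],
    max_steps: int = 8,
) -> Tuple[Optional[str], Context]:
    """Alternate reasoning and embedding edits until the policy answers."""
    context: Context = [("text", query), ("image", image_emb)]
    current = image_emb
    for _ in range(max_steps):
        step = policy(context)
        context.append(("text", step.reasoning))
        if step.action is None:
            return step.answer, context
        # Edit the embedding directly; no image is rendered or re-encoded.
        current = editor(current, step.action)
        context.append(("image", current))
    return None, context  # step budget exhausted without an answer


def toy_policy(context: Context) -> Step:
    """Hypothetical stand-in for the VLM: request one edit, then answer."""
    n_images = sum(1 for kind, _ in context if kind == "image")
    if n_images < 2:
        return Step(reasoning="An auxiliary mark would help.", action="highlight token 0")
    return Step(reasoning="The mark resolves the question.", action=None, answer="done")


def toy_editor(emb: Embedding, action: str) -> Embedding:
    """Hypothetical stand-in for the Embedding Editor: shift one token vector."""
    index = int(action.split()[-1])
    return [
        [v + 1.0 for v in row] if i == index else list(row)
        for i, row in enumerate(emb)
    ]


if __name__ == "__main__":
    answer, trace = run_interleaved(
        "How many triangles?", [[0.0, 0.0], [1.0, 1.0]], toy_policy, toy_editor
    )
    print(answer, [kind for kind, _ in trace])
```

The point of the sketch is the design choice the caption highlights: the edited embedding is appended to the context as is, so the loop never calls an external tool or a vision encoder between steps.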