Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

Zhiqing Cui, Jiahao Yuan, Hanqing Wang, Yanshu Li, Chenxu Du, Zhenglong Ding

TL;DR

This work addresses the challenge of converting raster scientific diagrams into editable, executable symbolic representations by recovering their underlying structure. It introduces Draw with Thought (DwT), a training-free, cognitively inspired two-stage framework that guides Multimodal Large Language Models (MLLMs) through Coarse-to-Fine Planning and Structure-Aware Code Generation to produce mxGraph XML from diagram images. The accompanying Plot2XML benchmark, with 247 real-world diagrams and gold-standard XML annotations, supports evaluation across eight MLLMs. On it, DwT improves semantic fidelity, layout accuracy, and XML validity over the baselines, especially on hard diagrams, and human judgments confirm the trend. The approach offers a scalable path to diagram-to-code conversion without fine-tuning, with potential extensions to broader image-to-structure tasks and efficient long-context adaptation.

Abstract

Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively-grounded Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into executable representations and advancing machine understanding of scientific graphics.
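
To make the target representation concrete, the following is a minimal sketch of what mxGraph XML for a two-box diagram looks like and how its structural validity might be checked. It is an illustration only, not the authors' pipeline: the helper names `build_diagram` and `check_structure`, the example labels, and the specific checks (well-formedness, unique cell ids, edges pointing at existing cells) are assumptions chosen for this example. The element and attribute names (`mxGraphModel`, `root`, `mxCell`, `mxGeometry`) follow the standard mxGraph serialization.

```python
import xml.etree.ElementTree as ET


def build_diagram(nodes, edges):
    """Serialize boxes and connectors as mxGraph XML (hypothetical helper).

    nodes: list of (id, label, x, y, width, height)
    edges: list of (id, source_id, target_id)
    """
    model = ET.Element("mxGraphModel")
    root = ET.SubElement(model, "root")
    # mxGraph conventionally starts with a root cell "0" and a default layer "1".
    ET.SubElement(root, "mxCell", id="0")
    ET.SubElement(root, "mxCell", id="1", parent="0")
    for cell_id, label, x, y, w, h in nodes:
        cell = ET.SubElement(root, "mxCell", id=cell_id, value=label,
                             style="rounded=1;", vertex="1", parent="1")
        # "as" is a Python keyword, so the attributes are passed as a dict.
        ET.SubElement(cell, "mxGeometry", {"x": str(x), "y": str(y),
                                           "width": str(w), "height": str(h),
                                           "as": "geometry"})
    for cell_id, source, target in edges:
        cell = ET.SubElement(root, "mxCell", id=cell_id, edge="1",
                             source=source, target=target, parent="1")
        ET.SubElement(cell, "mxGeometry", {"relative": "1", "as": "geometry"})
    return ET.tostring(model, encoding="unicode")


def check_structure(xml_text):
    """Return a list of problems; an empty list means the checks passed."""
    try:
        model = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    if model.tag != "mxGraphModel":
        problems.append(f"unexpected root element: {model.tag}")
    cells = model.findall("./root/mxCell")
    ids = [cell.get("id") for cell in cells]
    known = set(ids)
    if len(known) != len(ids):
        problems.append("duplicate cell ids")
    for cell in cells:
        if cell.get("edge") == "1":
            for end in ("source", "target"):
                ref = cell.get(end)
                if ref is not None and ref not in known:
                    problems.append(f"edge {cell.get('id')}: unknown {end} {ref}")
    return problems


xml_text = build_diagram(
    nodes=[("2", "Input", 40, 40, 120, 60), ("3", "Output", 240, 40, 120, 60)],
    edges=[("4", "2", "3")],
)
print(xml_text)
print(check_structure(xml_text))
```

A check of this kind only covers syntax and references. It says nothing about whether the reconstructed layout or labels match the source image, which is what the semantic and layout metrics are meant to capture.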

Paper Structure

This paper contains 24 sections, 10 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of our Draw with Thought. The framework for scientific diagram generation processes input diagrams through Coarse-to-Fine Planning (perceptual structuring and semantic layout planning) followed by Structure-Aware Code Generation to produce editable, high-fidelity diagram representations.
  • Figure 2: Complexity analysis of scientific diagrams. Visualized as a radar chart over five dimensions: Connection Complexity, Graphical Complexity, Color Effects, Text Annotation, and Special Elements. The chart compares three difficulty levels—Easy (blue, 85 diagrams, 34.4%), Medium (coral, 98 diagrams, 39.7%), and Hard (green, 64 diagrams, 25.9%)—highlighting distinct structural patterns across complexity axes.
  • Figure 3: Qualitative comparison between DwT and state-of-the-art MLLMs. The visualized outputs indicate that DwT captures both the layout and the content of the input image well and outperforms the baselines.
  • Figure 4: Comparison of diagram reconstruction results across different approaches.
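
As a quick consistency check on the difficulty split reported in the Figure 2 caption, the per-level counts can be summed and converted to percentages. The counts are taken from the caption; the variable names are illustrative.

```python
# Diagram counts per difficulty level, as given in the Figure 2 caption.
counts = {"Easy": 85, "Medium": 98, "Hard": 64}

total = sum(counts.values())
# Share of each level as a percentage, rounded to one decimal place.
shares = {level: round(100 * n / total, 1) for level, n in counts.items()}

print(total)
print(shares)
```

The counts add up to the 247 diagrams in Plot2XML, and the rounded shares agree with the percentages stated in the caption.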