Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

Lingzhuang Sun; Yuxia Zhu; Ruitong Liu; Hao Liang; Zheng Sun; Caijun Jia; Honghao He; Yuchen Wu; Siyuan Li; Jingxuan Wei; Xiangxiang Zhang; Bihui Yu; Wentao Zhang

TL;DR

Canvas-CoT replaces linear, immutable text-based reasoning with a mutable, DOM-backed external state to ground multimodal reasoning. By enabling atomic CRUD updates on this state and coupling them with a rendering-based critique loop, the approach reduces token overhead, mitigates error propagation, and provides explicit visual grounding for spatial tasks. Across VCode, RBench-V, and MathVista, Canvas-CoT is reported to be more accurate and more robust than traditional CoT, Tree-of-Thought, and other baselines, highlighting the practical value of external structured state for complex reasoning. The work introduces formal mechanisms (ID-addressable nodes, deterministic parsing, non-monotonic state transitions) together with ablations, establishing a foundation for vision-language reasoning that can be extended to broader high-dimensional domains.

Abstract

While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce Canvas-of-Thought (Canvas-CoT). By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the "ground truth". Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.
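The in-place revision idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy, in-memory stand-in for the DOM, assuming ID-addressable nodes and a rule-based check in place of the rendering-based critique. The operation names `insert_element` and `replace_element` appear in the paper's Figure 1 caption; `Node`, `CanvasState`, `read_element`, `delete_element`, `render`, `critique`, and all coordinates are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    tag: str
    attrs: Dict[str, float] = field(default_factory=dict)


class CanvasState:
    """Toy ID-addressable state; a stand-in for the DOM, not the paper's code."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}  # dicts preserve insertion order

    def insert_element(self, node: Node) -> None:
        if node.node_id in self.nodes:
            raise KeyError(f"duplicate id: {node.node_id}")
        self.nodes[node.node_id] = node

    def read_element(self, node_id: str) -> Node:
        return self.nodes[node_id]

    def replace_element(self, node_id: str, node: Node) -> None:
        if node_id not in self.nodes:
            raise KeyError(f"unknown id: {node_id}")
        if node.node_id != node_id:
            raise ValueError("replacement must keep the same id")
        self.nodes[node_id] = node  # in place: all other nodes are untouched

    def delete_element(self, node_id: str) -> None:
        del self.nodes[node_id]

    def render(self) -> str:
        # Serialize the state to SVG-like markup (assumed format).
        body = "".join(
            f'<{n.tag} id="{n.node_id}" '
            + " ".join(f'{k}="{v}"' for k, v in n.attrs.items())
            + "/>"
            for n in self.nodes.values()
        )
        return f"<svg>{body}</svg>"


def critique(state: CanvasState, radius: float) -> List[str]:
    """Hypothetical hard-constraint check: |IB| must equal the radius."""
    i, b = state.read_element("I"), state.read_element("B")
    dist = math.hypot(i.attrs["cx"] - b.attrs["cx"], i.attrs["cy"] - b.attrs["cy"])
    if math.isclose(dist, radius):
        return []
    return [f"|IB|={dist:.3f}, expected {radius}"]


state = CanvasState()
state.insert_element(Node("B", "circle", {"cx": 0.0, "cy": 0.0}))
# Deliberately wrong first guess for point I (invented coordinates).
state.insert_element(Node("I", "circle", {"cx": math.sqrt(3), "cy": 0.0}))

conflicts_before = critique(state, radius=1.0)
if conflicts_before:
    # Revise only the offending node instead of regenerating everything.
    state.replace_element("I", Node("I", "circle", {"cx": 1.0, "cy": 0.0}))
conflicts_after = critique(state, radius=1.0)
```

In this sketch the correction touches one node by ID, so node `B` and the ordering of the state are unchanged. That is the property the abstract contrasts with appending corrections to, or regenerating, a linear text chain.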

Paper Structure (51 sections, 14 equations, 10 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Canvas-CoT Methodology
Preliminaries
Task Definition
The Reasoning Substrate: The DOM State
The Action Space: CRUD Operations
State Initialization
The Iterative Reasoning Cycle
Atomic Reasoning and Action Generation
Non-monotonic State Transition
Visual Grounding and Adversarial Critique
Recurrent Context Optimization
Termination
DOM Tree Manipulation
...and 36 more sections

Figures (10)

Figure 1: Paradigm Shift: Linear Textual Narration to Stateful Visual Modeling. (a): Text-CoT exhibits latent constraint hallucination. An implicit horizontal sliding prior mislocates the Instantaneous Center of Rotation (ICR), deriving a false geometry $IB=R\sqrt{3}$ obscured by linear algebraic steps. (b): Canvas-CoT externalizes this assumption via the DOM. The Rendering-Critique loop flags spatial conflicts, triggering a replace_element operation to realign the vector and correct the ICR. This restores the valid geometry $IB=R$, followed by insert_element to finalize auxiliary lines and point $C$.
Figure 2: Overview of the Canvas-CoT Pipeline.
Figure 3: VCode Bench Token.
Figure 4: RBench-V Token.
Figure 5: SigLIP Score of Gemini on VCode benchmark.
...and 5 more figures
