Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning

Siqu Ou, Hongcheng Liu, Pingjie Wang, Yusheng Liao, Chuan Xuan, Yanfeng Wang, Yu Wang

TL;DR

This work tackles dynamic multimodal spatial reasoning by introducing GRASSLAND, a dynamic maze benchmark with two tasks (Maze Judgment and Maze Navigation) that expose the limitations of existing MLLMs in evolving environments. It proposes Draft CoT, a reasoning paradigm that pairs textual thoughts with visual drafts overlaid on the dynamic input images, and presents Dynamic Draft-Augmented Reasoning (D2R), a training-free framework that integrates these drafts into model reasoning. Across multiple MLLMs, D2R consistently improves performance on dynamic reasoning tasks, remains robust across task difficulty and model capability, and approaches the performance of Draft CoT with ground-truth drafts. The work offers a scalable, training-free way to enhance dynamic multimodal reasoning, with practical implications for robotics and navigation tasks where dynamic perception is essential.

Abstract

While chain-of-thought (CoT) reasoning has advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains and often falter in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. The project is available at https://github.com/Cratileo/D2R.
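
The idea of overlaying a textual thought on the input as a draft can be illustrated with a minimal sketch. This is not the authors' implementation: a text grid stands in for the maze image, and the move vocabulary, the wall symbol `#`, and the function name `overlay_draft` are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical move vocabulary; the benchmark's actual action format is not given here.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def overlay_draft(grid: List[str], start: Tuple[int, int], moves: List[str]) -> List[str]:
    """Return a copy of `grid` with the cells visited by `moves` marked '*'.

    The start cell is marked 'S'. Marking stops at the first move that
    would leave the grid or enter a wall cell ('#', an assumed symbol).
    """
    cells = [list(row) for row in grid]
    r, c = start
    cells[r][c] = "S"
    for move in moves:
        dr, dc = MOVES[move]
        nr, nc = r + dr, c + dc
        # Stop drafting when the proposed step is out of bounds.
        if not (0 <= nr < len(cells) and 0 <= nc < len(cells[nr])):
            break
        # Stop drafting when the proposed step hits a wall.
        if cells[nr][nc] == "#":
            break
        r, c = nr, nc
        if cells[r][c] != "S":
            cells[r][c] = "*"
    return ["".join(row) for row in cells]


if __name__ == "__main__":
    maze = ["..#", "...", "#.."]
    for row in overlay_draft(maze, (0, 0), ["right", "down", "down", "right"]):
        print(row)
```

In this toy setting the marked grid plays the role of the draft: it makes the proposed route visible alongside the obstacles, so an invalid step shows up as a route that stops short.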

Paper Structure

This paper contains 35 sections, 7 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: Demonstration of Draft CoT with D2R. Language-centric CoT leaves spatial information gaps, and static visual CoT captures dynamic information incompletely because it visualizes only the input rather than the MLLM’s thought process; Draft CoT, by contrast, excels at dynamic spatial reasoning.
  • Figure 2: Illustration of the difference between our method and others. Direct prompting and language-centric CoT face significant limitations in dynamic spatial reasoning tasks without images. VAP can only generate static images based on agent prompts, without involving the MLLM in dynamic perception. MVOT requires MLLMs with strong image-generation ability, obtained by training on specialized datasets. In contrast, D2R marks the textual thought on the image as a draft and integrates it into Draft CoT, enhancing the MLLM's dynamic spatial reasoning ability without specific training.
  • Figure 3: Example of a dynamic scenario sequence in GRASSLAND. The left part illustrates the dynamic images and grids in GRASSLAND, and the right part describes the two tasks.
  • Figure 4: Accuracy of different models and methods on the hard Maze Judgment task. GT denotes results obtained using the ground-truth route.
  • Figure 5: Average accuracy of models for each choice using various methods in the Maze Judgment task.
  • ...and 3 more figures