Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, Furu Wei

TL;DR

This work introduces Visualization-of-Thought (VoT) prompting, which elicits mind's-eye-like visual thinking in large language models for spatial reasoning. By interleaving reasoning steps with grounded visualizations, VoT provides a visuospatial sketchpad that guides subsequent steps and grounds internal states in 2D representations. Across natural-language navigation, visual navigation, and visual tiling, VoT shows substantial gains over zero-shot CoT and the no-visualization (w/o Viz) baseline, and it even surpasses some multimodal models on these tasks. The study analyzes visual state tracking behavior, the impact of visualization accuracy on final answers, and how VoT scales across model sizes, highlighting both the promise and the current limitations of grounding LLM reasoning in internal visual imagery. Overall, VoT offers a scalable, zero-shot approach to enhancing spatial reasoning in LLMs, with potential extensions to broader modalities and real-world grounding.

Abstract

Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. Please find the dataset and code at https://microsoft.github.io/visualization-of-thought
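
To make the idea of interleaving reasoning steps with visualizations concrete, the following is a minimal sketch of what a text-form 2D grid and an interleaved state trace could look like for a navigation task. It is an illustration under stated assumptions, not the authors' implementation: the cell symbols, the helper names (`render`, `step`, `interleaved_trace`, `build_vot_prompt`), and the exact wording of `VOT_INSTRUCTION` are all hypothetical and do not come from the text above.

```python
from typing import List, Tuple

# Hypothetical cell symbols; the paper's grids use emoji and special characters.
WALL, AGENT = "#", "A"

# Assumed instruction wording, appended to a task description to request
# a visualization after every reasoning step.
VOT_INSTRUCTION = "Visualize the state after each reasoning step."

# Row/column offsets for each move on the 2D grid.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def render(grid: List[str], agent: Tuple[int, int]) -> str:
    """Return the grid as text, drawing the agent on top of its current cell."""
    rows = []
    for r, row in enumerate(grid):
        rows.append(
            "".join(AGENT if (r, c) == agent else cell for c, cell in enumerate(row))
        )
    return "\n".join(rows)


def step(grid: List[str], agent: Tuple[int, int], move: str) -> Tuple[int, int]:
    """Apply one move; stay in place if the target is a wall or off the grid."""
    dr, dc = MOVES[move]
    r, c = agent[0] + dr, agent[1] + dc
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != WALL:
        return (r, c)
    return agent


def interleaved_trace(grid: List[str], start: Tuple[int, int], moves: List[str]) -> str:
    """Build a trace that alternates a textual step with a rendered grid state."""
    agent = start
    parts = ["Start:\n" + render(grid, agent)]
    for i, move in enumerate(moves, 1):
        agent = step(grid, agent, move)
        parts.append(f"Step {i}: move {move}\n" + render(grid, agent))
    return "\n\n".join(parts)


def build_vot_prompt(task: str, grid: List[str], start: Tuple[int, int]) -> str:
    """Combine the task text, the initial grid, and the visualization instruction."""
    return f"{task}\n\n{render(grid, start)}\n\n{VOT_INSTRUCTION}"
```

In this sketch, `interleaved_trace` produces by hand the kind of step-plus-grid sequence that the prompt asks the model to generate on its own, which is useful as a reference when checking whether a model's visualized states are consistent with its moves.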

Paper Structure (43 sections, 9 equations, 24 figures, 7 tables)

Figures (24)

  • Figure 1: Humans can enhance their spatial awareness and inform decisions by creating mental images during the spatial reasoning process. Similarly, large language models (LLMs) can create internal mental images. We propose VoT prompting to elicit the "mind's eye" of LLMs for spatial reasoning by visualizing their thoughts at each intermediate step.
  • Figure 2: Examples of a navigation map under different settings of $k$, with a house emoji indicating the starting point and an office emoji indicating the destination.
  • Figure 3: Example of visual tiling with masked polyomino pieces. Variants of those polyomino pieces including rotation and reflection are not shown in this figure.
  • Figure 4: Examples of VoT prompting in three tasks, where the LLM generates 2D grids as text-form mental images. The generated reasoning traces and visualizations form an interleaved sequence to track the state over time. The 2D grids in the input and responses are composed of special characters. Full responses can be found in the appendix.
  • Figure 5: Tracking rate of different settings across all tasks.
  • ...and 19 more figures