Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Sachit Menon, Richard Zemel, Carl Vondrick

TL;DR

This work introduces a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities, and shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning.

Abstract

When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical 'whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
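The loop described in the abstract (the model writes drawing code, the code is run to produce an image, and the image is returned to the model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names, the OUT_PATH convention, and the two stand-in model callables are all hypothetical. To keep the sketch runnable with only the standard library, the stand-in "model" emits code that writes a tiny PGM bitmap, whereas the paper relies on generated Matplotlib or Turtle code.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable

# Hypothetical prompt wording; the source does not give the exact prompt.
DRAW_PROMPT = (
    "Write Python code that draws a visualization to help answer the query "
    "and saves it to the path in the OUT_PATH environment variable.\n\n"
    "Query: {query}"
)


def whiteboard_of_thought(
    query: str,
    write_code: Callable[[str], str],
    read_image: Callable[[str, bytes], str],
) -> str:
    """Run one whiteboard step: generate code, render it, answer from the image."""
    # Step 1: ask the model for visualization code.
    code = write_code(DRAW_PROMPT.format(query=query))

    # Step 2: run the generated code in a separate process to render the image.
    with tempfile.TemporaryDirectory() as workdir:
        image_path = os.path.join(workdir, "whiteboard.img")
        env = dict(os.environ, OUT_PATH=image_path)
        subprocess.run(
            [sys.executable, "-c", code],
            check=True, env=env, cwd=workdir, timeout=60,
        )
        with open(image_path, "rb") as f:
            image = f.read()

    # Step 3: return the rendered image to the model along with the query.
    return read_image(query, image)


def fake_write_code(prompt: str) -> str:
    """Stand-in for the model's code-writing call (assumption: not a real API)."""
    return (
        "import os\n"
        "rows = ['#.#', '###', '#.#']\n"
        "pixels = bytes(255 if c == '#' else 0 for r in rows for c in r)\n"
        "with open(os.environ['OUT_PATH'], 'wb') as f:\n"
        "    f.write(b'P5 3 3 255\\n' + pixels)\n"
    )


def fake_read_image(query: str, image: bytes) -> str:
    """Stand-in for the model's image-reading call (assumption: not a real API)."""
    return f"received {len(image)} image bytes"


if __name__ == "__main__":
    print(whiteboard_of_thought("Which letter is drawn?", fake_write_code, fake_read_image))
```

In practice the two callables would wrap calls to a multimodal model, and the generated code should be treated as untrusted; running it in a subprocess with a timeout, as here, is only a basic precaution.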

Paper Structure

This paper contains 27 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: For queries that are trivial with visual reasoning, chain-of-thought (which produces the text on the left) can fail in surprising ways. Whiteboard-of-thought (which produces the code, image, and text on the right) provides an alternative to perform intermediate reasoning with images.
  • Figure 2: Example queries for each of the three ASCII understanding BIG-Bench tasks (Srivastava et al., 2022) we consider, along with the WoT visualization for each.
  • Figure 3: The different forms of ASCII in the BIG-Bench ASCII Word Recognition task and the visualizations made by WoT. Note that 'Bubble' simply includes the word with some additional characters, and 'Doh' forms the shape of each letter out of itself. (CoT results: 'ascii', 'hello', 'discovered', 'meet', 'goodbye'.) Best viewed with zoom.
  • Figure 4: A qualitative breakdown of the sources of error for WoT evaluated on the ASCII MNIST task.
  • Figure 5: Example WoT visual for spatial navigation.
  • ...and 2 more figures