Chain of Images for Intuitively Reasoning

Fanxu Meng, Haotong Yang, Yiding Wang, Muhan Zhang

TL;DR

Addresses the limitation of text-only reasoning in LLMs by introducing Chain-of-Images (CoI), which uses symbolic image intermediates generated by SyMLLM to support reasoning. The approach is validated on Geometry, Chess, and Commonsense tasks via the CoIEval benchmark, showing substantial gains over pure-text Chain-of-Thought baselines. The work provides a concrete symbolic-image framework and a dedicated evaluation dataset to drive future multimodal reasoning capabilities in large-scale models. This has practical implications for enhancing interpretability and reliability of AI reasoning in domains requiring visual and relational reasoning.

Abstract

The human brain is naturally equipped to comprehend and interpret visual information rapidly. When confronted with complex problems or concepts, we use flowcharts, sketches, and diagrams to aid our thought process. Leveraging this inherent ability can significantly enhance logical reasoning. However, current Large Language Models (LLMs) do not utilize such visual intuition to help their thinking. Even the most advanced vision language models (e.g., GPT-4V and LLaVA) merely align images into textual space, which means their reasoning processes remain purely verbal. To mitigate these limitations, we present a Chain of Images (CoI) approach, which converts complex language reasoning problems into simple pattern recognition by generating a series of images as intermediate representations. Furthermore, we have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving. Based on this dataset, we aim to construct a benchmark to assess the capability of future multimodal large-scale models to leverage images for reasoning. To support CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions and accepts both text and images as input. Experiments on Geometry, Chess and Common Sense tasks sourced from the CoI evaluation dataset show that CoI improves performance significantly over the pure-language Chain of Thoughts (CoT) baselines. The code is available at https://github.com/GraphPKU/CoI.

Paper Structure (11 sections, 10 figures, 4 tables)

Figures (10)

  • Figure 1: The simplified steps GPT-4 takes to calculate the number of intersection points using CoT are displayed at the bottom of the image. The whole CoI process is displayed at the top right of the image. Two issues were identified in CoT: 1. Incorrect numerical values were used during the formula derivation, and 2. The existence of endpoints in the line segments was overlooked. By contrast, CoI easily identifies the number of intersection points in the image generated by SyMLLM.
  • Figure 2: Images play a pivotal role in many disciplines. We tend to imagine pictures to solve problems intuitively.
  • Figure 3: a) Given a uniform-modal model with a diffusion model as an image decoder, CoI can work under zero-shot mode. b) The symbolic response generated by SyMLLM given language instruction can be directly converted to an image.
  • Figure 4: The 3-shot prompt for building the CoIEval dataset.
  • Figure 5: This figure demonstrates the generation of new images by SDXL (center) and DALL·E 3 (right), using the captions derived from the original images (left).
  • ...and 5 more figures
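
The second CoT failure noted in the Figure 1 caption (overlooking that line segments have endpoints) can be illustrated numerically. The following is a minimal sketch, not the authors' code: the function names and the example coordinates are hypothetical, and it only handles segments in general position (no touching endpoints, no collinear overlaps). It contrasts counting crossings of the infinite lines through each segment with counting crossings of the segments themselves.

```python
from itertools import combinations

# Hypothetical example data: three segments given as ((x1, y1), (x2, y2)).
SEGMENTS = [
    ((0, 0), (4, 4)),
    ((0, 4), (4, 0)),
    ((5, 0), (6, 3)),
]

def cross(o, a, b):
    """2D cross product of vectors (a - o) and (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lines_cross(s, t):
    """True if the infinite lines through s and t are not parallel."""
    (p1, p2), (q1, q2) = s, t
    det = (p2[0] - p1[0]) * (q2[1] - q1[1]) - (p2[1] - p1[1]) * (q2[0] - q1[0])
    return det != 0

def segments_cross(s, t):
    """True if s and t cross at a point interior to both segments.

    Assumption: general position, so touching endpoints and collinear
    overlaps are deliberately not counted.
    """
    (p1, p2), (q1, q2) = s, t
    # p1 and p2 must lie on opposite sides of t, and q1 and q2 on opposite sides of s.
    return (cross(q1, q2, p1) * cross(q1, q2, p2) < 0
            and cross(p1, p2, q1) * cross(p1, p2, q2) < 0)

def count_crossing_pairs(segments, predicate):
    """Count unordered pairs of segments that satisfy the given predicate."""
    return sum(1 for s, t in combinations(segments, 2) if predicate(s, t))

if __name__ == "__main__":
    print("pairs whose lines cross:   ", count_crossing_pairs(SEGMENTS, lines_cross))
    print("pairs whose segments cross:", count_crossing_pairs(SEGMENTS, segments_cross))
```

With this example data, all three pairs of infinite lines cross, but only one pair of segments does, so ignoring the endpoints would overcount the intersections.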