VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context

Yunxin Li, Baotian Hu, Haoyuan Shi, Wei Wang, Longyue Wang, Min Zhang

TL;DR

VisionGraph is a multimodal benchmark for graph theory problems in visual context, designed to measure how well large multimodal models understand graphical structure and perform multi-step reasoning. It extends NLGraph with visual graphs generated via NetworkX, covering eight graph problems across difficulty levels plus two perception questions. The authors compare a range of LMMs, reveal limitations in graphical perception, and propose Description-Program-Reasoning (DPR), a hybrid natural-language and code approach that, when paired with external tools, substantially improves multi-step reasoning, particularly for GPT-4V. The work offers a benchmark together with prompting and tooling strategies for advancing visual-mathematical reasoning in practical domains such as robotics planning and biology.

Abstract

Large Multimodal Models (LMMs) have achieved impressive success in visual understanding and reasoning, remarkably improving the performance of mathematical reasoning in a visual context. Yet, a challenging type of visual math lies in the multimodal graph theory problem, which demands that LMMs understand the graphical structures accurately and perform multi-step reasoning on the visual graph. Additionally, exploring multimodal graph theory problems will lead to more effective strategies in fields like biology, transportation, and robotics planning. To step forward in this direction, we are the first to design a benchmark named VisionGraph, used to explore the capabilities of advanced LMMs in solving multimodal graph theory problems. It encompasses eight complex graph problem tasks, from connectivity to shortest path problems. Subsequently, we present a Description-Program-Reasoning (DPR) chain to enhance the logical accuracy of reasoning processes through graphical structure description generation and algorithm-aware multi-step reasoning. Our extensive study shows that 1) GPT-4V outperforms Gemini Pro in multi-step graph reasoning; 2) All LMMs exhibit inferior perception accuracy for graphical structures, whether in zero/few-shot settings or with supervised fine-tuning (SFT), which further affects problem-solving performance; 3) DPR significantly improves the multi-step graph reasoning capabilities of LMMs and the GPT-4V (DPR) agent achieves SOTA performance.
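To make the "program" stage of a description-then-program chain concrete, the following is a minimal sketch, not the authors' implementation. It assumes the description stage has already produced a weighted, undirected edge list as `(u, v, weight)` triples. The function names `build_graph`, `is_connected`, and `shortest_path`, and the example edges, are hypothetical and chosen for illustration only. It covers two of the benchmark's task types, connectivity and shortest path, using standard breadth-first search and Dijkstra's algorithm.

```python
import heapq
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    return graph


def is_connected(graph, source, target):
    """Breadth-first search: is there any path from source to target?"""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (total_weight, path) or (None, [])."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None, []
    # Walk predecessors back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical edge list, standing in for a model-generated graph description.
edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5), (4, 5, 1)]
graph = build_graph(edges)
print(is_connected(graph, 0, 3))   # True
print(is_connected(graph, 0, 4))   # False
print(shortest_path(graph, 0, 3))  # (8, [0, 2, 1, 3])
```

In a sketch like this, the final answer is only as reliable as the edge list: a single misread edge changes the computed path, which is consistent with the abstract's observation that perception accuracy affects problem-solving performance.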

Paper Structure (15 sections, 11 figures, 5 tables)

This paper contains 15 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Two cases of using GPT-4V (Date: 2024.01.17) to answer easy graph understanding and reasoning questions. Incorrect responses are highlighted in red.
  • Figure 2: An overview of various multimodal graph theory problems in the VisionGraph benchmark.
  • Figure 3: Two scenarios in the augmented graph understanding data: 1) Overall Edge Recognition, which focuses on identifying and interpreting the connections between nodes; 2) Edge-relevant VQA, which poses questions about the visual aspects and significance of the graph's edges and nodes.
  • Figure 4: Overview of the prompting approach for GPT-4V (DPR).
  • Figure 5: An Illustration of the designed GPT-4V (DPR) Agent.
  • ...and 6 more figures