Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models

Ruizhou Li, Haiyun Jiang

TL;DR

This paper introduces Graph2Vision, the first benchmark tailored for multi-graph reasoning with Vision-Language Models, covering four graph types (flowcharts, knowledge graphs, mind maps, route maps) and both homogeneous and heterogeneous groupings. It proposes a multi-dimensional evaluation framework focusing on graph parsing, instruction-following, and reasoning consistency, and validates the benchmark by evaluating several state-of-the-art VLMs and conducting fine-tuning on open-source models. The results reveal that current VLMs have notable potential but struggle with cross-graph integration, with RCC emerging as the most robust dimension and route maps posing distinct challenges. The work provides a principled step toward cross-modal graph intelligence and outlines scalable, domain-aware directions for future research, including scaling to larger models and expanding to additional graph modalities and applications.

Abstract

Recent advances in Vision-Language Models (VLMs) have shown promising capabilities in interpreting visualized graph data, offering a new perspective for graph-structured reasoning beyond traditional Graph Neural Networks (GNNs). However, existing studies focus primarily on single-graph reasoning, leaving the critical challenge of multi-graph joint reasoning underexplored. In this work, we introduce the first comprehensive benchmark designed to evaluate and enhance the multi-graph reasoning abilities of VLMs. Our benchmark covers four common graph types (knowledge graphs, flowcharts, mind maps, and route maps) and supports both homogeneous and heterogeneous graph groupings with tasks of increasing complexity. We evaluate several state-of-the-art VLMs under a multi-dimensional scoring framework that assesses graph parsing, reasoning consistency, and instruction-following accuracy. Additionally, we fine-tune multiple open-source models and observe consistent improvements, confirming the effectiveness of our dataset. This work provides a principled step toward advancing multi-graph understanding and reveals new opportunities for cross-modal graph intelligence.

Paper Structure

This paper contains 47 sections, 11 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: (a) Examples of the four types of graphs included in our benchmark: knowledge graphs, mind maps, flowcharts, and route maps. (b) An example sample from our benchmark, consisting of a set of related graphs, a corresponding instruction, and its reference answer.
  • Figure 2: Overview of the benchmark construction pipeline. The process includes: (1) collecting diverse graph images across four types; (2) grouping them into semantically or structurally coherent sets using ColBERT-inspired matching or subgraph splitting; and (3) generating instruction-response pairs via GPT-4o, followed by manual review and refinement to ensure clarity and reasoning quality.
  • Figure 3: Score distribution histograms across three evaluation dimensions (GPA, RCC and IRA) for each of the five models. Each subplot shows the frequency of scores ranging from 1 to 5, where bars are colored by evaluation dimension.
  • Figure 4: Scatter plot illustrating the consistency between automatic evaluation scores and human ratings on a randomly sampled 10% subset of the evaluation dataset. Each point corresponds to a specific score pair (Human, GPT), with color intensity indicating how often that pair occurs. The diagonal dotted line denotes perfect agreement, while the red dashed line shows the linear regression trend between automatic and human scores; the closer this trend lies to the diagonal, the stronger the consistency.
  • Figure 5: Score distributions for each VLM across three evaluation dimensions: GPA, RCC and IRA. Each row corresponds to a specific evaluation dimension, and each column to a different model. For each score level (1–5), lighter bars indicate results before fine-tuning, while darker bars represent results after fine-tuning.
  • ...and 18 more figures
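
The agreement analysis described in the Figure 4 caption (score-pair frequencies plus a linear regression trend) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `score_pair_counts` and `linear_trend` and the sample ratings are assumptions made for the example, and scores are assumed to be integers from 1 to 5 as in the figure captions.

```python
from collections import Counter


def score_pair_counts(human, auto):
    """Count how often each (human, automatic) score pair occurs."""
    if len(human) != len(auto):
        raise ValueError("score lists must have the same length")
    return Counter(zip(human, auto))


def linear_trend(human, auto):
    """Ordinary least-squares fit: auto ~ slope * human + intercept."""
    if len(human) != len(auto) or not human:
        raise ValueError("score lists must be non-empty and equally long")
    n = len(human)
    mean_h = sum(human) / n
    mean_a = sum(auto) / n
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_h == 0:
        raise ValueError("human scores must not all be identical")
    cov = sum((h - mean_h) * (a - mean_a) for h, a in zip(human, auto))
    slope = cov / var_h
    intercept = mean_a - slope * mean_h
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical ratings, invented purely for illustration.
    human_scores = [1, 2, 3, 4, 5, 5, 3]
    auto_scores = [1, 2, 3, 4, 5, 4, 3]
    print(score_pair_counts(human_scores, auto_scores))
    print(linear_trend(human_scores, auto_scores))
```

A slope near 1 with an intercept near 0 corresponds to a trend line lying close to the diagonal, which the caption interprets as strong consistency between automatic and human scores.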