Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models
Ruizhou Li, Haiyun Jiang
TL;DR
This paper introduces Graph2Vision, the first benchmark tailored for multi-graph reasoning with Vision-Language Models (VLMs), covering four graph types (flowcharts, knowledge graphs, mind maps, route maps) and both homogeneous and heterogeneous groupings. It proposes a multi-dimensional evaluation framework focusing on graph parsing, instruction-following, and reasoning consistency, and validates the benchmark by evaluating several state-of-the-art VLMs and fine-tuning open-source models. The results reveal that current VLMs have notable potential but struggle with cross-graph integration, with reasoning consistency (RCC) emerging as the most robust dimension and route maps posing distinct challenges. The work provides a principled step toward cross-modal graph intelligence and outlines scalable, domain-aware directions for future research, including scaling to larger models and expanding to additional graph modalities and applications.
Abstract
Recent advances in Vision-Language Models (VLMs) have shown promising capabilities in interpreting visualized graph data, offering a new perspective for graph-structured reasoning beyond traditional Graph Neural Networks (GNNs). However, existing studies focus primarily on single-graph reasoning, leaving the critical challenge of multi-graph joint reasoning underexplored. In this work, we introduce the first comprehensive benchmark designed to evaluate and enhance the multi-graph reasoning abilities of VLMs. Our benchmark covers four common graph types (knowledge graphs, flowcharts, mind maps, and route maps) and supports both homogeneous and heterogeneous graph groupings with tasks of increasing complexity. We evaluate several state-of-the-art VLMs under a multi-dimensional scoring framework that assesses graph parsing, reasoning consistency, and instruction-following accuracy. Additionally, we fine-tune multiple open-source models and observe consistent improvements, confirming the effectiveness of our dataset. This work provides a principled step toward advancing multi-graph understanding and reveals new opportunities for cross-modal graph intelligence.
