GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

Jiajin Liu; Dongzhe Fan; Chuanhao Ji; Daochen Zha; Qiaoyu Tan

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision-language models as a new foundation for multimodal graph learning. The benchmark code is publicly available at https://github.com/oamyjin/GraphVLM.
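As a rough illustration of the VLM-as-Encoder paradigm described above, the sketch below fuses per-node text and image embeddings and then runs one round of neighbor aggregation, which is the basic operation a graph neural network builds on. The concatenation-based fusion, the mean aggregator, the function names, and the toy vectors are all assumptions made for illustration; they are not taken from the GraphVLM code. In practice the embeddings would come from a VLM encoder such as CLIP rather than being written by hand.

```python
from typing import Dict, List

Vector = List[float]


def fuse_features(text_emb: List[Vector], image_emb: List[Vector]) -> List[Vector]:
    """Concatenate each node's text and image embeddings (simple early fusion)."""
    return [t + v for t, v in zip(text_emb, image_emb)]


def mean_aggregate(features: List[Vector], neighbors: Dict[int, List[int]]) -> List[Vector]:
    """Average each node's vector with those of its neighbors (one message-passing round)."""
    out = []
    for node, feat in enumerate(features):
        # Include the node itself, which acts like a self-loop.
        group = [feat] + [features[n] for n in neighbors.get(node, [])]
        out.append([sum(vec[d] for vec in group) / len(group) for d in range(len(feat))])
    return out


# Hypothetical toy graph: three nodes on a path 0 - 1 - 2.
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-in for VLM text embeddings
image_emb = [[2.0], [4.0], [6.0]]                 # stand-in for VLM image embeddings
neighbors = {0: [1], 1: [0, 2], 2: [1]}

fused = fuse_features(text_emb, image_emb)
smoothed = mean_aggregate(fused, neighbors)
print(smoothed[0])
```

A trained GNN would apply learned weights and a nonlinearity after each aggregation step; this sketch leaves them out to keep the fusion and structure-propagation steps visible.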

Paper Structure (20 sections, 7 equations, 5 figures, 12 tables)

Introduction
Formulations and Background
Multimodal Graph Definitions
Multimodal Graph Learning Methods
Benchmark Design
VLM-as-Encoder
VLM-as-Aligner
VLM-as-Predictor
Experiments
Datasets
Impact of VLM-as-Encoder (RQ1)
Adaptation of VLM-as-Aligner (RQ2)
Effectiveness of VLM-as-Predictor (RQ3)
Comparative Analysis of VLM Roles (RQ4)
Conclusion
...and 5 more sections

Figures (5)

Figure 1: Overview of the GraphVLM benchmark with a timeline of graph learning research. Existing graph learning methods are categorized into three groups based on the prediction backbone. The bottom-left corner illustrates the functional roles VLMs play in each category.
Figure 2: Average accuracy (%) across six datasets under different modalities for GNN-based methods, using CLIP as the encoder.
Figure 3: The impact of different VLM-based methods with multimodal structure information at the prompt level for node classification tasks, using Qwen-VL-7B as the backbone.
Figure 4: Performance comparison of three VLM roles.
Figure 5: Extra performance gain of structure-awareness.
