When Graph meets Multimodal: Benchmarking and Meditating on Multimodal Attributed Graphs Learning
Hao Yan, Chaozhuo Li, Jun Yin, Zhigang Yu, Weihao Han, Mingzheng Li, Zhengxin Zeng, Hao Sun, Senzhang Wang
TL;DR
MAGB introduces the first comprehensive benchmark suite for Multimodal Attributed Graphs (MAGs), combining five real-world graphs with textual and visual node attributes to study the interplay between multimodal information and graph topology. It systematically compares two MAG representation learning paradigms—GNN-as-Predictor and VLM-as-Predictor—revealing domain-dependent modality importance, the data-sensitivity of multimodal embeddings for GNNs, and strong zero-shot capabilities of Vision-Language Models with retrieval-enhanced prompts. The findings highlight practical guidance: modality contributions vary by domain, VLMs help balance modality biases and enable zero-shot inference, and graph structure remains crucial for link prediction; GRE generally aids LVLMs but can complicate zero-shot outputs. The authors provide public access to the MAGB dataset and evaluation pipeline, aiming to standardize MAG research and foster cross-disciplinary collaboration across graph, NLP, and computer vision communities.
Abstract
Multimodal Attributed Graphs (MAGs) are ubiquitous in real-world applications, encompassing extensive knowledge through multimodal attributes attached to nodes (e.g., texts and images) and a topological structure representing node interactions. Despite its potential to advance diverse research fields such as social networks and e-commerce, MAG representation learning (MAGRL) remains underexplored due to the lack of standardized datasets and evaluation frameworks. In this paper, we first propose MAGB, a comprehensive MAG benchmark dataset featuring curated graphs from various domains with both textual and visual attributes. Based on the MAGB dataset, we further systematically evaluate two mainstream MAGRL paradigms: $\textit{GNN-as-Predictor}$, which integrates multimodal attributes via Graph Neural Networks (GNNs), and $\textit{VLM-as-Predictor}$, which harnesses Vision Language Models (VLMs) for zero-shot reasoning. Extensive experiments on MAGB reveal the following critical insights: $\textit{(i)}$ The importance of each modality fluctuates drastically with domain-specific characteristics. $\textit{(ii)}$ Multimodal embeddings can elevate the performance ceiling of GNNs. However, intrinsic biases among modalities may impede effective training, particularly in low-data scenarios. $\textit{(iii)}$ VLMs are highly effective at generating multimodal embeddings that alleviate the imbalance between textual and visual attributes. These discoveries, which illuminate the synergy between multimodal attributes and graph topologies, contribute to reliable benchmarks, paving the way for future MAG research. The MAGB dataset and evaluation pipeline are publicly available at https://github.com/sktsherlock/MAGB.
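To make the GNN-as-Predictor idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each node already has a text embedding and an image embedding, fuses them by simple concatenation, and applies one round of unweighted mean aggregation over neighbors (a GNN layer stripped of learned weights and nonlinearity). The function names `fuse_modalities` and `mean_aggregate`, the toy graph, and the embedding values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Vector = List[float]


def fuse_modalities(text_emb: Dict[str, Vector],
                    image_emb: Dict[str, Vector]) -> Dict[str, Vector]:
    """Concatenate per-node text and image embeddings (assumed fusion scheme)."""
    return {node: text_emb[node] + image_emb[node] for node in text_emb}


def mean_aggregate(features: Dict[str, Vector],
                   neighbors: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Average each node's features with those of its neighbors.

    This is one message-passing step without learned parameters.
    """
    out: Dict[str, Vector] = {}
    for node, feat in features.items():
        group = [feat] + [features[m] for m in neighbors.get(node, [])]
        out[node] = [sum(vec[i] for vec in group) / len(group)
                     for i in range(len(feat))]
    return out


# Hypothetical toy graph: three nodes on a path a - b - c.
text_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
image_emb = {"a": [2.0], "b": [4.0], "c": [6.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

fused = fuse_modalities(text_emb, image_emb)   # 3-dimensional node features
hidden = mean_aggregate(fused, neighbors)      # topology-aware node features
```

In a real pipeline the fused features would pass through several learned GNN layers and a task head (for example, node classification); the sketch only shows where the multimodal attributes and the graph topology meet.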
