When Graph meets Multimodal: Benchmarking and Meditating on Multimodal Attributed Graphs Learning

Hao Yan, Chaozhuo Li, Jun Yin, Zhigang Yu, Weihao Han, Mingzheng Li, Zhengxin Zeng, Hao Sun, Senzhang Wang

TL;DR

MAGB introduces the first comprehensive benchmark suite for Multimodal Attributed Graphs (MAGs), combining five real-world graphs with textual and visual node attributes to study the interplay between multimodal information and graph topology. It systematically compares two MAG representation learning paradigms, GNN-as-Predictor and VLM-as-Predictor, and reveals domain-dependent modality importance, the data sensitivity of multimodal embeddings for GNNs, and strong zero-shot capabilities of Vision-Language Models with retrieval-enhanced prompts. The findings offer practical guidance: modality contributions vary by domain, VLMs help balance modality biases and enable zero-shot inference, and graph structure remains crucial for link prediction. GRE generally aids LVLMs but can complicate zero-shot outputs. The authors release the MAGB dataset and evaluation pipeline publicly, aiming to standardize MAG research and foster collaboration across the graph, NLP, and computer vision communities.

Abstract

Multimodal Attributed Graphs (MAGs) are ubiquitous in real-world applications, encompassing extensive knowledge through multimodal attributes attached to nodes (e.g., texts and images) and a topological structure representing node interactions. Despite its potential to advance diverse research fields such as social networks and e-commerce, MAG representation learning (MAGRL) remains underexplored due to the lack of standardized datasets and evaluation frameworks. In this paper, we first propose MAGB, a comprehensive MAG benchmark dataset featuring curated graphs from various domains with both textual and visual attributes. Based on the MAGB dataset, we further systematically evaluate two mainstream MAGRL paradigms: $\textit{GNN-as-Predictor}$, which integrates multimodal attributes via Graph Neural Networks (GNNs), and $\textit{VLM-as-Predictor}$, which harnesses Vision Language Models (VLMs) for zero-shot reasoning. Extensive experiments on MAGB reveal the following critical insights: $\textit{(i)}$ The significance of each modality fluctuates drastically with specific domain characteristics. $\textit{(ii)}$ Multimodal embeddings can elevate the performance ceiling of GNNs. However, intrinsic biases among modalities may impede effective training, particularly in low-data scenarios. $\textit{(iii)}$ VLMs are highly effective at generating multimodal embeddings that alleviate the imbalance between textual and visual attributes. These discoveries, which illuminate the synergy between multimodal attributes and graph topologies, contribute to reliable benchmarks, paving the way for future MAG research. The MAGB dataset and evaluation pipeline are publicly available at https://github.com/sktsherlock/MAGB.
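
As a rough illustration of the GNN-as-Predictor idea described above, the following minimal sketch fuses per-modality node embeddings and runs one round of neighbor aggregation. It is not the MAGB pipeline: the concatenation fusion, the unweighted mean aggregation (standing in for a trained GNN layer), the function names, and the toy numbers are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]


def fuse_features(text_emb: Vector, image_emb: Vector) -> Vector:
    """Concatenate text and image embeddings (an assumed fusion choice)."""
    return list(text_emb) + list(image_emb)


def mean_aggregate(
    features: Dict[int, Vector], adjacency: Dict[int, List[int]]
) -> Dict[int, Vector]:
    """Average each node's feature with those of its neighbors.

    This is a weight-free stand-in for one message-passing layer.
    """
    out: Dict[int, Vector] = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in adjacency.get(node, [])]
        out[node] = [
            sum(vec[i] for vec in group) / len(group) for i in range(len(feat))
        ]
    return out


# Toy graph with three nodes; all embedding values are made up.
text = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
image = {0: [0.5], 1: [1.5], 2: [2.5]}
adjacency = {0: [1], 1: [0, 2], 2: [1]}

features = {n: fuse_features(text[n], image[n]) for n in text}
hidden = mean_aggregate(features, adjacency)
```

In a real setting the modality encoders would produce the embeddings, and a trained GNN with learned weights would replace the plain averaging step before a task-specific prediction head.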

Paper Structure

This paper contains 36 sections, 1 equation, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Illustration of a Multimodal Attributed Graph example. The left part is the MAG topological structure and the right part presents the multimodal attributes in detail.
  • Figure 2: Overview of GNN-as-Predictor. The attribute representations generated by modality encoders serve as the initial node features. The GNN predictor addresses various downstream tasks based on the multimodal representations.
  • Figure 3: Overview of VLM-as-Predictor. GRE$^k_m$ strategy first samples $k$ neighbors, and then prompts VLM with associated multimodal attributes via natural language instructions.
  • Figure 4: t-SNE visualization of different embeddings on the Reddit-M dataset.
  • Figure 5: Time consumption of the LLaMA-3.2 11B Vision model.
  • ...and 7 more figures
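
The Figure 3 caption describes a two-step procedure: sample $k$ neighbors, then prompt the VLM with their multimodal attributes. A minimal sketch of that procedure is given below. The caption does not define the subscript $m$, so the sketch treats it as a hypothetical choice of which neighbor modalities to include. The function names, prompt wording, file paths, and label set are all illustrative assumptions, and no real VLM API is called.

```python
import random
from typing import Dict, List, Sequence


def sample_neighbors(
    adjacency: Dict[int, List[int]], node: int, k: int, seed: int = 0
) -> List[int]:
    """Return at most k neighbors of node, sampled without replacement."""
    neighbors = adjacency.get(node, [])
    if len(neighbors) <= k:
        return list(neighbors)
    return random.Random(seed).sample(neighbors, k)


def build_prompt(
    node: int,
    attrs: Dict[int, Dict[str, str]],
    adjacency: Dict[int, List[int]],
    k: int,
    modalities: Sequence[str],
    labels: Sequence[str],
    seed: int = 0,
) -> Dict[str, object]:
    """Assemble a text prompt and an image list for a hypothetical VLM call."""
    lines = [f"Target node text: {attrs[node]['text']}"]
    images = [attrs[node]["image"]]  # the target image is always included
    for i, n in enumerate(sample_neighbors(adjacency, node, k, seed), start=1):
        if "text" in modalities:
            lines.append(f"Neighbor {i} text: {attrs[n]['text']}")
        if "image" in modalities:
            images.append(attrs[n]["image"])
    lines.append("Choose one category from: " + ", ".join(labels) + ".")
    return {"prompt": "\n".join(lines), "images": images}


# Toy data; the texts, paths, and labels are invented for the example.
attrs = {
    0: {"text": "red running shoes", "image": "img/0.jpg"},
    1: {"text": "trail sneakers", "image": "img/1.jpg"},
    2: {"text": "sports socks", "image": "img/2.jpg"},
    3: {"text": "hiking boots", "image": "img/3.jpg"},
}
adjacency = {0: [1, 2, 3]}

request = build_prompt(0, attrs, adjacency, k=2, modalities=("text",),
                       labels=["Shoes", "Toys"])
```

Passing `modalities=("text", "image")` would also attach the sampled neighbors' images, which is one way to vary how much neighborhood context the model sees.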