Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning

Jing Zhu, Yuhang Zhou, Shengyi Qian, Zhongmou He, Tong Zhao, Neil Shah, Danai Koutra

TL;DR

MM-Graph introduces a comprehensive benchmark that systematically combines visual and textual node features with graph structure across seven real-world datasets to evaluate multimodal graph learning. It standardizes evaluation by providing uniform GNNs, knowledge graph embedding (KGE) models, feature encoders, dataloaders, and evaluators, enabling fair comparisons between unimodal and multimodal approaches. Key findings show that current multimodal GNNs often underperform conventional GNNs due to integration challenges, but aligned feature encoders and multimodal representations can yield meaningful performance gains, especially when visual data complements text. The benchmark emphasizes the importance of visual information in real-world graph tasks and establishes a foundation for developing more effective multimodal graph learning algorithms.

Abstract

Graph machine learning has made significant strides in recent years, yet the integration of visual information with graph structure and its potential for improving performance in downstream tasks remains an underexplored area. To address this critical gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), a pioneering benchmark that incorporates both visual and textual information into graph learning tasks. MM-GRAPH extends beyond existing text-attributed graph benchmarks, offering a more comprehensive evaluation framework for multimodal graph learning. Our benchmark comprises seven diverse datasets of varying scales (ranging from thousands to millions of edges), designed to assess algorithms across different tasks in real-world scenarios. These datasets feature rich multimodal node attributes, including visual data, which enables a more holistic evaluation of various graph learning frameworks in complex, multimodal environments. To support advancements in this emerging field, we provide an extensive empirical study on various graph learning frameworks when presented with features from multiple modalities, particularly emphasizing the impact of visual information. This study offers valuable insights into the challenges and opportunities of integrating visual data into graph learning.
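
The TL;DR above mentions that the benchmark ships standardized dataloaders and evaluators. MM-GRAPH's actual interfaces are not reproduced here; as a rough illustration of the kind of standardized link-prediction evaluation this refers to, below is a minimal Hits@K and MRR sketch in plain PyTorch. The function names, tensor shapes, and the shared negative-sample pool are assumptions made for illustration, not the benchmark's API.

```python
import torch

def hits_at_k(pos_scores: torch.Tensor, neg_scores: torch.Tensor, k: int = 20) -> float:
    """Fraction of positive edges scored above the k-th best negative edge.

    pos_scores: (P,) predicted scores for true (positive) edges
    neg_scores: (N,) predicted scores for sampled negative edges
    Shapes and names are illustrative, not MM-GRAPH's actual evaluator interface.
    """
    if neg_scores.numel() < k:
        return 1.0
    threshold = torch.topk(neg_scores, k).values[-1]  # k-th highest negative score
    return (pos_scores > threshold).float().mean().item()

def mean_reciprocal_rank(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> float:
    """MRR where every positive edge is ranked against the same pool of negatives."""
    # rank = 1 + number of negatives scoring at least as high as the positive
    ranks = 1 + (neg_scores.unsqueeze(0) >= pos_scores.unsqueeze(1)).sum(dim=1)
    return (1.0 / ranks.float()).mean().item()

# Example: random scores stand in for a trained link predictor's outputs.
pos = torch.randn(1000)
neg = torch.randn(50_000)
print(f"Hits@20: {hits_at_k(pos, neg, k=20):.4f}, MRR: {mean_reciprocal_rank(pos, neg):.4f}")
```

Evaluating every method against the same negative pool and the same metrics is what makes the unimodal-versus-multimodal comparisons in the benchmark meaningful.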

Paper Structure

This paper contains 22 sections, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: (RQ2) Multimodal GNNs underperform conventional GNNs. We compare the best performance of multimodal GNNs (MMGCN/MGAT) with that of conventional GNNs (SAGE, GCN, BUDDY). Conventional GNNs consistently perform better across datasets, which underscores the importance of building MM-Graph and calls for better multimodal GNN designs.
  • Figure 2: (RQ3) Feature alignment is important. We compare the performance of various feature encoders and find that aligned features (e.g., CLIP and ImageBind) yield much better performance than unaligned features on Amazon-Sports and Amazon-Cloth. Among them, ImageBind performs the best across backbones, indicating the importance of using aligned features on these datasets.
  • Figure 3: (RQ4) Multimodal features are helpful for graph learning. Multimodal features perform better than text-only features across datasets and tasks, which justifies the necessity of introducing multimodal graph datasets.
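
Figures 2 and 3 point to the core modeling recipe the benchmark evaluates: encode each node's text and image with (ideally aligned) encoders such as CLIP or ImageBind, fuse the embeddings, and feed them to a GNN. Below is a minimal sketch of that pipeline using PyTorch Geometric, with random tensors standing in for precomputed CLIP embeddings; the concatenation-based fusion and dot-product link scorer are illustrative choices, not MM-GRAPH's prescribed models.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class MultimodalSAGE(torch.nn.Module):
    """Two-layer GraphSAGE over fused text + image node features.

    Fusion here is plain concatenation of per-node embeddings; the benchmark's
    actual multimodal models and feature pipeline may differ.
    """
    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.conv1 = SAGEConv(text_dim + image_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, out_dim)

    def forward(self, text_x, image_x, edge_index):
        x = torch.cat([text_x, image_x], dim=-1)   # late fusion of the two modalities
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

# Toy stand-ins: in practice these would be frozen CLIP/ImageBind embeddings
# of each node's text and image attributes (aligned encoders, per Figure 2).
num_nodes = 100
text_x = torch.randn(num_nodes, 512)    # e.g., CLIP text embeddings
image_x = torch.randn(num_nodes, 512)   # e.g., CLIP image embeddings
edge_index = torch.randint(0, num_nodes, (2, 400))  # random edges for illustration

model = MultimodalSAGE(text_dim=512, image_dim=512, hidden_dim=256, out_dim=128)
node_emb = model(text_x, image_x, edge_index)
# Score a candidate link (u, v) with a dot product, as in common link-prediction setups.
score = (node_emb[0] * node_emb[1]).sum()
print(node_emb.shape, score.item())
```

Concatenation is the simplest late-fusion strategy; the paper's findings suggest that how the modalities are integrated matters at least as much as which encoder produces them.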