Towards Multimodal Graph Large Language Model
Xin Wang, Zeyang Zhang, Linxin Xiao, Haibo Chen, Chendi Ge, Wenwu Zhu
TL;DR
This paper argues that current multi-modal graph learning lacks generalization across diverse data and tasks and proposes Multi-modal Graph Large Language Models (MG-LLMs) as a unified paradigm. It formalizes a unified framework for multi-modal graph data, tasks, and models, emphasizing inherent multi-granularity and multi-scale properties, and outlines five core MG-LLM characteristics: a unified structure-attribute space, diverse task handling, in-context learning, natural language interaction, and graph reasoning. The authors discuss a generative task formulation, compare transformation-based and MGNN/Graph-LLM approaches, and survey existing literature, highlighting challenges in vocabulary design, modality alignment, scalability, and open-set generation. They provide a roadmap of future directions, including novel graph vocabularies, modular architectures, robust prompting, and benchmark datasets to enable generalization across domains. The work aims to accelerate the development of native MG-LLMs capable of reasoning and generating over richly structured multi-modal graphs, with potential impact spanning science, finance, and social networks.
Abstract
Multi-modal graphs, which integrate diverse multi-modal features and relations, are ubiquitous in real-world applications. However, existing multi-modal graph learning methods are typically trained from scratch for specific graph data and tasks, and therefore fail to generalize across multi-modal graph data and tasks. To bridge this gap, we explore the potential of Multi-modal Graph Large Language Models (MG-LLMs) to unify and generalize across diverse multi-modal graph data and tasks. We propose a unified framework of multi-modal graph data, tasks, and models, and identify the inherent multi-granularity and multi-scale characteristics of multi-modal graphs. Specifically, we present five key desired characteristics for MG-LLMs: 1) a unified space for multi-modal structures and attributes, 2) the capability of handling diverse multi-modal graph tasks, 3) multi-modal graph in-context learning, 4) multi-modal graph interaction with natural language, and 5) multi-modal graph reasoning. We then elaborate on the key challenges, review related work, and highlight promising future research directions towards realizing these ambitious characteristics. Finally, we summarize existing multi-modal graph datasets pertinent to model training. We believe this paper can contribute to ongoing research towards MG-LLMs that generalize across multi-modal graph data and tasks.
