Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey

Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo, Xiang Chen, Qian Li, Wen Zhang, Jiaoyan Chen, Yushan Zhu, Jiaqi Li, Xiaoze Liu, Jeff Z. Pan, Ningyu Zhang, Huajun Chen

TL;DR

This survey synthesizes research on Knowledge Graphs (KGs) in multi-modal learning, organizing the landscape into two directions: KG-driven multi-modal learning (KG4MM) and multi-modal knowledge graph research (MM4KG), which centers on MMKGs. It details KG and MMKG construction, outlines task taxonomies across understanding, generation, retrieval, and pretraining, and surveys representative methods for both directions. Drawing on datasets, benchmarks, and representative architectures, the paper highlights current challenges, such as long-tail knowledge, missing modalities, and cross-modal integration, and points to opportunities involving LLMs, structured pre-training, and AI-for-science applications. The work aims to provide a practical roadmap for researchers and practitioners in academia and industry who work on knowledge-grounded multi-modal AI systems.

Abstract

Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the semantic web community's exploration into multi-modal dimensions unlocking new avenues for innovation. In this survey, we carefully review over 300 articles, focusing on KG-aware research in two principal aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into the MMKG realm. We begin by defining KGs and MMKGs, then explore their construction progress. Our review includes two primary task categories: KG-aware multi-modal learning tasks, such as Image Classification and Visual Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph Completion and Entity Alignment, highlighting specific research trajectories. For most of these tasks, we provide definitions, evaluation benchmarks, and additionally outline essential insights for conducting relevant research. Finally, we discuss current challenges and identify emerging trends, such as progress in Large Language Modeling and Multi-modal Pre-training strategies. This survey aims to serve as a comprehensive reference for researchers already involved in or considering delving into KG and multi-modal learning research, offering insights into the evolving landscape of MMKG research and supporting future work.

Paper Structure

This paper contains 56 sections, 3 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Knowledge Graphs Meet Multi-modal Learning.
  • Figure 2: Comprehensive Overview of Integrating Knowledge Graphs with Multi-Modal Learning.
  • Figure 3: Roadmap for Multi-Modal Knowledge Graph (MMKG) construction and application in downstream multi-modal tasks.
  • Figure 4: Currently representative N-MMKG ontologies and corresponding MMKG examples using those ontologies.
  • Figure 5: Illustration of KG-based Visual Question Answering (VQA) (§ \ref{sec:kgr}) and Visual Referring Expressions (VRE) (§ \ref{sec:kgret}). To some extent, KG-based VRE can be viewed as an extension of KG-based VQA, incorporating an additional step of grounding answers.
  • ...and 8 more figures