Zero-Shot Relational Learning for Multimodal Knowledge Graphs

Rui Cai, Shichao Pei, Xiangliang Zhang

TL;DR

This work tackles zero-shot relational learning in multimodal knowledge graphs (MMKGs), where newly emerged relations must be inferred without any training triples. It introduces MRE (Multimodal Relation Extrapolation), an end-to-end framework with three components: a Multimodal Learner, a Structure Consolidator, and a Relation Embedding Generator. Together they fuse image and text signals with KG topology and generate embeddings for unseen relations. The Multimodal Learner aligns the visual and textual modalities via a masked autoencoder, the Structure Consolidator injects structural KG information through a GNN, and the Relation Embedding Generator uses a GAN-based objective to map relation descriptions to embeddings, which enables zero-shot inference. On FB15K-237-ZS, DB15K-ZS, and WN18-IMG-ZS, MRE outperforms strong baselines, indicating that multimodal signals combined with structural context improve extrapolation to unseen relations. The approach supports practical KG maintenance by enabling plausible reasoning about newly discovered relations that have no training triples. The authors note leveraging multiple images per entity as a direction for future work.

Abstract

Relational learning is an essential task in the domain of knowledge representation, particularly in knowledge graph completion (KGC). While relational learning in traditional single-modal settings has been extensively studied, exploring it within a multimodal KGC context presents distinct challenges and opportunities. One of the major challenges is inference on newly discovered relations without any associated training data. This zero-shot relational learning scenario poses unique requirements for multimodal KGC, i.e., utilizing multimodality to facilitate relational learning. However, existing works do not support leveraging multimodal information and leave the problem unexplored. In this paper, we propose a novel end-to-end framework consisting of three components, i.e., a multimodal learner, a structure consolidator, and a relation embedding generator, which integrates diverse multimodal information and knowledge graph structures to facilitate zero-shot relational learning. Evaluation results on three multimodal knowledge graphs demonstrate the superior performance of our proposed method.
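
The zero-shot step described above, generating an embedding for an unseen relation from its description and then using it to rank candidates, can be illustrated with a minimal sketch. This is not the authors' implementation: the single tanh layer standing in for the generator, the random weights, the cosine-similarity scoring against entity-pair embeddings, and all names (`generate_relation_embedding`, `rank_candidates`, the dimensions) are assumptions made purely for illustration.

```python
import math
import random

def generate_relation_embedding(desc_vec, weights, noise_dim, rng):
    """Hypothetical generator: description features + noise -> relation embedding."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    x = desc_vec + noise  # concatenate description features and noise
    # One linear layer followed by tanh (an assumption, not the paper's architecture).
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(rel_emb, pair_embs):
    """Return candidate indices sorted from most to least similar, plus the scores."""
    scores = [cosine(rel_emb, p) for p in pair_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

rng = random.Random(0)
desc_dim, noise_dim, emb_dim = 8, 4, 6  # illustrative sizes only

# Random stand-ins for a trained generator and for encoded inputs.
weights = [[rng.gauss(0.0, 1.0) for _ in range(desc_dim + noise_dim)]
           for _ in range(emb_dim)]
desc_vec = [rng.gauss(0.0, 1.0) for _ in range(desc_dim)]
pair_embs = [[rng.gauss(0.0, 1.0) for _ in range(emb_dim)] for _ in range(5)]

rel_emb = generate_relation_embedding(desc_vec, weights, noise_dim, rng)
order, scores = rank_candidates(rel_emb, pair_embs)
```

In a trained system the weights would come from adversarial training on seen relations, and the entity-pair embeddings would come from the multimodal and structural encoders rather than from random draws.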

Paper Structure

This paper contains 26 sections, 15 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: A toy example illustrating that new relations emerge as a multimodal knowledge graph evolves. The MMKG at $t_{0}$ has two branches. After $t_{1}$, two new relations emerge and should be added to the MMKG, even though they have no associated triples.
  • Figure 2: Training pipeline of MRE. The image-and-text pairs of entities are first masked and aligned through a reconstruction procedure in the Multimodal Learner. The multimodal pairs are then unmasked, and the cls tokens produced by the Joint Encoder are used to initialize the GNN Encoder and are fused with the KG's topology in the Structure Consolidator. The Relation Embedding Generator encodes relation descriptions and generates relation embeddings from them.
  • Figure 3: Comparison of MRE and IMF+ZSGAN in terms of $\textit{MRR}$ and $\textit{Hit@1}$ on FB15K-237-ZS under different $R_s$ and $R_u$ split ratios.
  • Figure 4: Comparison of MRE with different mask ratios $m$ and different noise embedding sizes $d$. All results are from the trained models that achieve the best results on the validation sets.
  • Figure 5: Two panels showing the t-SNE embedding distributions of 5 selected relations and their related entity pairs in two models. Blue points are generated relation embeddings and red points are the centers of each cluster.
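
Figure 3 reports $\textit{MRR}$ and $\textit{Hit@1}$. As a brief reminder of how these ranking metrics are conventionally computed from the rank of the correct answer in each test query, here is a small sketch. The rank values are made-up illustrative numbers, not results from the paper, and the function names are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Illustrative ranks of the correct entity for four hypothetical test triples.
ranks = [1, 2, 4, 1]

mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 0.25 + 1) / 4 = 0.6875
hit1 = hits_at_k(ranks, 1)         # 2 of 4 queries ranked first = 0.5
```

Both metrics lie in (0, 1], and higher is better. MRR rewards near-misses partially, while Hit@1 counts only exact top-ranked answers.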