DiffusionCom: Structure-Aware Multimodal Diffusion Model for Multimodal Knowledge Graph Completion

Wei Huang, Meiyu Liang, Peining Li, Xu Hou, Yawen Li, Junping Du, Zhe Xue, Zeli Guan

TL;DR

This paper tackles multimodal knowledge graph completion (MKGC) by reframing it as generative joint-distribution modeling. It introduces DiffusionCom, a diffusion-based framework that generates the joint distribution between the $(head, relation)$ pair and candidate tails, conditioned on a structure-aware multimodal encoder, Structure-MKGformer. The encoder fuses textual, visual, and graph-structural cues via a multimodal graph attention network (MGAT) and adaptive fusion, while a conditional denoiser guides the reverse diffusion. Training combines generative and discriminative losses to draw on the strengths of both paradigms. On FB15k-237-IMG and WN18-IMG, DiffusionCom achieves state-of-the-art performance, with notable gains in Hits@1, and ablations confirm the importance of MGAT, the denoiser design, and the dual training objective. The work highlights the practical potential of diffusion models for MKGC and the value of structure-aware multimodal representations for complex reasoning tasks.

Abstract

Most current MKGC approaches are based on discriminative models that maximize conditional likelihood. These approaches struggle to efficiently capture the complex connections in real-world knowledge graphs, which limits their overall performance. To address this issue, we propose a structure-aware multimodal Diffusion model for multimodal knowledge graph Completion (DiffusionCom). DiffusionCom approaches the problem from the perspective of generative models, modeling the association between the $(head, relation)$ pair and candidate tail entities as their joint probability distribution $p((head, relation), (tail))$, and framing the MKGC task as a process of gradually generating the joint probability distribution from noise. Furthermore, to fully leverage the structural information in MKGs, we propose Structure-MKGformer, an adaptive and structure-aware multimodal knowledge representation learning method, as the encoder for DiffusionCom. Structure-MKGformer captures rich structural information through a multimodal graph attention network (MGAT) and adaptively fuses it with entity representations, thereby enhancing the structural awareness of these representations. This design addresses the limitations of existing MKGC methods, particularly those based on multimodal pre-trained models, in utilizing structural information. DiffusionCom is trained using both generative and discriminative losses for the generator, while the feature extractor is optimized exclusively with the discriminative loss. This dual approach allows DiffusionCom to harness the strengths of both generative and discriminative models. Extensive experiments on the FB15k-237-IMG and WN18-IMG datasets demonstrate that DiffusionCom outperforms state-of-the-art models.
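To make "gradually generating the joint probability distribution from noise" concrete, the following is a minimal sketch of the standard DDPM forward (noising) step applied to a target vector over candidate tails. It is not the authors' implementation: the one-hot target, the linear noise schedule, the 1000-step horizon, and all function names are assumptions chosen for illustration, and the oracle noise stands in for what a trained conditional denoiser would have to predict.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed, common default)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * n for x, n in zip(x0, noise)]
    return xt, noise


def predict_x0(xt, predicted_noise, t, alpha_bars):
    """Invert the forward step given a noise estimate (the denoiser's job)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(x - sigma * n) / signal for x, n in zip(xt, predicted_noise)]


# Hypothetical query (head, relation) with 5 candidate tails; index 2 is the true tail.
rng = random.Random(0)
x0 = [0.0, 0.0, 1.0, 0.0, 0.0]
alpha_bars = cumulative_alphas(linear_beta_schedule(num_steps=1000))

# At the last step almost all signal is gone and x_t is close to pure noise.
xt, noise = forward_noise(x0, t=999, alpha_bars=alpha_bars, rng=rng)

# With the exact noise (an oracle denoiser), the clean target is recovered.
recovered = predict_x0(xt, noise, t=999, alpha_bars=alpha_bars)
best_tail = max(range(len(recovered)), key=lambda i: recovered[i])
```

In a trained model the oracle noise would be replaced by the output of a denoiser conditioned on the encoder's representation of the $(head, relation)$ pair, and the reverse process would run over many steps rather than a single inversion.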

Paper Structure

This paper contains 26 sections, 11 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: DiffusionCom for multimodal knowledge graph completion. (a) We propose to model the correlation between the $(head, relation)$ pair and the candidate tail entities as their joint probability. (b) Diffusion models have demonstrated strong generative capabilities across various fields. Leveraging their coarse-to-fine generative characteristics, we employ diffusion models to generate joint probabilities.
  • Figure 2: Framework of the proposed DiffusionCom method.
  • Figure 3: An example in the knowledge graph.
  • Figure 4: Parameter sensitivity analysis on the FB15k-237-IMG dataset with respect to (a) the number of Multimodal Graph Attention Network layers and (b) the number of CDenoiser blocks in Conditional Denoising.
  • Figure 5: Parameter sensitivity analysis on the FB15k-237-IMG dataset with respect to (a) the hidden size of the MLP layer in the CDenoiser block, (b) the number of diffusion steps, and (c) the learning rate.
  • ...and 1 more figure