Multi-modal Knowledge Graph Generation with Semantics-enriched Prompts
Yajing Xu, Zhiqiang Liu, Jiaoyan Chen, Mingchen Tu, Zhuo Chen, Jeff Z. Pan, Yichi Zhang, Yushan Zhu, Wen Zhang, Huajun Chen
TL;DR
The paper tackles the challenge of enriching conventional knowledge graphs with high-quality, contextually relevant images. It proposes a pipeline that selects visualizable and structurally informative neighbors, generates semantics-enriched prompts with an LLM, and synthesizes images with a diffusion model. Neighbor selection is handled by the Visualizable Structural Neighbor Selection (VSNS) framework, which comprises Visualizable Neighbor Selection (VNS) and Structural Neighbor Selection (SNS); combined with prompt generation and diffusion-based synthesis, it creates MMKGs-A from KGs. Evaluations on MKG-Y and DB15K show improved image quality (lower FID, higher CLIPScore) and stronger alignment with KG content, as well as positive downstream effects on multi-modal knowledge graph completion (MMKGC). The results support the viability of automated, neighbor-informed prompt generation and diffusion-based image synthesis for scalable MMKG construction; future work is to address abstract entities and broader downstream tasks.
Abstract
Multi-modal Knowledge Graphs (MMKGs) have been widely applied across various domains for knowledge representation. However, far fewer MMKGs exist than are needed, and their construction faces numerous challenges, particularly in selecting high-quality, contextually relevant images for knowledge graph enrichment. To address these challenges, we present a framework for constructing MMKGs from conventional KGs. Furthermore, to generate higher-quality images that are more relevant to the context in the given knowledge graph, we design a neighbor selection method called Visualizable Structural Neighbor Selection (VSNS). This method consists of two modules: Visualizable Neighbor Selection (VNS) and Structural Neighbor Selection (SNS). The VNS module filters out relations that are difficult to visualize, while the SNS module selects the neighbors that most effectively capture the structural characteristics of the entity. To evaluate the quality of the generated images, we perform qualitative and quantitative evaluations on two datasets, MKG-Y and DB15K. The experimental results indicate that selecting neighbors with VSNS yields higher-quality images that are more relevant to the knowledge graph.
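The two-stage neighbor selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how VNS decides that a relation is hard to visualize or how SNS scores neighbors. The hard-coded `NON_VISUAL_RELATIONS` set, the relation-rarity ranking in `structural_neighbors`, and the wording produced by `build_prompt_request` are therefore all assumptions made for illustration. The LLM call and the diffusion step are left out.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Assumption: a hand-written set of relations treated as hard to depict.
# The paper's VNS module makes this decision itself; its method is not shown here.
NON_VISUAL_RELATIONS = frozenset({"openingDate", "populationTotal"})


def visualizable_neighbors(entity, triples, non_visual=NON_VISUAL_RELATIONS):
    """VNS stand-in: keep the entity's outgoing triples whose relation is not flagged."""
    return [t for t in triples if t.head == entity and t.relation not in non_visual]


def structural_neighbors(candidates, triples, k=2):
    """SNS stand-in: prefer triples whose relation is rare in the KG.

    Rarity is an assumed proxy for "structurally informative"; ties are
    broken alphabetically so the result is deterministic.
    """
    freq = Counter(t.relation for t in triples)
    ranked = sorted(candidates, key=lambda t: (freq[t.relation], t.relation, t.tail))
    return ranked[:k]


def build_prompt_request(entity, neighbors):
    """Format the selected neighbors as an instruction that could be sent to an LLM."""
    facts = "; ".join(f"{t.head} {t.relation} {t.tail}" for t in neighbors)
    return f"Write an image-generation prompt depicting '{entity}'. Known facts: {facts}."
```

In a full pipeline, the string returned by `build_prompt_request` would go to an LLM, and the LLM's output would then serve as the text prompt for a diffusion model. Both steps depend on the specific models used and are omitted here.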
