VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

Athanasios Efthymiou, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

TL;DR

Experiments on WN9-IMG and two novel fine art MKGs demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks, highlighting the value of VLMs for multimodal KGE.

Abstract

Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.

Paper Structure (29 sections, 8 equations, 4 figures, 9 tables)


Figures (4)

  • Figure 1: Example triples from the WN9-IMG dataset. Entities correspond to ImageNet synsets, represented by sets of images (shown in red) and their WordNet textual definitions (shown in cyan), connected by semantic relations.
  • Figure 2: Example subgraphs from the WikiArt-MKGs. Artworks are represented visually, while associated entities (e.g., artists, styles, genres, locations) are represented textually. WikiArt-MKG-v1 (inner dashed box) captures core artwork-level relations, whereas WikiArt-MKG-v2 (outer dashed box) extends the graph with additional entity types and richer semantic links.
  • Figure 3: Qualitative comparison of zero-shot CLIP and VL-ComplEx (base: CLIP) on WikiArt-MKG-v2. Given an artwork (top rows) or an artist (bottom rows) as a query, we show the top-5 predicted entities for selected relations. For artist queries, we use only textual input representations. Correctly retrieved entities are shown in bold.
  • Figure 4: Per-relation mean reciprocal rank (MRR) on the WikiArt-MKG-v2 validation set for zero-shot CLIP and VL-KGEs.