KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking

Juyeon Kim, Geon Lee, Taeuk Kim, Kijung Shin

TL;DR

KGMEL tackles multimodal entity linking (MEL) by using knowledge-graph (KG) triples to reduce ambiguity that textual and visual signals alone leave unresolved. It introduces a generate-retrieve-rerank pipeline: (1) generate KG triples for mentions using vision-language models, (2) learn joint mention-entity embeddings from text, images, and triples to retrieve candidate entities, and (3) rerank candidates by filtering triples and invoking an LLM to select the best match. The approach achieves state-of-the-art results on three MEL benchmarks, with gains of up to $19.13\%$ in HITS@1, and ablations confirm that visual input, triples, and gated fusion each contribute to performance. These results indicate that KG structure substantially improves MEL, with practical benefits for semantic search, question answering, and related tasks. Code and datasets are publicly available for reproducibility.

Abstract

Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.
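
The retrieval stage is described as learning joint mention-entity representations via contrastive learning. As a rough illustration of that idea only, the sketch below computes an in-batch InfoNCE-style loss in plain Python. The function name `info_nce_loss`, the temperature value, and the choice of in-batch negatives are assumptions made for this example; the abstract does not specify the loss KGMEL actually uses.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce_loss(mention_embs, entity_embs, temperature=0.1):
    """Mean in-batch contrastive loss (hypothetical helper, not from the paper).

    Assumption: mention i's gold entity is entity i, and every other entity
    in the batch serves as a negative.
    """
    mentions = [normalize(m) for m in mention_embs]
    entities = [normalize(e) for e in entity_embs]
    total = 0.0
    for i, mention in enumerate(mentions):
        # Cosine similarities scaled by the temperature.
        logits = [dot(mention, e) / temperature for e in entities]
        # Numerically stable log-sum-exp over all entities in the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability assigned to the gold entity.
        total += log_denominator - logits[i]
    return total / len(mentions)


aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each mention is closest to its own entity and large when the pairing is wrong, which is the property a contrastive objective relies on to pull matching mention-entity pairs together.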

Paper Structure

This paper contains 21 sections, 14 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An example of multimodal entity linking (MEL) using KGMEL. KGMEL generates triples for the mention to be matched with knowledge graph (KG) triples in the knowledge base (KB). In the figure, blue and yellow arrows point to triples derived from visual and textual context, respectively.
  • Figure 2: (Left) Comparison of the average number and word length of descriptions and triples per entity across WikiDiverse, RichpediaMEL, and WikiMEL datasets. (Right) t-SNE visualization illustrating the contextual similarity between mention sentences, entity descriptions, and entity triples.
  • Figure 3: Overview of KGMEL. Our framework consists of three stages: (1) Generation: We generate triples for mentions using VLMs. (2) Retrieval: We obtain joint embeddings by integrating textual, visual, and triple-based embeddings, and use them to retrieve $K$ candidates. (3) Reranking: For each candidate, we filter out KG triples that are irrelevant to the mention, and then determine the best-matching entity using LLMs.
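
The retrieval step in Figure 3 can be pictured with a small sketch: fuse the three per-modality embeddings into one joint vector, then rank entities by similarity and keep the top $K$. Everything specific here is an assumption for illustration: the scalar sigmoid gate on the triple embedding, the additive fusion, cosine similarity as the score, and the entity IDs `Q1` to `Q3` are all hypothetical, and KGMEL's actual gated fusion may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(text_emb, image_emb, triple_emb, gate_logit=0.0):
    """Hypothetical gated fusion: add the triple embedding scaled by a gate in (0, 1)."""
    gate = sigmoid(gate_logit)
    return [t + v + gate * k for t, v, k in zip(text_emb, image_emb, triple_emb)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def retrieve_top_k(mention_emb, entity_embs, k):
    """Return the IDs of the k entities most similar to the mention."""
    ranked = sorted(
        entity_embs.items(),
        key=lambda item: cosine(mention_emb, item[1]),
        reverse=True,
    )
    return [entity_id for entity_id, _ in ranked[:k]]


# Toy 2-d embeddings; real embeddings would come from trained encoders.
entity_embs = {"Q1": [1.0, 0.0], "Q2": [0.0, 1.0], "Q3": [1.0, 1.0]}
mention_emb = fuse([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], gate_logit=0.0)
print(retrieve_top_k(mention_emb, entity_embs, k=2))
```

In this toy setup the gate is a fixed number; in a trained model it would presumably be learned, letting the model decide how much the triple signal should contribute for each mention.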