DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model

Shezheng Song, Shasha Li, Jie Yu, Shan Zhao, Xiaopeng Li, Jun Ma, Xiaodong Liu, Zhuo Li, Xiaoguang Mao

TL;DR

This work tackles multimodal entity linking by addressing two core issues: ambiguous, static entity representations and limited exploitation of image information. It introduces ChatGPT-driven dynamic entity representations to refresh knowledge-base entities and presents the Dynamic Integration of Multimodal information (DIM), a framework that combines CLIP-based textual/visual features with BLIP-2–derived expert cues, fused through multi-head attention and optimized with N-pair loss. The authors validate DIM on both original and ChatGPT-augmented datasets (Wiki+, Rich+, Diverse+), achieving strong gains and state-of-the-art performance on the enhanced sets, demonstrating improved alignment between visual cues and evolving entity semantics. The work also provides datasets and code to support reproducibility and further research, while acknowledging potential biases inherent in large-language-model–driven data augmentation and suggesting avenues to mitigate them.

Abstract

Our study addresses Multimodal Entity Linking (MEL), which aligns a mention in multimodal content with entities in a knowledge base. Existing methods still face challenges such as ambiguous entity representations and limited use of image information. We therefore propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances the datasets. We also propose a method, Dynamically Integrate Multimodal information with knowledge base (DIM), which employs the capability of a Large Language Model (LLM) for visual understanding. The LLM, such as BLIP-2, extracts entity-relevant information from the image, which can improve the extraction of entity features and help link them with the dynamic entity representations provided by ChatGPT. Experiments demonstrate that DIM outperforms the majority of existing methods on the three original datasets and achieves state-of-the-art (SOTA) results on the dynamically enhanced datasets (Wiki+, Rich+, Diverse+). For reproducibility, our code and collected datasets are released at https://github.com/season1blue/DIM.
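
To make the linking task concrete, the following is a minimal sketch of entity linking as similarity ranking: a mention embedding is scored against candidate entity embeddings and the highest-scoring entity is chosen. It is an illustration only, not the DIM implementation. The function names `cosine` and `link_mention`, the three-dimensional toy vectors, and the second candidate entity are all hypothetical; in the paper the features come from CLIP and BLIP-2 rather than hand-written numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(mention_vec, entity_vecs):
    """Return the best-matching entity name and all similarity scores.

    mention_vec: fused embedding of the mention (toy values here).
    entity_vecs: dict mapping entity name -> entity embedding.
    """
    scores = {name: cosine(mention_vec, vec) for name, vec in entity_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings; real ones would come from trained encoders.
candidates = {
    "Taylor Alison Swift": [0.9, 0.1, 0.2],
    "Taylor Lautner": [0.2, 0.8, 0.1],  # hypothetical distractor entity
}
mention = [0.8, 0.2, 0.3]  # hypothetical embedding of the mention "Taylor"

best, scores = link_mention(mention, candidates)
print(best)
```

With these toy values the mention vector lies closer to the first candidate, so that entity is selected. Richer entity descriptions, such as the ChatGPT-generated ones the abstract describes, would change the entity vectors being compared, not the ranking step itself.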

Paper Structure (17 sections, 4 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The process of integrating human cognition with information in the knowledge base. When someone says "This is why Taylor ....", MEL aims to link the mention "Taylor" to "Taylor Alison Swift" in the knowledge base to facilitate further understanding of user semantics.
  • Figure 2: Statistics of enhanced datasets including Richpedia, Wikimel, and Wikidiverse.
  • Figure 3: Model overview. The example is an image with mention $m$ = "Trump" and text $t$ = "Trump and his wife Melania at Wedding". $c$ is the output of the expert model. The N-pair loss used in contrastive learning keeps same-category samples close together and different-category samples far apart.
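
The Figure 3 caption describes what the N-pair loss is meant to achieve. The sketch below shows one common formulation of that loss, assuming the standard multi-class N-pair objective, in which each anchor is pulled toward its own positive and pushed away from the positives of the other samples in the batch. The page does not give the exact formula DIM uses, so treat this as an assumption; the function names `dot` and `n_pair_loss` and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchors, positives):
    """Mean N-pair loss over a batch (standard formulation, assumed here).

    anchors[i] is matched with positives[i]; every positives[j] with
    j != i acts as a negative for anchors[i]. Per-sample loss:
        log(1 + sum_{j != i} exp(a_i . p_j - a_i . p_i))
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        pos_score = dot(anchors[i], positives[i])
        neg_sum = sum(
            math.exp(dot(anchors[i], positives[j]) - pos_score)
            for j in range(n)
            if j != i
        )
        total += math.log1p(neg_sum)
    return total / n


# Toy batch: two mention embeddings and two entity embeddings.
mentions = [[1.0, 0.0], [0.0, 1.0]]
entities_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each mention matches its entity
entities_swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairs deliberately mismatched

aligned_loss = n_pair_loss(mentions, entities_aligned)
swapped_loss = n_pair_loss(mentions, entities_swapped)
print(aligned_loss < swapped_loss)
```

In this toy batch the correctly paired embeddings yield a lower loss than the mismatched ones, which is the behaviour the caption describes: matching pairs are drawn together while non-matching pairs are separated.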