
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models

Liu Qi, He Yongyi, Lian Defu, Zheng Zhi, Xu Tong, Liu Che, Chen Enhong

TL;DR

UniMEL tackles Multimodal Entity Linking by unifying Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to augment mentions and entities with multimodal context. The framework uses LLMs to generate concise entity descriptions and MLLMs to produce rich mention descriptions, followed by embedding-based retrieval to prune candidates and an LLM-based selector to pick the final entity from a small set. Key contributions include a four-module architecture, a universal prompt set, and a lightweight fine-tuning strategy that achieves state-of-the-art results on three public MEL datasets, with notable gains and demonstrated generality across several LLM backbones. The work highlights the practical impact of tightly integrating multimodal representations and reasoning to improve disambiguation in noisy, real-world multimodal knowledge bases.

Abstract

Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to the referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods focus heavily on using complex mechanisms and extensive model tuning to model the multimodal interaction on specific datasets. However, these methods overcomplicate the MEL task and overlook visual semantic information, which makes them costly and hard to scale. Moreover, these methods cannot resolve issues such as textual ambiguity, redundancy, and noisy images, which severely degrade their performance. Fortunately, the advent of Large Language Models (LLMs) with robust capabilities in text understanding and reasoning, particularly Multimodal Large Language Models (MLLMs) that can process multimodal inputs, provides new insights into addressing this challenge. However, how to design a universally applicable LLM-based MEL approach remains a pressing challenge. To this end, we propose UniMEL, a unified framework that establishes a new paradigm for processing multimodal entity linking tasks using LLMs. In this framework, we employ LLMs to augment the representation of mentions and entities individually by integrating textual and visual information and refining textual information. Subsequently, we employ an embedding-based method for retrieving and re-ranking candidate entities. Then, with only ~0.26% of the model parameters fine-tuned, LLMs can make the final selection from the candidate entities. Extensive experiments on three public benchmark datasets demonstrate that our solution achieves state-of-the-art performance, and ablation studies verify the effectiveness of all modules. Our code is available at https://github.com/Javkonline/UniMEL.
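
The embedding-based retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the mention and entity descriptions have already been embedded as vectors. The function names `cosine` and `top_k_candidates`, the toy three-dimensional vectors, and the use of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_candidates(mention_vec, entity_vecs, k=3):
    """Rank entities by similarity to the mention and keep the k best (hypothetical helper)."""
    scored = [(name, cosine(mention_vec, vec)) for name, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings, invented for illustration only.
entity_vecs = {
    "Apple Inc.": [0.9, 0.1, 0.0],
    "Apple (fruit)": [0.1, 0.9, 0.1],
    "Apple Records": [0.6, 0.2, 0.7],
}
mention_vec = [1.0, 0.2, 0.1]

candidates = top_k_candidates(mention_vec, entity_vecs, k=2)
for name, score in candidates:
    print(f"{name}: {score:.3f}")
```

Pruning to a small candidate set in this way keeps the later selection prompt short, which is the role the abstract assigns to retrieval before the LLM makes the final choice.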

Paper Structure (23 sections, 11 equations, 2 figures, 9 tables)

This paper contains 23 sections, 11 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: An example of Multimodal Entity Linking. Left: mention context, including mention image and mention text (highlighted in red). Right: the similar candidate entities in Multimodal Knowledge Base, each with its entity image and entity text (highlighted in blue).
  • Figure 2: An overview of the UniMEL framework, which consists of four modules: (a) MLLMs-based Mention Augmentation, (b) LLMs-based Entity Augmentation, (c) Retrieval Augmentation and (d) Multi-choice Selection. The input consists of a mention and its candidate entities. A frozen MLLM generates the mention description, a frozen LLM summarizes the entity descriptions, and a tuned LLM selects the referent entity for the mention.
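
The four modules named in Figure 2 can be wired together roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the model calls are passed in as plain callables and replaced by stubs here, and the names `build_selection_prompt` and `link_mention`, the prompt wording, and the file name `photo.jpg` are all hypothetical.

```python
from typing import Callable, Dict, List

def build_selection_prompt(mention: str, mention_desc: str, candidates: Dict[str, str]) -> str:
    """Format a multi-choice question over the shortlisted entities (hypothetical wording)."""
    lines = [f"Mention: {mention}", f"Mention description: {mention_desc}", "Candidates:"]
    for i, (name, summary) in enumerate(candidates.items()):
        lines.append(f"{chr(ord('A') + i)}. {name}: {summary}")
    lines.append("Answer with the letter of the referent entity.")
    return "\n".join(lines)

def link_mention(
    mention: str,
    image: str,
    entity_texts: Dict[str, str],
    describe_mention: Callable[[str, str], str],          # (a) stands in for a frozen MLLM
    summarize_entity: Callable[[str], str],               # (b) stands in for a frozen LLM
    retrieve: Callable[[str, Dict[str, str]], List[str]], # (c) embedding-based shortlist
    select: Callable[[str], str],                         # (d) stands in for the tuned LLM
) -> str:
    mention_desc = describe_mention(mention, image)
    summaries = {name: summarize_entity(text) for name, text in entity_texts.items()}
    shortlist = retrieve(mention_desc, summaries)
    candidates = {name: summaries[name] for name in shortlist}
    letter = select(build_selection_prompt(mention, mention_desc, candidates))
    return shortlist[ord(letter) - ord("A")]

# Stub components, invented so the sketch runs without any model.
entity_texts = {
    "Apple Inc.": "American technology company. Founded 1976.",
    "Apple (fruit)": "Edible fruit. Grown worldwide.",
}
result = link_mention(
    mention="Apple",
    image="photo.jpg",
    entity_texts=entity_texts,
    describe_mention=lambda m, img: f"{m}, shown as a logo on a laptop",
    summarize_entity=lambda text: text.split(".")[0],
    retrieve=lambda desc, summaries: list(summaries)[:2],
    select=lambda prompt: "A",
)
print(result)
```

Passing the components in as callables keeps the control flow independent of any particular backbone, which mirrors the summary's claim that the framework works across several LLMs.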