DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking

Shezheng Song, Shasha Li, Shan Zhao, Xiaopeng Li, Chengyu Wang, Jie Yu, Jun Ma, Tianwei Yan, Bin Ji, Xiaoguang Mao

TL;DR

DWE+ tackles multimodal entity linking by addressing image noise and the semantic misalignment between knowledge-base entities and their representations. It introduces object-level, fine-grained visual features, explicit visual attributes, and static/dynamic entity representations enriched with Wikipedia descriptions and large language models. The framework uses gated feature fusion and hierarchical contrastive training to align coarse (text/image) and fine (mention/visual objects) semantics, achieving state-of-the-art results on the enhanced datasets Rich-S/Wiki-S/Diverse-S and Rich-D/Wiki-D/Diverse-D. The enhanced datasets are publicly released, and the reported MEL improvements are relevant to robust multimodal knowledge retrieval and linking tasks.

Abstract

Multimodal entity linking (MEL) aims to use multimodal information (usually textual and visual) to link ambiguous mentions to unambiguous entities in a knowledge base. Current methods face three main issues: (1) treating the entire image as input may introduce redundant information; (2) entity-related information, such as attributes in images, is insufficiently utilized; (3) there is semantic inconsistency between an entity in the knowledge base and its representation. To this end, we propose DWE+ for multimodal entity linking. DWE+ can capture finer semantics and dynamically maintain semantic consistency with entities. This is achieved in three ways: (a) We introduce a method for extracting fine-grained image features by partitioning the image into multiple local objects. Hierarchical contrastive learning is then used to further align semantics between coarse-grained information (text and image) and fine-grained information (mention and visual objects). (b) We explore ways to extract visual attributes from images, such as facial features and identity, to enhance the fused feature. (c) We leverage Wikipedia and ChatGPT to capture the entity representation, achieving semantic enrichment from both static and dynamic perspectives, which better reflects real-world entity semantics. Experiments on the Wikimel, Richpedia, and Wikidiverse datasets demonstrate the effectiveness of DWE+ in improving MEL performance. Specifically, we optimize these datasets and achieve state-of-the-art performance on the enhanced datasets. The code and enhanced datasets are released at https://github.com/season1blue/DWET
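
To make the idea of hierarchical contrastive learning more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an in-batch InfoNCE loss applied at two levels (coarse: text vs. image; fine: mention vs. visual objects) and combined with a weight. The function names, the `weight` and `temperature` parameters, and the assumption that each sample's visual objects are already pooled into a single nonzero vector are all hypothetical; the abstract does not specify the actual loss.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(queries: List[Vector], keys: List[Vector],
             temperature: float = 0.1) -> float:
    """Mean in-batch InfoNCE loss, where queries[i] should match keys[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, key) / temperature for key in keys]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]
    return total / len(queries)


def hierarchical_loss(text: List[Vector], image: List[Vector],
                      mention: List[Vector], objects: List[Vector],
                      weight: float = 0.5,
                      temperature: float = 0.1) -> float:
    """Weighted sum of a coarse and a fine contrastive loss (assumed form)."""
    coarse = info_nce(text, image, temperature)      # text vs. whole image
    fine = info_nce(mention, objects, temperature)   # mention vs. pooled objects
    return weight * coarse + (1.0 - weight) * fine


# Toy batch of two samples with 2-dimensional features.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0]]
mention = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0]]

aligned = hierarchical_loss(text, image, mention, objects)
# Swapping the image/object order breaks the pairing and should raise the loss.
shuffled = hierarchical_loss(text, image[::-1], mention, objects[::-1])
```

In this toy batch the correctly paired features give a loss close to zero, while the shuffled pairing gives a much larger one, which is the behavior a contrastive objective is meant to reward.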

Paper Structure (37 sections, 12 equations, 6 figures, 12 tables)

This paper contains 37 sections, 12 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Example of entity linking for the mention Trump.
  • Figure 2: Dual-way Matching between mention and entity.
  • Figure 3: Textual Description of Donald Trump from Wikipedia.
  • Figure 4: Overview of our method. Input consists of image $I$, text $t$, and mention $m$. Object detection is applied to extract object feature $d_i$ from image. Facial feature $f$ and identity feature $s$ are retrieved from image.
  • Figure 5: The preprocessing results on ViT, analyzing the similarity with facial images in the library. ViT is pretrained on MS-Celeb1M.
  • ...and 1 more figures