Multi-level Matching Network for Multimodal Entity Linking

Zhiwei Hu, Víctor Gutiérrez-Basulto, Ru Li, Jeff Z. Pan

TL;DR

This work introduces ${\rm M^3EL}$, a Multi-level Matching Network for Multimodal Entity Linking (MEL), to address two gaps in existing MEL methods: they ignore negative samples drawn from the same modality, and they lack bidirectional cross-modal interaction. ${\rm M^3EL}$ combines three modules, namely Multimodal Feature Extraction (with intra-modal contrastive learning), an Intra-modal Matching Network, and a Cross-modal Matching Network, under a joint training objective that unifies intra- and cross-modal signals. Experiments on WikiMEL, RichpediaMEL, and WikiDiverse show that ${\rm M^3EL}$, including its ${\rm attr}$ and ${\rm desc}$ variants, outperforms state-of-the-art baselines, particularly on $\text{Hits@1}$ and in low-resource settings. The results indicate that combining discriminative unimodal representations with bidirectional cross-modal interaction yields robust MEL performance.

Abstract

Multimodal entity linking (MEL) aims to link ambiguous mentions within multimodal contexts to corresponding entities in a multimodal knowledge base. Most existing approaches to MEL are based on representation learning or vision-and-language pre-training mechanisms for exploring the complementary effect among multiple modalities. However, these methods suffer from two limitations. On the one hand, they overlook the possibility of considering negative samples from the same modality. On the other hand, they lack mechanisms to capture bidirectional cross-modal interaction. To address these issues, we propose a Multi-level Matching network for Multimodal Entity Linking (M3EL). Specifically, M3EL is composed of three different modules: (i) a Multimodal Feature Extraction module, which extracts modality-specific representations with a multimodal encoder and introduces an intra-modal contrastive learning sub-module to obtain better discriminative embeddings based on uni-modal differences; (ii) an Intra-modal Matching Network module, which contains two levels of matching granularity: Coarse-grained Global-to-Global and Fine-grained Global-to-Local, to achieve local and global level intra-modal interaction; (iii) a Cross-modal Matching Network module, which applies bidirectional strategies, Textual-to-Visual and Visual-to-Textual matching, to implement bidirectional cross-modal interaction. Extensive experiments conducted on WikiMEL, RichpediaMEL, and WikiDiverse datasets demonstrate the outstanding performance of M3EL when compared to the state-of-the-art baselines.
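The intra-modal contrastive learning idea from module (i) can be illustrated with a small sketch. The abstract does not give the loss, so everything below is an assumption made for illustration: an InfoNCE-style objective, cosine similarity, a `temperature` parameter, and the use of the other entities in a batch as same-modality negatives. The function names `cosine` and `intra_modal_info_nce` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_modal_info_nce(mentions, entities, temperature=0.1):
    """Mean InfoNCE-style loss within a single modality (assumed form).

    mentions[i] and entities[i] are features of the same modality
    (for example, both textual) and form the positive pair. Every
    other entity in the batch serves as a same-modality negative.
    """
    losses = []
    for i, mention in enumerate(mentions):
        logits = [cosine(mention, e) / temperature for e in entities]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the positive entity.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two mentions with their gold entities, one modality only.
mention_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_entities = [[1.0, 0.0], [0.0, 1.0]]
swapped_entities = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = intra_modal_info_nce(mention_feats, aligned_entities)
swapped_loss = intra_modal_info_nce(mention_feats, swapped_entities)
```

In this toy batch the loss is small when each mention sits next to its own entity and large when the entities are swapped, which is the behaviour a contrastive objective is meant to reward. A full model would compute such a term per modality and combine it with the matching losses in the joint objective.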

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: An example of MEL. Dotted boxes of different colors represent different features: purple for the mention textual description (mention text), orange for the mention visual context (mention image), green for the entity textual description (entity text), and blue for the entity visual context (entity image).
  • Figure 2: The structure of the ${\rm M^3EL}$ model, containing three modules: Multimodal Feature Extraction (MFE) with Intra-modal Contrastive Learning (ICL), Intra-modal Matching Network (IMN), and Cross-modal Matching Network (CMN). Att and M-Att denote the attention and multi-head attention mechanisms, respectively.
  • Figure 3: Illustrative comparison of intra-modal contrastive learning with CLIP, where the red dashed lines represent the positive samples, yellow lines denote the negative samples in CLIP, purple and blue lines represent the inner-source and intra-source negative samples. Circles and squares represent textual and visual features, $e$ and $m$ represent entity and mention, respectively.
  • Figure 4: Parameter sensitivity experiments under different conditions on WikiMEL, RichpediaMEL and WikiDiverse datasets.