Multilingual Entity Linking Using Dense Retrieval

Dominik Farhan

TL;DR

This work tackles multilingual entity linking under resource constraints, demonstrating that competitive EL can be achieved with fast training on public data. It analyzes baselines based on alias tables and embeddings, then advances to bi-encoder fine-tuning using the LEALLA family on the DaMuEL dataset, achieving strong Spanish results and competitive cross-lingual performance. The study provides a thorough hyperparameter analysis (index rebuilding, batch size, hard negatives, logit scaling) and investigates cross-lingual transfer, including the potential of using training contexts as a knowledge base. Overall, the results show that high-quality, multilingual EL is attainable with modest infrastructure, paving the way for broader accessibility and reproducibility in EL research.

Abstract

Entity linking (EL) is the computational process of connecting textual mentions to corresponding entities. Like many areas of natural language processing, the EL field has greatly benefited from deep learning, leading to significant performance improvements. However, present-day approaches are expensive to train and rely on diverse data sources, complicating their reproducibility. In this thesis, we develop multiple systems that are fast to train, demonstrating that competitive entity linking can be achieved without a large GPU cluster. Moreover, we train on a publicly available dataset, ensuring reproducibility and accessibility. Our models are evaluated on 9 languages, giving an accurate overview of their strengths. Furthermore, we offer a detailed analysis of the training hyperparameters of bi-encoders, a popular approach in EL, to guide their informed selection. Overall, our work shows that building competitive neural-network-based EL systems that operate in multiple languages is possible even with limited resources, thus making EL more approachable.

Paper Structure

This paper contains 87 sections, 11 equations, 15 figures, 21 tables, 3 algorithms.

Figures (15)

  • Figure 1: An example of Wikidata statements for San Francisco. By Jeblad, CC BY-SA 3.0, via Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Linked_Data_-_San_Francisco.svg.
  • Figure 2: A general layout of bi-encoder models used in EL.
  • Figure 3: An illustration of an EL system based on aliases and an embedding model.
  • Figure 4: Overview of the proposed bi-encoder. Observe that just one model is used.
  • Figure 5: A diagram showing how to use a bi-encoder to produce logits for softmax.
  • ...and 10 more figures
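
The bi-encoder training objective named in the TL;DR and in the caption of Figure 5 (logits for a softmax, with logit scaling) can be illustrated with a small sketch. This is not the thesis's implementation: the function names, the NumPy toy vectors standing in for encoder outputs, and the `logit_scale` value of 20.0 are all assumptions made for illustration. It assumes the common in-batch setup in which row i of the mention embeddings has its gold entity in row i of the entity embeddings, and every other entity in the batch acts as a negative.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def in_batch_softmax_loss(mention_emb, entity_emb, logit_scale=20.0):
    """Mean cross-entropy where the gold entity for mention i is entity i.

    logit_scale is a hypothetical value; it multiplies the cosine similarities
    before the softmax so that the distribution can become sharp.
    """
    m = l2_normalize(mention_emb)
    e = l2_normalize(entity_emb)
    logits = logit_scale * (m @ e.T)                      # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # gold pairs sit on the diagonal


# Toy data: four mutually orthogonal "entity" vectors and slightly noisy "mentions".
rng = np.random.default_rng(0)
entities = np.eye(4, 8)
mentions = entities + 0.01 * rng.normal(size=(4, 8))

loss_matched = in_batch_softmax_loss(mentions, entities)         # correct alignment
loss_shuffled = in_batch_softmax_loss(mentions, entities[::-1])  # every pair mismatched
```

With correctly aligned pairs the loss is close to zero, while reversing the entity order (so that no mention faces its own entity) gives a much larger loss. In a real system the vectors would come from the encoder, and hard negatives would add extra entity rows to each batch.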