Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia

Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West

TL;DR

The paper develops LocEI (Localized Entity Insertion), a framework for entity insertion, and its multilingual variant XLocEI. XLocEI outperforms all baseline models and can be applied zero-shot to languages not seen during training with minimal performance drop.

Abstract

Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only identifying a suitable pair of source and target entities but also understanding the content of the source to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is both relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner to languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.

Paper Structure (40 sections, 6 equations, 11 figures, 16 tables)


Figures (11)

  • Figure 1: Entity linking: insert a link to the entity Margaret "Peggy" Woolley by identifying a suitable mention in the existing text of the version before insertion, vs. entity insertion: no mention exists yet, so identify the most suitable span ab in the version before insertion to insert the entity Private school.
  • Figure 2: Challenges of entity insertion. (Left) Micro (weighted by the number of data points in a language) and macro (equal weight to each language) aggregates of insertion types over the 105 languages considered in this study. (Right) Complementary cumulative distribution function (CCDF) of the number of candidate sentences ($N$) in a Wikipedia article (log x-axis).
  • Figure 3: Data processing pipeline. Obtain the added links $L$ by taking the set difference of the links existing in consecutive months. For each added link $L_i$, scan all $M$ versions in the full revision history, $v_0^i$ to $v_M^i$, to identify the article version in which the link was added, then compute the difference between the before and after versions to extract the exact entity insertion scenario.
  • Figure 4: Architectural overview of LocEI. The target entity $E_{tgt}$ and each candidate text span $x \in \mathcal{X}_{src}$ of the source entity $E_{src}$ are concatenated together and encoded jointly using a transformer encoder. The relevance scores of candidate text spans are computed using an MLP trained via a list-wise ranking objective.
  • Figure 5: The distribution of entity insertion categories across the 20 considered Wikipedia language versions from October to November 2023. The x-axis shows the language code and the number of links added in each language.
  • ...and 6 more figures
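
The data processing steps summarized in the Figure 3 caption (set difference of links, scan of the revision history, diff of the before and after versions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Revision` class, the function names, the toy data, and the choice to pick the latest absent-to-present transition when a link was added more than once are all assumptions made for illustration. The sentence-level diff uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever diffing the paper actually uses.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set, Tuple

# Hypothetical representation of a link: (source article, target entity).
Link = Tuple[str, str]


@dataclass
class Revision:
    """One article version: its sentences and the link targets it contains."""
    sentences: List[str]
    targets: Set[str]


def added_links(before: Set[Link], after: Set[Link]) -> Set[Link]:
    """Links present in the later snapshot but absent from the earlier one."""
    return after - before


def find_insertion_index(history: Sequence[Revision], target: str) -> Optional[int]:
    """Index k of the latest revision that has `target` while revision k-1 lacks it.

    Assumption: if a link was added, removed, and re-added, the most recent
    addition is the one of interest. Returns None if no such transition exists.
    """
    for k in range(len(history) - 1, 0, -1):
        if target in history[k].targets and target not in history[k - 1].targets:
            return k
    return None


def changed_sentences(
    before: Revision, after: Revision
) -> List[Tuple[str, List[str], List[str]]]:
    """Sentence-level diff: (operation, old sentences, new sentences) per change."""
    matcher = difflib.SequenceMatcher(
        a=before.sentences, b=after.sentences, autojunk=False
    )
    return [
        (tag, before.sentences[i1:i2], after.sentences[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Toy data (invented for this sketch): link sets for two consecutive months.
october = {("School", "Education")}
november = {("School", "Education"), ("School", "Private school")}

# Toy revision history of the source article, oldest first.
history = [
    Revision(["A school educates students."], {"Education"}),
    Revision(
        ["A school educates students.", "Some schools charge tuition."],
        {"Education"},
    ),
    Revision(
        [
            "A school educates students.",
            "Some schools, called private schools, charge tuition.",
        ],
        {"Education", "Private school"},
    ),
]

new_links = added_links(october, november)
for source, target in sorted(new_links):
    k = find_insertion_index(history, target)
    if k is not None:
        for op, old, new in changed_sentences(history[k - 1], history[k]):
            print(source, "->", target, "|", op, "|", old, "=>", new)
```

Scanning backward from the newest revision is a design choice of this sketch: the link is known to exist in the later snapshot, so the most recent absent-to-present transition is the one that produced it.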