Musical Heritage Historical Entity Linking

Arianna Graciotti, Nicolas Lazzari, Valentina Presutti, Rocco Tripodi

TL;DR

The paper addresses the challenge of Historical Entity Linking (HEL) by introducing MHERCL, a gold-standard benchmark derived from historical music periodicals that emphasizes long-tail and NIL entities. It proposes two methods to mitigate OCR noise and temporal/domain shifts: an unsupervised Entity Linking Dynamics (ELD) model, and Constrained-BLINK (C-BLINK), which applies Wikidata-based time and type constraints. Empirical results show that standard state-of-the-art linkers struggle on HEL, while incorporating plausibility constraints and NIL handling yields substantial gains, with large language models offering strong complementary performance. The work demonstrates that NIL-aware, knowledge-graph-constrained retrieval methods plus unsupervised game-theoretic disambiguation can robustly link historical mentions to KB entities, providing valuable tools for historical knowledge extraction and long-tail EL research.

Abstract

Linking named entities occurring in text to their corresponding entities in a Knowledge Base (KB) is challenging, especially when dealing with historical texts. In this work, we introduce Musical Heritage named Entities Recognition, Classification and Linking (MHERCL), a novel benchmark consisting of manually annotated sentences extracted from historical periodicals of the music domain. MHERCL contains named entities that are under-represented or absent in the most widely used KBs. We experiment with several State-of-the-Art models on the Entity Linking (EL) task and show that MHERCL is a challenging dataset for all of them. We propose a novel unsupervised EL model and a method to extend supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main difficulties posed by historical documents. Our experiments reveal that relying on unsupervised techniques and improving models with logical constraints based on KGs and heuristics to predict NIL entities (entities not represented in the KB of reference) results in better EL performance on historical documents.
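
To make the idea of KG-based plausibility constraints with NIL prediction concrete, the following is a minimal sketch, not the paper's C-BLINK implementation. It assumes each candidate carries a linker score plus an optional birth year and type taken from a KG. The `Candidate` class, the `is_plausible` and `link` functions, the `nil_threshold` value, the placeholder identifiers `Q_A` and `Q_B`, and all scores are hypothetical and chosen only for illustration; the document year 1824 echoes the example document in Figure 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    qid: str                           # placeholder identifier, not a real Wikidata QID
    score: float                       # hypothetical linker score, higher is better
    birth_year: Optional[int] = None   # assumed to come from the KG, if available
    entity_type: Optional[str] = None  # assumed coarse type, e.g. "person"


def is_plausible(c: Candidate, doc_year: int, mention_type: Optional[str]) -> bool:
    """Reject candidates that contradict the document date or the mention type."""
    # Time constraint: someone born after the document was written cannot be mentioned in it.
    if c.birth_year is not None and c.birth_year > doc_year:
        return False
    # Type constraint: only applied when both types are known.
    if mention_type and c.entity_type and c.entity_type != mention_type:
        return False
    return True


def link(candidates: List[Candidate], doc_year: int,
         mention_type: Optional[str] = None, nil_threshold: float = 0.5) -> str:
    """Return the best plausible candidate, or "NIL" if none is convincing."""
    plausible = [c for c in candidates if is_plausible(c, doc_year, mention_type)]
    if not plausible:
        return "NIL"
    best = max(plausible, key=lambda c: c.score)
    # Simple NIL heuristic (an assumption): low-scoring winners are treated as NIL.
    return best.qid if best.score >= nil_threshold else "NIL"


# Two hypothetical candidates for the same surname in a document dated 1824.
candidates = [
    Candidate("Q_A", score=0.6, birth_year=1741, entity_type="person"),
    Candidate("Q_B", score=0.9, birth_year=1827, entity_type="person"),
]
print(link(candidates, doc_year=1824, mention_type="person"))
```

In this sketch the higher-scoring candidate is discarded because its birth year falls after the document date, so the lower-scoring but temporally plausible candidate is returned. Filtering before ranking is one possible design; re-ranking with a penalty would be an alternative.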

Paper Structure

This paper contains 32 sections, 2 equations, 3 figures, 13 tables.

Figures (3)

  • Figure 1: Example sentence from an 1824 document in the Polifonia Corpus, with Naumann as the entity to be linked. The figure includes a selection of results from the Wikipedia disambiguation page (https://en.wikipedia.org/wiki/Naumann). Using sentence context and metadata from both the entity and the document allows for identifying the plausible entity, which in this case is unique.
  • Figure 2: Distribution of named entity popularity in the MHERCL, HIPE-2020, and AIDA CoNLL-YAGO benchmarks and in BLINK's training dataset. Each panel contains a histogram showing the density of entities' popularity levels: the higher the density, the more common the entities with the corresponding popularity value. Popularity is computed as the frequency of occurrence of each named entity's QID as an internal link in Wikipedia. NIL entities are excluded for MHERCL and HIPE-2020.
  • Figure 3: Kernel Density Estimation (KDE) plot comparing the smoothed density functions of named entity popularity across the MHERCL, HIPE-2020, AIDA CoNLL-YAGO and BLINK's Training Set (TS) datasets. For MHERCL and HIPE-2020, only entities with a valid QID are considered; NIL entities are excluded from the analysis.
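
The popularity measure described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: it assumes the internal links have already been resolved to QIDs and are available as a flat list. The function names `popularity` and `density_histogram`, the bin edges, and the placeholder identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def popularity(link_targets: Sequence[str]) -> Dict[str, int]:
    """Count how often each QID occurs as the target of an internal link."""
    return dict(Counter(link_targets))


def density_histogram(values: Sequence[float], bin_edges: Sequence[float]) -> List[float]:
    """Normalized histogram: bar height times bin width sums to 1 over all bins.

    Values outside [bin_edges[0], bin_edges[-1]] are ignored.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            # Bins are half-open [lo, hi), except the last one, which includes hi.
            if lo <= v < hi or (i == n_bins - 1 and v == hi):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / (total * (bin_edges[i + 1] - bin_edges[i])) for i, c in enumerate(counts)]


# Hypothetical link targets; the identifiers are placeholders, not real QIDs.
links = ["Q_A", "Q_A", "Q_A", "Q_B", "Q_C", "Q_C"]
pop = popularity(links)
densities = density_histogram(list(pop.values()), bin_edges=[0, 2, 4])
print(pop, densities)
```

Because popularity counts are typically heavy-tailed, a real analysis would likely bin on a logarithmic scale; the linear bins here only keep the example small.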

Theorems & Definitions (2)

  • Example 1
  • Example 2