Named Entity Recognition and Classification on Historical Documents: A Survey

Maud Ehrmann, Ahmed Hamdi, Elvys Linhares Pontes, Matteo Romanello, Antoine Doucet

TL;DR

The survey examines named entity recognition on historical documents, where systems face a difficult combination of domain variety, noisy OCR input, diachronic language change, and resource scarcity. It inventories resources (typologies, corpora, embeddings) and reviews past and current approaches, highlighting a shift from rule-based and traditional machine learning systems to deep learning with transfer learning. Key findings show that BiLSTM-CRF and transformer-based models, especially when combined with in-domain and contextual embeddings, deliver state-of-the-art results despite data constraints, although results remain hard to compare because evaluation setups differ. The work identifies transferability, robustness to noise and language change, and resource sharing as key directions for scalable, accurate historical NER across languages and periods, with practical impact on semantic indexing and humanities research.

Abstract

After decades of massive digitisation, an unprecedented number of historical documents are available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining, and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged by diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.

Paper Structure

This paper contains 56 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Swiss journal L'Impartial, issue of 31 Dec 1918. Facsimile of the first page (left), zoom on an article (middle), and OCR of this article as provided by the Swiss National Library (completed in the 2010s) (right).
  • Figure 2: Results of CRF, BiLSTM-CRF and BERT-based NER systems on excerpts from the French Swiss Gazette de Lausanne of August 4 1818, p.4 (1), and from the German Luxembourgian Luxemburger Wort of July 21 1868, p.2 (2) (HIPE data), compared to the ground truth (NE GT).