Historical German Text Normalization Using Type- and Token-Based Language Modeling

Anton Ehrmanntraut

TL;DR

This work tackles historical German text normalization by combining a type-based encoder–decoder Transformer with sentence-level language-model re-ranking to produce contemporary orthography from texts written circa 1700–1900. The two-stage hybrid approach first generates candidates from a substitution lexicon and an OOV-capable Type Transformer, then uses a German GPT-2 LM to select the most plausible sentence-level normalization, which resolves context-sensitive ambiguities. Evaluated on the DTAEC dataset, derived from the DTA EvalCorpus, the hybrid model achieves state-of-the-art performance and substantially outperforms a production CAB-based baseline, though it does not clearly exceed the sentence-based Transnormer results and shows persistent gaps on unseen tokens. The study underscores the remaining data scarcity and editorial inconsistency in historical corpora, and points to open questions about generalization, cross-domain transfer, and downstream NLP improvements from orthographic normalization.

Abstract

Historic variations of spelling pose a challenge for full-text search and natural language processing on digitized historical texts. To narrow the gap between historic orthography and contemporary spelling, an automatic orthographic normalization of the historical source material is usually pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system takes a machine learning approach based on Transformer language models, combining an encoder-decoder model that normalizes individual word types with a pre-trained causal language model that adjusts these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger, fully end-to-end sentence-based normalization system that fine-tunes a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge because models have difficulty generalizing and because extensive high-quality parallel data is lacking.
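
The two-stage idea described above (type-level candidate generation followed by sentence-level selection with a language model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the lexicon entries, the toy bigram counts that stand in for the pre-trained causal language model, the identity fallback that stands in for the type-level encoder-decoder model, and the function names `candidates`, `score`, and `normalize`. It is not the report's implementation.

```python
import itertools
import math

# Hypothetical substitution lexicon: historical type -> candidate modern types.
LEXICON = {
    "thun": ["tun"],
    "wider": ["wider", "wieder"],  # ambiguous without sentence context
}

# Toy bigram counts standing in for a pre-trained causal language model.
BIGRAM_COUNTS = {
    ("es", "wieder"): 5,
    ("wieder", "tun"): 4,
    ("es", "wider"): 1,
}


def candidates(token):
    """Return candidate normalizations for one historical word type.

    Unknown types fall back to the identity here; in the described system
    a type-level encoder-decoder model would handle them instead.
    """
    return LEXICON.get(token, [token])


def score(sentence):
    """Add-one smoothed bigram log score of a candidate sentence."""
    pairs = zip(sentence, sentence[1:])
    return sum(math.log(BIGRAM_COUNTS.get(pair, 0) + 1) for pair in pairs)


def normalize(tokens):
    """Pick the highest-scoring combination of per-token candidates."""
    options = [candidates(token) for token in tokens]
    return list(max(itertools.product(*options), key=score))


result = normalize(["wir", "wollen", "es", "wider", "thun"])
print(result)
```

Enumerating every combination grows exponentially with the number of ambiguous tokens, so this sketch only suits short sentences; a practical system would need a more economical search strategy.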

Paper Structure (27 sections, 5 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Comparison between an OCR'd historical print and a potential example of its normalized equivalent. Underlined spans correspond to changes (trivial transliterations of long-s and superscript-e are ignored).
  • Figure 2: Example usage of the spacing encoding scheme using its two pseudo-characters. Note that both the source and target sequence contain the same number of tokens.
  • Figure 3: Excerpt from the DTA EvalCorpus and the corresponding section in the transformed DTAEC dataset. Observe how the second pre-processing step already transliterated the long s $\langle$ſ$\rangle$ in the “orig” column.
  • Figure 4: Selection of character errors made by the hybrid system on the DTAEC test split.