Two Approaches to Diachronic Normalization of Polish Texts

Kacper Dudzic, Filip Graliński, Krzysztof Jassem, Marek Kubis, Piotr Wierzchoń

TL;DR

This work tackles diachronic normalization of Polish texts by comparing a handcrafted rule-based Transducers system with a plT5-based neural normalization model on a parallel corpus derived from Wikisource and Wolne Lektury. Four dataset variants are created to study the impact of pruning identical paragraphs and of separating training and test data by novel, so that both approaches can be evaluated under each configuration. The rule-based method generally achieves lower character and word error rates, while the neural approach performs better on the separated and pruned variant, which suggests that the two approaches have complementary strengths depending on data configuration. The study highlights the potential of hybrid approaches and identifies data quality and language-specific considerations as crucial factors for improving historical text normalization.

Abstract

This paper discusses two approaches to the diachronic normalization of Polish texts: a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the text-to-text transfer transformer architecture. The training and evaluation data prepared for the task are discussed in detail, along with the experiments conducted to compare the proposed normalization solutions. The results are analyzed both quantitatively and qualitatively. At the current stage of inquiry into the problem, the rule-based solution outperforms the neural one on three out of four variants of the prepared dataset, although in practice both approaches have distinct advantages and disadvantages.
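
The TL;DR reports results as character and word error rates. As a rough illustration of how such metrics are commonly computed, the sketch below uses plain Levenshtein distance divided by the reference length. This is an assumption about the metric definition, not the authors' evaluation code; the function names and the example strings (an older spelling "tłomaczyć" against the modern "tłumaczyć") are hypothetical and not taken from the paper's dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


# Hypothetical example: the modernized reference and a system output that
# left one historical spelling unchanged.
reference = "tłumaczyć generał"
hypothesis = "tłomaczyć generał"
print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
```

Under this definition, a single unnormalized character costs little in CER but counts as a whole wrong word in WER, which is one reason the two metrics are usually reported together.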
