Neural Text Normalization for Luxembourgish using Real-Life Variation Data

Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank

TL;DR

This work tackles the challenge of Luxembourgish orthographic variation by introducing the first generative text normalization models trained on real-life variation data. It constructs a large parallel training set from Spellchecker.lu variants and Chamber of Deputies transcripts, then fine-tunes ByT5 and mT5 while benchmarking against GPT-4o, Llama, and Spellux. The evaluation combines quantitative metrics (ERR, CER, accuracy, recall, precision, F1) with a linguistically motivated qualitative test suite covering 21 rules. Findings show ByT5 base as the strongest among the tested sequence models, with GPT-4o performing strongly in some settings but facing reproducibility concerns, and Spellux providing a competitive baseline. The study demonstrates that real-life variation data enables effective Luxembourgish normalization and highlights strengths and trade-offs across byte-based, word-based, and pipeline approaches, with potential applicability to linguistic analysis and standardization efforts.

Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully fledged standard variety. Additionally, developing NLP tools for Luxembourgish is difficult given the lack of annotated and parallel data, a problem exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures, with training data derived from word-level real-life variation data. We perform a fine-grained, linguistically motivated evaluation to test byte-based, word-based, and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

Paper Structure

This paper contains 14 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Illustration of the creation of training data with the Luxembourgish Online Dictionary (LOD) sentence 'Drink milk with honey, then your throat will no longer hurt.' and the variational statistical data for 'milk'. The algorithm processes every word sequentially; the figure illustrates only the replacement process for the word 'milk'.
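
The replacement process described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes that each word of a standard sentence is replaced by a variant drawn in proportion to its observed count, which the caption does not specify. The names `VARIANT_COUNTS`, `sample_variant`, and `make_training_pair` are hypothetical, and the English toy variants and their counts are invented for illustration only.

```python
import random

# Hypothetical variation statistics: standard form -> {observed variant: count}.
# Variants and counts are invented for illustration, not taken from the paper's data.
VARIANT_COUNTS = {
    "milk": {"milk": 60, "milck": 30, "mylk": 10},
}


def sample_variant(word, variant_counts, rng):
    """Return a variant of `word` sampled in proportion to its count.

    Words without variation statistics are returned unchanged.
    """
    counts = variant_counts.get(word)
    if not counts:
        return word
    variants = list(counts)
    weights = [counts[v] for v in variants]
    return rng.choices(variants, weights=weights, k=1)[0]


def make_training_pair(standard_sentence, variant_counts, rng):
    """Build a (variant source, standard target) pair.

    Every whitespace-separated word is processed in turn, mirroring the
    sequential, word-by-word replacement the caption describes.
    """
    tokens = standard_sentence.split()
    noisy = [sample_variant(tok, variant_counts, rng) for tok in tokens]
    return " ".join(noisy), standard_sentence


rng = random.Random(0)  # fixed seed so the sketch is repeatable
source, target = make_training_pair("drink milk with honey", VARIANT_COUNTS, rng)
print(source, "->", target)
```

The pair is oriented so that the variant sentence is the model input and the standard sentence is the normalization target. Tokenization here is plain whitespace splitting; a real pipeline would also need to handle punctuation and casing.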