
Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study

Wan-Hua Her, Udo Kruschwitz

TL;DR

This study examines neural machine translation for low-resource languages through bidirectional German–Bavarian translation, comparing a Transformer baseline with Back-translation and Transfer Learning. It draws on parallel and monolingual corpora and a 5-fold cross-validation framework, evaluates with BLEU, chrF, and TER, and applies a Bonferroni-corrected significance threshold of $0.017$. The baselines are unexpectedly high because of the similarity between the two languages; Back-translation delivers statistically significant improvements, while Transfer Learning offers substantial gains that do not consistently surpass Back-translation. The work highlights the importance of data quality, language proximity, and careful evaluation in low-resource MT, and it points to curated corpora and dialect-aware modeling as the path forward. Its practical contribution is to demonstrate viable translation for a border-region language pair and to inform methodology for similar low-resource settings, with code and data processing specifics made available for reproducibility.
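As a rough illustration of where a threshold like $0.017$ can come from, the sketch below applies the standard Bonferroni correction, which divides the family-wise significance level by the number of comparisons. The values `alpha=0.05` and `num_comparisons=3` are assumptions chosen because they reproduce the reported figure; the summary above does not state which comparisons were counted, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Return the per-comparison significance threshold alpha / m."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def is_significant(p_value: float, alpha: float = 0.05, num_comparisons: int = 3) -> bool:
    """Check a single p-value against the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, num_comparisons)


# Assumed setting: family-wise alpha of 0.05 spread over three comparisons.
threshold = bonferroni_threshold(0.05, 3)
print(round(threshold, 3))   # 0.017
print(is_significant(0.01))  # True: below the corrected threshold
print(is_significant(0.03))  # False: would pass at 0.05 but not after correction
```

The last line shows the practical effect of the correction: a p-value of 0.03 would count as significant at the uncorrected 0.05 level but not at the corrected one.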

Abstract

Machine Translation has made impressive progress in recent years, offering close to human-level performance on many languages, but studies have primarily focused on high-resource languages with broad online presence and resources. With the help of growing Large Language Models, more and more low-resource languages achieve better results through the presence of other languages. However, studies have shown that not all low-resource languages can benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions of low-resource languages, such as data scarcity and parameter sensitivity, and focus on refined solutions that combat low-resource difficulties as well as creative solutions such as harnessing language similarity. Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate the noisiness of the data and present our approach to extensive text preprocessing. Evaluation was conducted using combined metrics: BLEU, chrF and TER. Statistical significance tests with Bonferroni correction show that the baseline systems perform surprisingly well and that Back-translation leads to significant improvement. Furthermore, we present a qualitative analysis of translation errors and system limitations.
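The idea of using Back-translation to generate additional training data can be sketched in a few lines. This is a generic illustration of the technique, not the authors' pipeline: a reverse-direction model translates monolingual target-side sentences into synthetic source sentences, and the resulting pairs are added to the genuine parallel data. The names `augment_with_back_translation` and `reverse_translate`, and the toy stand-in model, are hypothetical.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def augment_with_back_translation(
    parallel: List[SentencePair],
    monolingual_target: List[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Append synthetic (source, target) pairs built from monolingual target text.

    reverse_translate is assumed to map a target-language sentence to a
    source-language sentence, e.g. a previously trained reverse model.
    """
    synthetic = [(reverse_translate(sentence), sentence) for sentence in monolingual_target]
    # The genuine pairs come first; the target side of every synthetic pair is real text.
    return list(parallel) + synthetic


# Toy stand-in for a reverse model, used only to make the sketch runnable.
def toy_reverse_model(sentence: str) -> str:
    return "<synthetic> " + sentence


parallel_data = [("source sentence 1", "target sentence 1")]
monolingual_data = ["target sentence 2", "target sentence 3"]

augmented = augment_with_back_translation(parallel_data, monolingual_data, toy_reverse_model)
print(len(augmented))  # 3
```

Only the source side of each added pair is machine-generated, so the forward model still learns to produce authentic target-language text.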

Paper Structure (31 sections, 4 tables)