Interplay of Machine Translation, Diacritics, and Diacritization
Wei-Rui Chen, Ife Adebara, Muhammad Abdul-Mageed
TL;DR
This work systematically analyzes how diacritics and diacritization interact with machine translation (MT) across 55 languages, comparing high-resource and low-resource scenarios. It introduces four model paradigms (OnlyMTdia, OnlyMTundia, OnlyDia, DiaMT) to probe MT–diacritization and MT–diacritics interactions across varied training-set sizes, using datasets derived from the Bible and Europarl corpora. Three findings stand out: diacritization boosts MT in low-resource settings but can degrade it in high-resource settings; MT largely harms diacritization, except for some languages in high-resource settings; and keeping rather than removing diacritics has little effect on MT performance. The authors also introduce two classes of language-agnostic diacritics complexity metrics (ratio-based and entropy-based) that correlate positively with diacritization performance, which may help anticipate how well diacritization will work for a given language. Overall, the paper offers insights and quantitative tools for choosing between multi-task and single-task MT and diacritization systems under different data-size conditions, with implications that may generalize beyond the 55 languages tested.
Abstract
We investigate two research questions: (1) how machine translation (MT) and diacritization influence each other's performance in a multi-task learning setting, and (2) how keeping (vs. removing) diacritics affects MT performance. We examine these two questions in both high-resource (HR) and low-resource (LR) settings across 55 different languages (36 African languages and 19 European languages). For (1), results show that diacritization significantly benefits MT in the LR scenario, doubling or even tripling performance for some languages, but harms MT in the HR scenario. We also find that MT harms diacritization in the LR scenario but benefits it significantly in the HR scenario for some languages. For (2), MT performance is similar regardless of whether diacritics are kept or removed. In addition, we propose two classes of metrics to measure the complexity of a diacritical system, finding these metrics to correlate positively with the performance of our diacritization models. Overall, our work provides insights for developing MT and diacritization systems under different data size conditions and may have implications that generalize beyond the 55 languages we investigate.
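The abstract contrasts keeping and removing diacritics and mentions ratio- and entropy-based complexity metrics without defining them here. The following is a minimal Python sketch of what such preprocessing and measurements could look like. The function names (`split_clusters`, `strip_diacritics`, `diacritic_ratio`, `diacritic_entropy`) and both metric definitions are illustrative assumptions, not the definitions used in the paper. It also assumes diacritics are Unicode combining marks, so letters such as "ø" or "ł", which have no canonical decomposition, are left unchanged.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def split_clusters(text: str) -> list[tuple[str, str]]:
    """Split text into (base, marks) pairs using canonical decomposition (NFD).

    Each base character is paired with the combining marks that follow it.
    """
    clusters: list[tuple[str, str]] = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and clusters:
            base, marks = clusters[-1]
            clusters[-1] = (base, marks + ch)
        else:
            clusters.append((ch, ""))
    return clusters


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (the 'undiacritized' form)."""
    bases = "".join(base for base, _ in split_clusters(text))
    return unicodedata.normalize("NFC", bases)


def diacritic_ratio(text: str) -> float:
    """Hypothetical ratio metric: share of letters carrying at least one mark."""
    letters = [(b, m) for b, m in split_clusters(text) if b.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for _, m in letters if m) / len(letters)


def diacritic_entropy(text: str) -> float:
    """Hypothetical entropy metric: H(marks | base letter), in bits.

    Higher values mean the diacritic on a letter is harder to guess
    from the base letter alone.
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for base, marks in split_clusters(text):
        if base.isalpha():
            counts[base][marks] += 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for variants in counts.values():
        n = sum(variants.values())
        for k in variants.values():
            # Weight each base letter's entropy by its relative frequency.
            entropy -= (n / total) * (k / n) * math.log2(k / n)
    return entropy
```

Under these assumed definitions, a corpus in which every base letter always appears with the same marks has entropy 0, while a corpus in which a letter appears equally often with and without a mark contributes one bit for that letter.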
