From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation
Ali Marashian, Enora Rice, Luke Gessler, Alexis Palmer, Katharina von der Wense
TL;DR
This paper studies domain adaptation for neural machine translation in a realistic low-resource setting, where the only available resources are Bible-domain parallel data, a bilingual dictionary, and monolingual target-domain text in the high-resource language. The task is to translate between English and several low-resource languages. The paper compares four methods (DALI, LeCA, CPT, and a Combined approach) against an mBART baseline in the government and medical domains. The simplest method, DALI, improves performance most consistently, though overall scores remain modest and a small human evaluation of DALI shows there is still room for improvement. The results highlight the limits of current DA techniques in truly low-resource settings and suggest that pairing in-domain monolingual data with lexicon-informed pseudo-parallel data, as DALI does, is a promising direction, while underlining the need for more robust methods. Code and data resources are made available to support replication and further research.
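To make the idea of lexicon-informed pseudo-parallel data concrete, here is a minimal sketch of word-for-word dictionary substitution. It is not the authors' implementation. It assumes whitespace tokenization, a dictionary that maps high-resource words to low-resource words, and that unknown words are copied unchanged. The function names and the `lr_` placeholder tokens are hypothetical and stand in for real dictionary entries.

```python
from typing import Dict, List, Tuple


def word_for_word(sentence: str, lexicon: Dict[str, str]) -> str:
    """Replace each token with its dictionary entry; copy unknown tokens as-is."""
    tokens = sentence.lower().split()  # assumption: simple whitespace tokenization
    return " ".join(lexicon.get(tok, tok) for tok in tokens)


def build_pseudo_parallel(
    monolingual: List[str], lexicon: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair each in-domain sentence with its word-for-word rendering.

    Each pair is (pseudo low-resource side, original high-resource sentence).
    """
    return [(word_for_word(sentence, lexicon), sentence) for sentence in monolingual]


# Toy data: placeholder tokens, not real words from any language.
toy_lexicon = {"the": "lr_the", "doctor": "lr_doctor", "patient": "lr_patient"}
in_domain = ["The doctor examined the patient"]

pairs = build_pseudo_parallel(in_domain, toy_lexicon)
for pseudo_side, original in pairs:
    print(pseudo_side, "|||", original)
```

In a setup like this, the resulting pairs would be used as extra fine-tuning data alongside the Bible-domain parallel data. Copying out-of-dictionary words is only one possible choice for handling gaps in the lexicon.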
Abstract
Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.
