Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole

Jacqueline Rowe, Edward Gow-Smith, Mark Hepple

TL;DR

This work addresses machine translation (MT) for Guinea-Bissau Creole (Kiriol), a low-resource creole whose available data is dominated by religious texts. It introduces a dataset of around 40 thousand parallel sentences pairing Kiriol with English and Portuguese, combining religious data (the Bible and Jehovah's Witnesses texts) with a small amount of general-domain dictionary data, and evaluates Transformer models trained from scratch to study cross-domain transfer. Key findings show that adding only a few hundred target-domain sentences (roughly 300 to 600) substantially improves general-domain translation, while lexical overlap with the lexifier language (Portuguese) and shared embeddings further boost performance, particularly for the Kiriol-Portuguese directions. The results offer practical guidance for data collection and model design in creole MT, and underscore the need for community-aware, data-rich development to enable robust language technologies for Kiriol and similar creoles.

Abstract

We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences to English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah's Witnesses), but also a small amount of general-domain data (from a dictionary). This mirrors the typical resource availability of many low-resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain during training substantially improves translation performance, highlighting the importance of data collection for low-resource languages, even on a small scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.

Paper Structure

This paper contains 16 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Average performance of Portuguese-Kiriol, Kiriol-Portuguese, Kiriol-English and English-Kiriol models trained on different portions of Bible and WT data when used to translate a test set of 1,000 domain-general dictionary sentences. Standard errors across model sets are shown with error bars.
  • Figure 2: Average performance of Portuguese-Kiriol, Kiriol-Portuguese, Kiriol-English and English-Kiriol models trained on Bible, WT and different combinations of domain-general data when used to translate a test set of 1,000 domain-general dictionary sentences. Standard errors across model sets are shown with error bars, and the baseline average performance of models trained only on Bible and WT data is shown with dotted lines.
  • Figure 3: Average human judgement scores, across all language directions, for accuracy (solid) and fluency (hatched) of translated sentences from the reference sets (control), from models trained on Bible and WT data (BWT), and from models trained on Bible, WT and 600 dictionary sentences. Standard errors across all judgements for each condition are shown with error bars.
  • Figure 4: Average BLEU on the test set using shared or separate embeddings, with combined (solid) and separate (hatched) tokenisers of vocabulary size 10k. Standard errors across model sets are shown with error bars.
  • Figure 5: Improvements in average BLEU on the test set by models using shared embeddings compared to models using separate embeddings. Results are averaged across combined and separate tokeniser conditions.
  • ...and 3 more figures