Building a Functional Machine Translation Corpus for Kpelle

Kweku Andoh Yamoah, Jackson Weako, Emmanuel J. Dorley

TL;DR

This work tackles the paucity of NLP resources for Kpelle by introducing the first public English-Kpelle corpus, containing 3234 translation pairs across everyday, religious, and educational content. By fine-tuning Meta's NLLB on two dataset versions (V1 and V2), the study shows that data augmentation yields substantial gains, reaching up to $\approx 30$ BLEU for $kpe\_Latn \rightarrow eng\_Latn$ and $\approx 24$ BLEU for the reverse direction. The results align with NLLB-200 benchmarks for other African languages, indicating Kpelle's competitive potential given continued data expansion and orthography standardization. Beyond MT, the dataset supports broader NLP tasks and is accompanied by a roadmap emphasizing community validation and interdisciplinary collaboration to promote inclusive language technologies for Kpelle and other low-resource Mande languages.

Abstract

In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta's No Language Left Behind (NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle's potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resource Mande languages.

Paper Structure

This paper contains 34 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the Liberian language family under the Niger-Congo branch.
  • Figure 2: Sentence length distributions for English (top) and Kpelle (bottom), illustrating the corpus’s inherent variability.
  • Figure 3: NLLB-200 fine-tuning with Kpelle: (a) Model adaptation for bidirectional translation, and (b) a sample translation.