From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation

Mardiyyah Oduwole, Oluwatosin Olajide, Jamiu Suleiman, Faith Hunja, Busayo Awobade, Fatimo Adebanjo, Comfort Akanni, Chinonyelum Igwe, Peace Ododo, Promise Omoigui, Abraham Owodunni, Steven Kolawole

TL;DR

The paper addresses machine translation for African languages with scarce parallel data by testing two non-generative augmentation strategies: sentence concatenation with back-translation, and switch-out. Both are evaluated on six language pairs using MaFAND and mBART. Results show substantial, language-dependent BLEU improvements across most language pairs, though some languages (e.g., Swahili) benefit less, and the best-performing augmentation method varies by language. The work demonstrates the practical potential of data augmentation for under-resourced African MT and points to future work with larger models and generative augmentation to close the remaining gaps.

Abstract

The linguistic diversity across the African continent presents distinct challenges and opportunities for machine translation. This study explores the effects of data augmentation techniques on translation systems for low-resource African languages. We focus on two data augmentation techniques, sentence concatenation with back-translation and switch-out, and apply them across six African languages. Our experiments show significant improvements in machine translation performance, with a minimum increase of 25% in BLEU score across all six languages. We provide a comprehensive analysis and highlight the potential of these techniques for building more robust translation systems for under-resourced languages.

Paper Structure

This paper contains 14 sections, 3 tables.