Feriji: A French-Zarma Parallel Corpus, Glossary & Translator

Mamadou K. Keita, Elysabhete Amadou Ibrahim, Habibatou Abdoulaye Alfari, Christopher Homan

TL;DR

Feriji addresses the underrepresentation of Zarma in machine translation by introducing the first robust French-Zarma parallel corpus and glossary. The study fine-tunes three multilingual models on the Feriji data; M2M100 achieves the best BLEU score, 30.06, and receives favorable human judgments in a dedicated evaluation. A Feriji Translator interface and a community-driven feedback process highlight the resource's practical value, as well as accessibility challenges for illiterate users. Collectively, Feriji provides a critical resource and toolkit to advance Zarma MT research, education, healthcare translation, and cultural preservation.

Abstract

Machine translation (MT) is a rapidly expanding field that has advanced significantly in recent years with the development of models capable of translating multiple languages with remarkable accuracy. However, African languages remain poorly represented in this field, owing to linguistic complexity and scarce resources. This applies to the Zarma language, a dialect of Songhay (of the Nilo-Saharan language family) spoken by over 5 million people across Niger and neighboring countries \cite{lewis2016ethnologue}. This paper introduces Feriji, the first robust French-Zarma parallel corpus and glossary designed for MT. The corpus, which contains 61,085 sentences in Zarma and 42,789 in French, and the glossary of 4,062 words together represent a significant step toward meeting the need for more Zarma resources. We fine-tune three large language models on our dataset, obtaining a BLEU score of 30.06 with the best-performing model. We further evaluate the models through human judgments of fluency, comprehension, and readability, and we assess the importance and impact of the corpus and models. Our contributions help to bridge a significant language gap and promote an essential and overlooked indigenous African language.

Paper Structure (26 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Data Collection Process
  • Figure 2: Feriji Translator Beta Interface
  • Figure 3: Gender representation in the survey
  • Figure 4: Age representation in the survey