Toucan: Many-to-Many Translation for 150 African Language Pairs

AbdelRahim Elmadany, Ife Adebara, Muhammad Abdul-Mageed

TL;DR

Toucan addresses MT gaps for African languages by combining Afrocentric pretraining with a large-scale many-to-many translation setup. The work introduces the Cheetah-1.2B and Cheetah-3.7B backbones, finetunes them into Toucan variants, and pairs them with AfroLingu-MT, presented as the largest Africa-focused MT benchmark, and with spBLEU-1K, an evaluation metric covering 1K languages. Empirical results show Toucan surpasses multiple baselines, including NLLB and Aya, across numerous language pairs, with larger models and broader multilingual coverage driving the gains. The contribution offers a practical, scalable pathway to improved MT for low-resource African languages and proposes tools to better evaluate and advance inclusivity in language technology across Africa.

Abstract

We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT. Toucan significantly outperforms other models on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.

Paper Structure (27 sections, 2 figures, 9 tables)

Figures (2)

  • Figure 1: Toucan is a powerful MT model, proficiently trained on 156 language pairs. It covers a wide spectrum of 43 African languages as well as Arabic, English, and French.
  • Figure 2: Examples from AfroLingu-MT benchmark train set.