Toucan: Many-to-Many Translation for 150 African Language Pairs
AbdelRahim Elmadany, Ife Adebara, Muhammad Abdul-Mageed
TL;DR
Toucan addresses machine translation (MT) gaps for African languages by combining Afrocentric pretraining with a large-scale many-to-many translation setup. The work introduces the Cheetah-1.2B and Cheetah-3.7B backbones, finetunes them into Toucan variants, and pairs them with AfroLingu-MT, the largest Africa-focused MT benchmark, and with spBLEU-1K for evaluation across a wider set of languages. Empirically, Toucan surpasses multiple baselines, including NLLB and Aya, across numerous language pairs, with larger models and broader multilingual coverage driving the gains. The contribution offers a practical, scalable path to better MT for low-resource African languages, along with tools for evaluating and advancing inclusivity in language technology across Africa.
Abstract
We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters, respectively. Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we develop an extensive benchmark, dubbed AfroLingu-MT, tailored to machine translation for African languages. Toucan significantly outperforms other models on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.
