Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

Benjamin Akera, Evelyn Nafula Ouma, Gilbert Yiga, Patrick Walukagga, Phionah Natukunda, Trevor Saaka, Solomon Nsumba, Lilian Teddy Nabukeera, Joel Muhanguzi, Imran Sekalala, Nimpamya Janat Namara, Engineer Bainomugisha, Ernest Mwebaze, John Quinn

TL;DR

The paper argues that centralised LLM development is biased toward widely spoken languages, and proposes a regionally focused strategy instead, demonstrated by modeling Ugandan languages with Sunflower 14B and 32B. It describes a three-stage training pipeline: continued pretraining on diverse Ugandan data, supervised fine-tuning on translation and instruction tasks, and reinforcement learning with Direct Preference Optimisation. Empirically, Sunflower-32B achieves state-of-the-art translation performance in 24 of 31 Ugandan languages (local-to-English) and competitive results in the English-to-local direction, and AfriMMLU results indicate competitive reasoning on a subset of African languages. The work suggests that regionally coherent data collection, cultural grounding, and community feedback allow open-source models to outperform many larger baselines, offering a replicable path for expanding language coverage in other linguistically diverse regions.

Abstract

There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state-of-the-art comprehension in the majority of Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.

Paper Structure

This paper contains 23 sections, 3 figures, and 15 tables.

Figures (3)

  • Figure 1: Comparison of machine translation performance (mean chrF over xx$\rightarrow$eng and eng$\rightarrow$xx). Sunflower 32B has the highest mean chrF in 24 out of 31 Ugandan languages.
  • Figure 2: Feedback solicited from community members through the online #breakthesystem campaign. This was particularly aimed at finding examples where translation performed poorly.
  • Figure 3: In-person testing of Sunflower by volunteers speaking several different Ugandan languages.
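
Figure 1 reports a mean chrF score over the two translation directions. As a rough illustration of what that metric measures, the sketch below computes a simplified character n-gram F-score and averages it over both directions. It is not the paper's evaluation code: the function names `simple_chrf` and `mean_bidirectional_chrf`, the choice of n-gram orders 1 to 6, and beta = 2 are assumptions made here for illustration, and standard tools such as sacreBLEU differ in details like whitespace handling and corpus-level aggregation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale.

    Precision and recall are averaged over n-gram orders 1..max_n,
    then combined into an F-beta score (beta > 1 weights recall more).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def mean_bidirectional_chrf(chrf_to_eng: float, chrf_from_eng: float) -> float:
    """Average of the xx->eng and eng->xx scores for one language."""
    return (chrf_to_eng + chrf_from_eng) / 2
```

Under this sketch, an exact match scores 100 and two strings with no characters in common score 0; each language's value in a comparison like Figure 1 would be the average of its two directional scores.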