Pula: Training Large Language Models for Setswana
Nathan Brown, Vukosi Marivate
TL;DR
Pula addresses Setswana data scarcity by releasing a family of bilingual Setswana-English LLMs (Pula 1B, 3B, 8B, and 14B) trained with LoRA/QLoRA on a unified mix of raw and instruction data. The core data assets, Marothodi (the largest Setswana text corpus to date) and Medupi (a Setswana instruction-tuning dataset), are openly released alongside training and evaluation tooling. Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation and achieve state-of-the-art results for their size on Setswana reasoning benchmarks, with notable gains in cross-lingual reasoning. The work also provides the MMLU-tsn and GSM8K-tsn benchmarks and sets the stage for broader Setswana NLP research through open datasets, benchmarks, and code.
Abstract
In this work, we present Pula, a suite of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation tasks and achieve state-of-the-art performance on Setswana reasoning tasks for their size. We release the weights for Pula 1B, 3B, 8B, and 14B, along with training logs and the training and evaluation code. Alongside Pula, we release the largest-ever Setswana text corpus, Marothodi, and the first comprehensive Setswana instruction-tuning dataset, Medupi, consisting of reformatted datasets, translated corpora, and synthetic LLM-generated text. To accompany this data, we release the code used for dataset construction, formatting, filtering, and scraping. Finally, we release two Setswana LLM-translated benchmarks, MMLU-tsn and GSM8K-tsn, to measure Setswana knowledge and reasoning capabilities.
