Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments

Tuka Alhanai, Adam Kasumovic, Mohammad Ghassemi, Aven Zitzelberger, Jessica Lundin, Guillaume Chabot-Couture

TL;DR

This work tackles the equity gap in large language models by measuring and narrowing the performance gap between English and eight low-resource African languages. It introduces a benchmark suite created by translating Winogrande and three MMLU sections (college medicine, clinical knowledge, and virology) into Amharic, Bambara, Igbo, Sepedi, Shona, Sesotho, Setswana, and Tsonga, totaling roughly 1 million human-translated words and enabling direct cross-language evaluation. Through evaluations of state-of-the-art models and results from over 400 fine-tuned models, the study reports average mono-lingual gains of about 5.6% from fine-tuning, average cross-lingual gains of about 2.9%, and a 3.0% out-of-the-box performance boost on culturally appropriate questions, with culture-aware evaluation further revealing improvements of up to 15.6% on certain languages. The work shows that high-quality, domain-aligned fine-tuning and culturally aware data creation can meaningfully reduce the gap, and it provides publicly available benchmarks, translations, and code to support further progress toward inclusive language technologies for African language communities.

Abstract

Large Language Models (LLMs) have shown remarkable performance across various tasks, yet significant disparities remain for non-English languages, and especially native African languages. This paper addresses these disparities by creating approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages, covering a population of over 160 million speakers of: Amharic, Bambara, Igbo, Sepedi (Northern Sotho), Shona, Sesotho (Southern Sotho), Setswana, and Tsonga. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages. Finally, using results from over 400 fine-tuned models, we explore several methods to reduce the LLM performance gap, including high-quality dataset fine-tuning (using an LLM-as-an-Annotator), cross-lingual transfer, and cultural appropriateness adjustments. Key findings include average mono-lingual improvements of 5.6% with fine-tuning (with 5.4% average mono-lingual improvements when using high-quality data over low-quality data), 2.9% average gains from cross-lingual transfer, and a 3.0% out-of-the-box performance boost on culturally appropriate questions. The publicly available benchmarks, translations, and code from this study support further research and development aimed at creating more inclusive and effective language technologies.

Paper Structure

This paper contains 34 sections, 24 figures, 34 tables.

Figures (24)

  • Figure 1: GPT-4o Winogrande Performance on "appropriate" vs. "inappropriate" Data. GPT-4o was evaluated on Winogrande (test set) out-of-the-box in each target language and in English. Top plot: the absolute performance on QA pairs considered culturally "appropriate" and "inappropriate" according to native speakers. Bottom plot: performance lifts for each language (green) and in English (grey), using the same annotations. A QA pair was defined as "appropriate" when either annotator marked the cultural appropriateness of the question as "typical". Only QA pairs where both annotators reported that the translation quality was "good" or "understandable" were considered. Language codes are: Xhosa (xh), Igbo (ig), Tsonga (ts), Bambara (bm), Amharic (am), Setswana (tn), Sesotho (st), Zulu (zu), Afrikaans (af), Sepedi (nso), Shona (sn). See Table \ref{table:appropriateness-3} for a breakdown of performance by language. See Figure \ref{fig:appropriateness_boxplots} for the distributions obtained when repeated random samples of the same size as the appropriate and inappropriate counts for each target language are drawn.
  • Figure 2: Mono- and Cross-lingual LLM Performance Gains. The figure displays boxplots of performance gains when fine-tuning with either the translated Winogrande train set (left) or the MMLU college medicine section (right). The fine-tuned models were evaluated across 4 datasets (x-axis) for mono-lingual gains (blue) across 11 African languages, and cross-lingual gains (green) across 110 African language pairs. The largest gains were observed for models fine-tuned with MMLU college medicine and evaluated on MMLU clinical knowledge. Wino: Winogrande, ck: clinical knowledge, vir: virology, Bele: Belebele. En: English.
  • Figure 3: LLM Performance Across Quality and Quantity Combinations. The figure displays LLM performance when fine-tuning by data quality and quantity, using MMLU college medicine and evaluating on MMLU clinical knowledge (which had the greatest mono-lingual gains in Figure 2). The quality of samples was rated using GPT-4o LLM-as-an-Annotator scores. The lowest tertile and highest tertile were defined as low (yellow) and high (green) quality samples, respectively, and were used to fine-tune Llama 3 70B IT. Boxplots display performance across 11 African languages. English (En) is provided as a reference (red). Overall, the use of high-quality fine-tuning data over low-quality fine-tuning data improved performance for African languages. See Table \ref{table:quality-x-quantity-ts-mmlu-ck} for a breakdown by language.
  • Figure A.1: Example Translation Task Form Given to Workers Hired on Upwork.com. Workers were tasked to translate each Winogrande QA pair into eight African languages. The translator was able to see the percentage of the task completed to aid in time management. The translator was also able to see any warnings regarding translation similarity to Google Translate output (see Appendix Section \ref{sec:appendix-benchmark-translation} for more information). The figure above shows an example (in Shona) with four rows filled in by a translator.
  • Figure A.2: Example Translation Review Task Form Given to Workers Hired on Upwork.com. Workers were tasked to review the Winogrande translations provided by a previous worker. The reviewer was able to see the percentage of the task completed to aid in time management. The reviewer was also able to see any warnings regarding translation similarity to Google Translate output (see Appendix Section \ref{sec:appendix-benchmark-translation} for more information). The figure above shows an example (in Shona) with four rows filled in by a reviewer.
  • ...and 19 more figures
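
The labeling rule in the Figure 1 caption can be made concrete with a short sketch. This is not the authors' code: the `QAPair` record, the function names, and the rating strings other than "typical", "good", and "understandable" (for example "atypical" and "poor") are hypothetical and used only for illustration. The sketch keeps a QA pair only if both annotators rated the translation "good" or "understandable", labels it "appropriate" if either annotator marked it "typical", and then reports accuracy on each subset.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Quality ratings that keep a QA pair in the analysis (from the Figure 1 caption).
ACCEPTED_QUALITY = {"good", "understandable"}


@dataclass
class QAPair:
    # Hypothetical record layout; the paper's actual data schema may differ.
    quality: Tuple[str, str]          # translation-quality rating from each annotator
    appropriateness: Tuple[str, str]  # cultural-appropriateness rating from each annotator
    correct: bool                     # whether the model answered this item correctly


def label(pair: QAPair) -> Optional[str]:
    """Return 'appropriate', 'inappropriate', or None if the pair is excluded."""
    # Both annotators must rate the translation as good or understandable.
    if not all(q in ACCEPTED_QUALITY for q in pair.quality):
        return None
    # Either annotator marking the question as typical makes it appropriate.
    if any(a == "typical" for a in pair.appropriateness):
        return "appropriate"
    return "inappropriate"


def accuracy_by_label(pairs: List[QAPair]) -> Dict[str, Optional[float]]:
    """Accuracy on each subset; None when a subset is empty."""
    counts = {"appropriate": [0, 0], "inappropriate": [0, 0]}  # [correct, total]
    for pair in pairs:
        lab = label(pair)
        if lab is None:
            continue
        counts[lab][0] += int(pair.correct)
        counts[lab][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}
```

Applied per language, the two accuracies correspond to the quantities shown in the top plot of Figure 1. How the bottom plot's "performance lift" is derived from them is not spelled out in the caption, so it is left out of the sketch.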