AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages

Kosei Uemura, Miaoran Zhang, David Ifeoluwa Adelani

TL;DR

This work targets the underrepresentation of African languages in text embedding benchmarks by introducing AfriMTEB, a large-scale African benchmark (59 languages, 38 datasets), along with a compact AfriMTEB-Lite (9 languages, 13 datasets). It pairs AfriMTEB with AfriE5, an adaptation of mE5-Large-Instruct trained by cross-lingual contrastive distillation on translated MNLI/SNLI data that is filtered by SSA-COMET and supervised by a BGE Reranker. AfriE5 achieves state-of-the-art results on both AfriMTEB-Full and AfriMTEB-Lite, outperforming baselines such as Gemini Embedding and mE5, and shows strong cross-lingual transfer from the nine African languages used in training to all 59. The paper also provides ablations and the Lite benchmark to guide future development, emphasizing that broad language coverage and cross-lingual alignment can outweigh sheer model size for African-language embeddings.

Abstract

Text embeddings are an essential building block of several NLP tasks, such as retrieval-augmented generation, which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB, a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini Embedding and mE5.
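
To make the idea of distilling a teacher's judgments into an embedding model concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that a teacher (a reranker, for instance) assigns a relevance score to each candidate text for a query, and that the student embedding model is trained so that its temperature-scaled cosine similarities induce the same distribution over candidates. The function name `distillation_loss`, the KL-divergence objective, the temperature of 0.05, and the array shapes are all illustrative assumptions; the text above does not specify the loss actually used for AfriE5.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax: subtract the max before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def distillation_loss(query_emb, cand_emb, teacher_scores, temperature=0.05):
    """Mean KL(teacher || student) over each query's candidate list.

    query_emb:      (B, D)    student embeddings of B queries
    cand_emb:       (B, K, D) student embeddings of K candidates per query
    teacher_scores: (B, K)    teacher relevance scores (hypothetical input)
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=-1, keepdims=True)
    student_logits = np.einsum("bd,bkd->bk", q, c) / temperature

    # Turn both score sets into log-distributions over the K candidates.
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_scores)
    p_teacher = np.exp(log_p_teacher)

    # KL divergence per query, then averaged over the batch.
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl.mean())
```

In a cross-lingual setting, the query and the candidates could be in different languages (for example, an English premise paired with translated hypotheses), so minimising a loss of this kind would pull translations toward the same region of the embedding space. In practice the loss would be written in an autodiff framework so that gradients reach the student encoder; NumPy is used here only to keep the arithmetic visible.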

Paper Structure

This paper contains 61 sections, 3 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: AfriMTEB (Full): Model size vs. mean performance. Parameter counts (billions, log scale) are shown on the $x$-axis, and mean scores across AfriMTEB (Full) tasks on the $y$-axis. AfriE5-large-instruct (red) achieves the best overall performance (64.6) despite having far fewer parameters than most 7–8B models.
  • Figure 2: Overview of AfriMTEB and AfriMTEB-Lite. The Full suite spans 59 languages and 38 datasets across 7 families; the Lite suite provides uniform coverage for 9 languages and 13 datasets.
  • Figure 3: Performance on AfriMTEB-Lite across nine target languages. Bars show mean scores by language for four representative embedding models. AfriE5-large-instruct consistently achieves the highest or near-highest scores.