SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

David Ifeoluwa Adelani; Hannah Liu; Xiaoyu Shen; Nikita Vassilyev; Jesujoba O. Alabi; Yanke Mao; Haonan Gao; Annie En-Shiun Lee

TL;DR

SIB-200 is a large, open, sentence-level topic classification benchmark spanning 200+ languages, derived from Flores-200, that enables inclusive evaluation of multilingual NLU models beyond high-resource languages. The study systematically analyzes fully supervised training, cross-lingual transfer, and zero-shot prompting across diverse languages, language families, regions, and pre-trained language models (PLMs), including region-specific models and models adapted with multilingual adaptive fine-tuning (MAFT). Key findings are substantial performance gaps for low-resource languages, a strong influence of pre-training language coverage and script, and notable gains from region-specific pre-training and MAFT, while prompting LLMs often underperforms fine-tuning. The dataset and analyses aim to drive more equitable multilingual evaluation and to guide future model development toward truly diverse language coverage.

Abstract

Despite the progress recorded in the last few years in multilingual natural language processing, evaluation is typically limited to a small set of languages with available datasets, which excludes a large number of low-resource languages. In this paper, we created SIB-200, a large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects, to address the lack of evaluation datasets for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on the Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 203 languages covered in the corpus. Despite the simplicity of this task, our evaluation in the fully supervised setting, the cross-lingual transfer setting, and the large language model prompting setting shows that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, under-represented language families (like Nilotic and Atlantic-Congo), and languages from the regions of Africa, the Americas, Oceania, and South East Asia often have the lowest performance on our topic classification dataset. We hope our dataset will encourage a more inclusive evaluation of multilingual language models on a more diverse set of languages. https://github.com/dadelani/sib-200
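The abstract notes that labels were annotated on the English sentences and then extended to the other languages in the parallel corpus. A minimal sketch of that projection idea follows, assuming the corpus is sentence-aligned by index. The function name `project_labels`, the toy sentences, and the two label names are illustrative assumptions and are not taken from the released dataset or code.

```python
from typing import Dict, List, Tuple


def project_labels(
    english_labels: List[str],
    parallel_corpus: Dict[str, List[str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Attach the i-th English label to the i-th sentence of every language.

    Assumes every language lists the same sentences in the same order,
    which is what makes label projection possible without re-annotation.
    """
    labeled: Dict[str, List[Tuple[str, str]]] = {}
    for lang, sentences in parallel_corpus.items():
        if len(sentences) != len(english_labels):
            raise ValueError(
                f"{lang}: {len(sentences)} sentences but "
                f"{len(english_labels)} labels"
            )
        labeled[lang] = list(zip(sentences, english_labels))
    return labeled


# Toy, hypothetical data: two aligned sentences in two languages.
toy_corpus = {
    "eng_Latn": ["The match starts tonight.", "The vaccine is effective."],
    "fra_Latn": ["Le match commence ce soir.", "Le vaccin est efficace."],
}
toy_labels = ["sports", "health"]  # illustrative label names only

projected = project_labels(toy_labels, toy_corpus)
```

Because the alignment is purely positional, the length check is the only safeguard in this sketch; a real pipeline would more likely key sentences by a stable identifier.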

Paper Structure (57 sections, 6 figures, 8 tables)

Introduction
SIB-200 dataset
Data source
Data annotation
Quality control
Choosing the final label per sentence
Final classification dataset
Experimental setup
Languages and their categorizations
Categorization by geographical regions
Categorization by language family
Categorization by Joshi's classification
Categorization by availability in PLM
Text classification models
Multi-Layer Perceptron
...and 42 more sections

Figures (6)

Figure 1: Heatmap of performance by region within each of Joshi's classes.
Figure 2: Fully supervised model performance. We group languages by whether they and their scripts are seen in the pre-training corpus of XLM-R. Languages are ordered by XLM-R performance within every group.
Figure 3: Accuracy of the XLM-R model vs. pre-training corpus size in the fully supervised scenario. A bigger pre-training corpus in a target language generally improves model performance.
Figure 4: Script performance differences when one language has two different scripts. XLM-R and MLPs show the same trend. N-gram features are more robust to script changes than the XLM-R tokenizer.
Figure 5: Comparison of various scenarios. We group languages by whether they and their scripts are seen in the pre-training corpus of XLM-R. Languages are ordered by XLM-R fully supervised performance within every group.
...and 1 more figure
