INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut

TL;DR

This work introduces INCLUDE, a large-scale multilingual benchmark designed to evaluate regional knowledge in 44 languages using native-language exams. It assembles 197,243 multiple-choice question-answer (MCQA) pairs across 1,926 exams from 52 countries, combining newly collected data with existing non-English benchmarks to span 58 knowledge domains. The study reveals substantial cross-language variability and regional knowledge gaps, with model performance strongly influenced by language exposure, script transfer, and prompt design, while larger models and non-English pretraining generally improve results. It also foregrounds evaluation challenges in multilingual settings, such as format adherence and data contamination, and provides two ready-to-use subsets (Include-base and Include-lite) that enable broad participation and support incremental release to mitigate leakage. Overall, INCLUDE offers a rigorous framework for assessing and advancing regional-understanding capabilities in multilingual LLMs in the environments where they would actually be used.

Abstract

The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.

Paper Structure

This paper contains 29 sections, 9 figures, 16 tables.

Figures (9)

  • Figure 1: Overview of Include. (a) Motivation: Multilingual benchmarks must reflect the cultural and regional knowledge of the language environments in which they would be used. (b) Include is a multilingual benchmark compiled from academic, professional, and occupational license examinations reflecting regional and cultural knowledge in 44 languages.
  • Figure 2: Overview of the collected data grouped by script. We depict the languages associated with each script, the total samples in each script, and the percentage of the samples that were collected from new sources that have not been published by the community yet.
  • Figure 3: Performance of models stratified by language using in-language prompting. Results are grouped by whether the language was explicitly included in the pretraining dataset of the model (Trained on Language), whether a similar language with the same script was in the pretraining corpus (Trained on Script), or whether there was no linguistically similar language in the pretraining corpus (Neither). Colored dotted lines represent average performance for each category for a particular model. Black dotted lines represent average performance across all script-aligned languages.
  • Figure 4: GPT-4o performance (In-language Prompt) on regional history exams (cultural) and global history exams from that region (region-implicit) based on a total of 11,148 questions from Include. In each language (except Telugu), models perform better on the global history exam than the regional history exam.
  • Figure 5: GPT-4o performance across academic disciplines for Korean, Persian, Armenian, Hindi, Greek, and Russian. Each bar is annotated with the number of questions with correct answers.
  • ...and 4 more figures