MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

Dieuwke Hupkes, Nikolay Bogoychev

TL;DR

MultiLoKo tackles the challenge of robust multilingual evaluation by constructing a wide-coverage benchmark across 31 languages with locally sourced questions and fully parallel human and machine translations. Its five-step data collection pipeline emphasizes local relevance, rigorous QA, and controlled translations to enable analyses of language transfer, locality effects, and data-source biases. The study evaluates 11 multilingual-capable models, revealing substantial language gaps, limited cross-language transfer, and notable differences driven by question language and data sourcing; machine translations further shift language difficulty estimates and model rankings. The findings highlight the need for locally grounded benchmarks to accurately assess multilingual knowledge and inform model development and evaluation practices, while also offering a framework to study benchmark design choices such as translation vs. local data and human vs. machine translation. Overall, MultiLoKo provides a rigorous, extensible platform for analyzing multilinguality in LLMs and offers actionable insights into how data provenance affects language-specific performance and cross-language generalization.

Abstract

We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences of more than 20 points for the best performing models and can drastically change the estimated difficulty of some languages. When using machine instead of human translations, we find a weaker effect on the ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.

Paper Structure

This paper contains 67 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: EM distributions and Gap dev. (a) Boxplot of observed EM scores for each model, sorted by mean. (b) Difference between the best EM and the worst of the N next best EM scores, per model.
  • Figure 2: Mother tongue effect dev. (a) Per language MTE for MultiLoKo dev, indicating the difference between questions asked in the mother tongue (locally relevant) and in English. Error bars indicate 2 times standard error across all models, excluding Claude 3.5 Sonnet. (b) KDE plot of the distribution of MTE scores for the top-3 performing models.
  • Figure 3: Consistency results dev. (a) Average per-model consistency scores, $\pm$ 2 times the standard error across languages. (b) Boxplot of model consistency scores per language, indicating the relative overlap of correctly answered questions when asked in the mother tongue vs in English.
  • Figure 4: Average EM per language dev, in mother tongue and English. Top: Average EM on locally sourced data. Bottom: Average EM on locally sourced data, translated to English.
  • Figure 5: Locality Effect dev. (a) Per language Locality Effect, indicating the difference in assigned scores between locally sourced and translated English data. A positive LE means the locally sourced data has a higher score (is easier); a negative LE means the English sourced data has a higher score. (b) Per-model rank correlation between the difficulty of languages on locally sourced vs English translated data.
  • ...and 4 more figures
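
The captions above describe several derived quantities in words: the Gap (Figure 1b), the mother tongue effect (MTE, Figure 2), consistency (Figure 3), and the Locality Effect (LE, Figure 5). The following is a minimal Python sketch of how these could be computed from per-language exact-match (EM) scores, based only on the caption wording. All function names, argument names, and sample values are hypothetical and do not come from the paper. In particular, the captions only call consistency a "relative overlap" of correctly answered questions, so the intersection-over-union normalization used here is an assumption, not necessarily the paper's definition.

```python
from typing import Dict, Set


def gap(em_by_language: Dict[str, float], n: int) -> float:
    """Best EM minus the worst of the n next-best EM scores (Figure 1b wording)."""
    ranked = sorted(em_by_language.values(), reverse=True)
    if not 1 <= n < len(ranked):
        raise ValueError("n must be between 1 and the number of languages minus 1")
    # ranked[1..n] are the n next-best scores; ranked[n] is the worst of them.
    return ranked[0] - ranked[n]


def mother_tongue_effect(em_mother_tongue: float, em_english: float) -> float:
    """EM on questions asked in the mother tongue minus EM on the same
    questions asked in English (Figure 2 wording)."""
    return em_mother_tongue - em_english


def locality_effect(em_local: float, em_english_sourced: float) -> float:
    """EM on locally sourced data minus EM on English-sourced translated data.
    Positive means the locally sourced data scores higher (Figure 5 wording)."""
    return em_local - em_english_sourced


def consistency(correct_mother_tongue: Set[str], correct_english: Set[str]) -> float:
    """Overlap of correctly answered question ids across the two question
    languages. ASSUMPTION: normalized as intersection over union."""
    union = correct_mother_tongue | correct_english
    if not union:
        return 0.0  # nothing answered correctly in either language
    return len(correct_mother_tongue & correct_english) / len(union)


# Made-up scores, for illustration only.
sample_em = {"en": 50.0, "nl": 40.0, "hi": 30.0, "sw": 20.0}
sample_gap = gap(sample_em, n=2)  # 50.0 - 30.0
sample_mte = mother_tongue_effect(30.0, 25.0)
sample_consistency = consistency({"q1", "q2", "q3"}, {"q2", "q3", "q4"})
```

Keeping each quantity as a plain difference of EM scores mirrors how the captions state them. For the rank correlation in Figure 5(b), the per-language scores on the two data sources could be passed to any standard rank-correlation routine; the captions do not say which correlation coefficient is used.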