MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza

TL;DR

MultiBLiMP 1.0 is a massively multilingual minimal-pair benchmark spanning 101 languages, designed to probe formal syntactic competence. An automated pipeline built on Universal Dependencies (UD) and UniMorph generates more than 128k pairs covering two types of subject-verb agreement. The authors evaluate 42 LLMs and find that model size and language exposure drive performance, while post-training can reduce multilingual grammatical knowledge; the benchmark's language coverage shows a notable Indo-European bias. The work demonstrates a scalable approach to multilingual syntactic benchmarking and highlights the continued value of UD and UniMorph resources for evaluation and typology, while outlining practical implications for model training and tokenization. Overall, MultiBLiMP provides a foundation for cross-linguistic syntactic analysis at scale and motivates targeted efforts to improve performance on low-resource languages through better data, tokenization, and training strategies.

Abstract

We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.

Paper Structure

This paper contains 44 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Gemma3-27B accuracy per language on MultiBLiMP 1.0, plotted against language frequency in Common Crawl. Accuracy is measured based on the model assigning a higher probability to a grammatical sentence over a minimally different but ungrammatical sentence. Languages are coloured by their positive or negative deviation from the general trend of accuracy increasing with corpus frequency, highlighting languages that over- or underperform relative to the amount of resources available for them.
  • Figure 2: Pipeline of the minimal pair creation procedure of MultiBLiMP 1.0
  • Figure 3: Number of minimal pairs per language in MultiBLiMP, split into certain and uncertain agreement cases as identified by the paper's agreement detection procedure. Note the log scale on the y-axis.
  • Figure 4: Distribution of language families present in MultiBLiMP 1.0. See the paper's appendix on language distribution for a detailed version of this figure, including the individual languages.
  • Figure 5: Overall MultiBLiMP accuracy plotted against model size (number of parameters, log scale) for the Llama3, Qwen3, Gemma3, and OLMo2 model families.
  • ...and 3 more figures
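
The accuracy measure described in the Figure 1 caption (the model should assign a higher probability to the grammatical sentence than to its minimally different ungrammatical counterpart) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `minimal_pair_accuracy`, the `log_prob` callback, the English example pairs, and the toy score table are all hypothetical, and the choice to count ties as incorrect is an assumption. In practice `log_prob` would wrap a language model's sentence-level scoring.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs for which the scorer
    prefers the grammatical sentence. Ties count as incorrect (assumption)."""
    total = 0
    correct = 0
    for grammatical, ungrammatical in pairs:
        total += 1
        if log_prob(grammatical) > log_prob(ungrammatical):
            correct += 1
    if total == 0:
        raise ValueError("no minimal pairs given")
    return correct / total


# Hypothetical scores standing in for a real model's sentence log-probabilities.
TOY_LOG_PROBS = {
    "The dogs bark.": -12.0,
    "The dogs barks.": -15.5,
    "The cat sleeps.": -11.0,
    "The cat sleep.": -10.5,  # the toy "model" gets this pair wrong
}


def toy_log_prob(sentence: str) -> float:
    """Look up a made-up score; a real scorer would query a language model."""
    return TOY_LOG_PROBS[sentence]


example_pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("The cat sleeps.", "The cat sleep."),
]

if __name__ == "__main__":
    print(minimal_pair_accuracy(example_pairs, toy_log_prob))
```

Passing the scorer as a callback keeps the metric independent of any particular model or tokenizer, so the same function could be reused per language to produce the kind of per-language accuracy shown in Figure 1.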