MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza
TL;DR
MultiBLiMP 1.0 is a massively multilingual minimal-pair benchmark that probes formal syntactic competence in 101 languages. An automated pipeline built on Universal Dependencies (UD) and UniMorph generates more than 128k pairs covering two types of subject-verb agreement. The authors evaluate 42 LLMs and find that performance tracks model size and language exposure, while post-training can reduce multilingual grammatical knowledge. The benchmark's coverage is notably biased toward Indo-European languages. The work shows that multilingual syntactic benchmarking can be scaled through automation, that UD and UniMorph remain valuable resources for evaluation and typology, and that the results carry practical implications for model training and tokenization. Overall, MultiBLiMP provides a foundation for cross-linguistic syntactic analysis at scale and motivates targeted efforts to improve low-resource languages through better data, tokenization, and training strategies.
Abstract
We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs. It covers 101 languages and two types of subject-verb agreement, and contains more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates the abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
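
To make the notion of a minimal-pair benchmark concrete, the sketch below scores a model on pairs of sentences that differ only in subject-verb agreement. It assumes the common convention that a pair counts as correct when the model assigns a higher score (for example, a log-probability) to the grammatical sentence; the text above does not spell out the scoring rule, so treat that as an assumption. The names `minimal_pair_accuracy`, `toy_score`, and `TOY_BIGRAM_LOGPROB` are hypothetical, and the toy scorer merely stands in for a real language model.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]  # (grammatical sentence, ungrammatical sentence)


def minimal_pair_accuracy(pairs: Iterable[Pair], score: Callable[[str], float]) -> float:
    """Fraction of pairs where the grammatical sentence gets the higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical bigram log-probabilities, used only to keep the example runnable.
# A real evaluation would obtain sentence scores from an actual language model.
TOY_BIGRAM_LOGPROB = {
    ("dogs", "bark"): -1.0,
    ("dogs", "barks"): -5.0,
    ("dog", "barks"): -1.0,
    ("dog", "bark"): -5.0,
}


def toy_score(sentence: str) -> float:
    """Sum bigram log-probabilities; unseen bigrams get a flat penalty of -3.0."""
    tokens = sentence.lower().split()
    return sum(TOY_BIGRAM_LOGPROB.get(bigram, -3.0) for bigram in zip(tokens, tokens[1:]))


pairs = [
    ("The dogs bark", "The dogs barks"),  # plural subject, number agreement
    ("The dog barks", "The dog bark"),    # singular subject, number agreement
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
print(f"accuracy = {accuracy:.2f}")
```

Because the scorer is passed in as a plain callable, the same accuracy function could be reused per language or per agreement type by swapping in a different model-backed scoring function.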
