BLiMP: The Benchmark of Linguistic Minimal Pairs for English
Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, Samuel R. Bowman
TL;DR
BLiMP is a large-scale, automatically generated benchmark of minimal pairs for measuring how much English grammar a language model knows. It organizes 67 paradigms into 12 phenomena and validates the labels against human judgments, so it yields both an overall score and fine-grained, phenomenon-level diagnostics. The study shows that the evaluated models reliably capture morphological agreement but struggle with semantic restrictions on the distribution of negative polarity items and quantifiers, and with island phenomena. Training data size emerges as a key driver of performance. BLiMP thus serves as a scalable, linguistically informed tool for tracking grammatical knowledge across models and for guiding future evaluation and development.
Abstract
We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP), a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars, and aggregate human agreement with the labels is 96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs. We find that state-of-the-art models identify morphological contrasts reliably, but they struggle with semantic restrictions on the distribution of quantifiers and negative polarity items and subtle syntactic phenomena such as extraction islands.
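To make the minimal-pair setup concrete, here is a small Python sketch of how such a benchmark can be scored. It assumes a model is counted correct on a pair when it assigns a higher probability to the acceptable sentence than to the unacceptable one. The toy bigram model, the four-sentence corpus, the two example pairs, and the names `train_bigram_lm` and `minimal_pair_accuracy` are all hypothetical illustrations, not BLiMP data or the authors' code.

```python
import math
from collections import Counter


def train_bigram_lm(corpus):
    """Return a function giving an add-one-smoothed bigram log-probability."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])  # counts of each token as a bigram context
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab_size = len(vocab)

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
            for a, b in zip(tokens[:-1], tokens[1:])
        )

    return log_prob


def minimal_pair_accuracy(pairs, log_prob):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical toy data: a tiny training corpus and two agreement pairs.
corpus = ["the cats sleep", "the cat sleeps", "the dogs bark", "the dog barks"]
pairs = [
    ("the cats sleep", "the cats sleeps"),
    ("the dog barks", "the dog bark"),
]

log_prob = train_bigram_lm(corpus)
accuracy = minimal_pair_accuracy(pairs, log_prob)
print(f"accuracy on toy pairs: {accuracy:.2f}")
```

Any model that can score a full sentence can be plugged in as `log_prob`. Computing the same accuracy separately over each sub-dataset would give phenomenon-level diagnostics alongside the overall score.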
