TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs

Ezgi Başar, Francesca Padovani, Jaap Jumelet, Arianna Bisazza

TL;DR

TurBLiMP is the first Turkish benchmark of linguistic minimal pairs, assessing monolingual and multilingual language models across 16 phenomena with 1,000 minimal pairs each, plus 2,000 paradigm pairs to probe word order and subordination. It leverages a three-stage benchmark creation pipeline (manual drafting, semi-automatic augmentation with masked Turkish LMs, and morphology-aware automatic augmentation) and includes 30 native speaker acceptability judgments to calibrate human perception. Key findings show that many models struggle on nontrivial Turkish syntactic phenomena, with monolingual Turkish models sometimes outperforming multilingual counterparts and with clear sensitivities to word order and morphological subordination that diverge from human judgments in places. TurBLiMP thus provides a typologically informed resource for diagnosing linguistic competence in Turkish and motivates further research into morphology-driven syntax evaluation and model robustness.

Abstract

We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1,000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
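As a rough illustration of how a minimal-pair benchmark is typically scored, the sketch below counts a pair as correct when a model scores the acceptable sentence above the unacceptable one. This is the usual convention for BLiMP-style benchmarks and is assumed here, not taken from the text above. The function name `minimal_pair_accuracy`, the `toy_scores` table, and the placeholder sentence labels are hypothetical; a real scorer would return a sentence log-probability from a language model.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical, made-up scores standing in for model log-probabilities.
toy_scores = {
    "good_1": -10.0,
    "bad_1": -12.5,
    "good_2": -9.0,
    "bad_2": -8.0,
}

# Placeholder labels stand in for real sentences.
toy_pairs = [("good_1", "bad_1"), ("good_2", "bad_2")]

accuracy = minimal_pair_accuracy(toy_pairs, toy_scores.__getitem__)
print(accuracy)
```

In this toy setup the scorer prefers the acceptable sentence in the first pair only, so half of the pairs count as correct. Ties count as errors here, which is a design choice, not something the source specifies.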

Paper Structure

This paper contains 39 sections, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: Mean acceptability judgments for 16 TurBLiMP phenomena. Likert scale ratings are transformed to z-scores. Error bars show standard errors of the mean.
  • Figure 2: Correlation between the BERTurk model and human acceptability judgments across phenomena (Pearson's $r = 0.65$, $p = 0.007$). Each data point corresponds to the average difference per phenomenon.
  • Figure 3: Informed consent form and instructions.
  • Figure 4: $\beta$ coefficients fitted for the BERTurk, EuroLLM, Goldfish, and Qwen 2.5 models with sentence length and subword count differences as the predictors.
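
The statistics named in the captions for Figures 1 and 2 (z-scored Likert ratings, standard errors of the mean, and Pearson's $r$) can be sketched with the standard library alone. This is a minimal illustration under assumptions: the captions do not say whether ratings are normalized per rater or over the whole pool, so `z_scores` simply standardizes whatever list it is given, and all input values below are made up.

```python
import math
from statistics import mean, pstdev, stdev
from typing import List, Sequence


def z_scores(ratings: Sequence[float]) -> List[float]:
    """Standardize ratings to zero mean and unit (population) variance."""
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        # All ratings identical: no spread to normalize by.
        return [0.0 for _ in ratings]
    return [(r - mu) / sigma for r in ratings]


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the mean, using the sample standard deviation."""
    return stdev(values) / math.sqrt(len(values))


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


# Made-up Likert ratings (1 to 5) from one hypothetical rater.
likert = [1, 2, 3, 4, 5]
normalized = z_scores(likert)

# Made-up per-phenomenon differences for a model and for humans.
model_diff = [1.0, 2.0, 3.0]
human_diff = [2.0, 4.0, 6.0]
r = pearson_r(model_diff, human_diff)
```

The sketch does not compute a p-value; that would need a t-distribution, for example from SciPy, and is left out to keep the example dependency-free.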