RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs

Ekaterina Taktasheva, Maxim Bazhukov, Kirill Koncha, Alena Fenogenova, Ekaterina Artemova, Vladislav Mikhailov

Abstract

Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying linguistic perturbations to automatically annotated sentences from open text corpora and carefully curating test data. We describe the data collection protocol and present the results of evaluating 25 language models in various scenarios. We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense. RuBLiMP, the codebase, and other materials are publicly available.

Paper Structure (104 sections, 2 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Overview of RuBLiMP's minimal pair generation approach. Example: Vpervye kosmonavt spal v nevesomosti "For the first time an astronaut slept in zero gravity". (a) Extract sentences from publicly available corpora of Wikipedia texts, news articles, and books. (b) Annotate each extracted sentence in the Universal Dependencies scheme (Nivre et al., 2017) with a multidomain morphosyntactic parser for Russian (Anastasyev, 2020). (c) Search the dependency trees for specific lexical units and linguistic structures and apply expert-written perturbation rules to create a pool of minimal pairs for a target paradigm. (d) Compute Min-K% Prob (Shi et al., 2023) for each grammatical sentence in the pool using a set of LMs. Select $t$ (the threshold for the maximum Min-K% Prob value) that yields an intersection of 1k minimal pairs across the LMs. The minimal pairs in the intersection contain grammatical sentences that are not detected as the LMs' pretraining examples.
  • Figure 2: Distribution of phenomena in RuBLiMP.
  • Figure 3: $\Delta$-scores ($\downarrow$) for each LM and K%$\in \{30, 40, 50, 60\}$. All values are in %.
  • Figure 4: Results on RuBLiMP for the monolingual LMs per domain, grouped into seven quantile bins by length.
  • Figure 5: Results on RuBLiMP for the multilingual encoder-only LMs per domain, grouped into seven quantile bins by length.
  • ...and 1 more figures
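
Step (d) in the Figure 1 caption can be illustrated with a small sketch. It assumes that per-token log-probabilities for each grammatical sentence have already been obtained from each LM, that Min-K% Prob is the mean log-probability of the K% least likely tokens, and that a sentence counts as "not detected" when its score is at most $t$. The function names, the toy scores, and the rule of taking the smallest $t$ that reaches the target intersection size are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, List, Optional, Set


def min_k_prob(token_logprobs: List[float], k_percent: int = 30) -> float:
    """Mean log-probability of the k_percent% least likely tokens.

    Assumption: token_logprobs holds one log-probability per token of a
    sentence, computed beforehand with some LM (not shown here).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, math.ceil(len(token_logprobs) * k_percent / 100))
    lowest = sorted(token_logprobs)[:n]  # the n least likely tokens
    return sum(lowest) / n


def undetected_intersection(
    scores: Dict[str, Dict[int, float]], t: float
) -> Set[int]:
    """Ids of pairs whose grammatical sentence scores <= t under every LM.

    scores maps a (hypothetical) LM name to {pair_id: Min-K% Prob}.
    """
    per_lm = [
        {pair_id for pair_id, score in lm_scores.items() if score <= t}
        for lm_scores in scores.values()
    ]
    return set.intersection(*per_lm) if per_lm else set()


def select_threshold(
    scores: Dict[str, Dict[int, float]], target_size: int
) -> Optional[float]:
    """Smallest observed score t whose intersection reaches target_size.

    Assumption: the selection rule is illustrative; the caption only says
    that t is chosen so that the intersection contains 1k pairs.
    """
    candidates = sorted({s for lm in scores.values() for s in lm.values()})
    for t in candidates:
        if len(undetected_intersection(scores, t)) >= target_size:
            return t
    return None  # no threshold yields a large enough intersection


# Toy usage with made-up scores for two hypothetical LMs and three pairs.
toy_scores = {
    "lm_a": {0: -3.0, 1: -1.0, 2: -2.0},
    "lm_b": {0: -2.5, 1: -0.5, 2: -3.5},
}
t = select_threshold(toy_scores, target_size=2)
kept = undetected_intersection(toy_scores, t) if t is not None else set()
```

In this toy setting the target size is 2 rather than 1k. Lower scores are treated as weaker evidence that a sentence was seen during pretraining, so only pairs that fall at or below the threshold for all LMs are kept.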