QFrBLiMP: a Quebec-French Benchmark of Linguistic Minimal Pairs

David Beauchemin, Pier-Luc Veilleux, Johanna-Pascale Roy, Richard Khoury

TL;DR

QFrBLiMP introduces the first Quebec-French linguistic minimal-pairs benchmark, comprising 1,761 pairs across 20 phenomena drawn from an official normative source (BDL). The dataset includes twelve human judgments per pair, enabling direct comparisons between human grammatical intuitions and 77 open-source LLMs evaluated via per-pair accuracy and perplexity-based ranking. Results show a robust scaling law: model size improves grammatical competence, but deep semantic phenomena (lexical semantics and orphaned prepositions) remain unsolved and significantly lag human judgment. The study finds that French specialization and instruction-tuning offer limited or even negative gains on formal grammar, and demonstrates a meaningful gap between LLMs and humans on tasks requiring semantic understanding, highlighting a key area for future alignment and data augmentation in Quebec-French NLP. The benchmark also correlates with MultiBLiMP while providing more focused, artifact-reduced evaluation of core syntactic abilities in Quebec-French.

Abstract

In this paper, we introduce the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), a corpus designed to evaluate the linguistic knowledge of LLMs on prominent grammatical phenomena in Quebec-French. QFrBLiMP consists of 1,761 minimal pairs annotated with 20 linguistic phenomena. The minimal pairs were created by manually modifying sentences extracted from an official online resource maintained by a Québec government institution. Each pair is annotated by twelve native Quebec-French speakers, who select which of the two sentences they judge to be grammatical. These annotations are used to compare the competency of LLMs with that of humans. We evaluate different LLMs on QFrBLiMP and MultiBLiMP-Fr by measuring, for each category, how often a model assigns a higher probability to the grammatical sentence of a minimal pair. We find that while grammatical competence scales with model size, a clear hierarchy of difficulty emerges. All benchmarked models consistently fail on phenomena requiring deep semantic understanding, revealing a critical limitation and a significant gap compared to human performance on these specific tasks.
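The probability-comparison evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function `minimal_pair_accuracy`, the stand-in scorer `toy_log_prob`, the example sentences, and all score values are hypothetical and invented for the example. In practice the scorer would return a sentence log-probability computed by a language model.

```python
from typing import Callable, Iterable, Tuple

# A minimal pair: (grammatical sentence, ungrammatical sentence).
Pair = Tuple[str, str]


def minimal_pair_accuracy(
    pairs: Iterable[Pair],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of pairs where the grammatical sentence scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


# Hypothetical scores standing in for a real model's sentence log-probabilities.
# The sentences are illustrative and are NOT taken from QFrBLiMP.
TOY_SCORES = {
    "Elle est partie hier.": -12.0,
    "Elle est parti hier.": -15.5,
    "Les enfants jouent dehors.": -14.0,
    "Les enfants joue dehors.": -13.0,
}


def toy_log_prob(sentence: str) -> float:
    """Assumed stand-in scorer: looks up a fixed, made-up log-probability."""
    return TOY_SCORES[sentence]


PAIRS = [
    ("Elle est partie hier.", "Elle est parti hier."),
    ("Les enfants jouent dehors.", "Les enfants joue dehors."),
]

if __name__ == "__main__":
    print(minimal_pair_accuracy(PAIRS, toy_log_prob))
```

With these made-up scores, the toy scorer prefers the grammatical sentence in the first pair only. Grouping the pairs by phenomenon before calling the function would give a per-category accuracy of the kind the abstract refers to.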

Paper Structure

This paper contains 97 sections, 1 equation, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Snippet of the translated BDL article for present participles "including" and "excluding".
  • Figure 2: Comparison between model size and QFrBLiMP accuracy. The blue solid line represents a log-transformed linear data fit, while the green and red dashed lines represent the human and random baselines, respectively.
  • Figure 3: Comparison of LM performance on the MultiBLiMP and QFrBLiMP benchmarks. The blue solid line represents performance parity ($y=x$).
  • Figure 4: The Prodigy annotation interface (in French) used by the annotators to evaluate the minimal pairs.