A Set of Quebec-French Corpus of Regional Expressions and Terms

David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury

TL;DR

This work introduces two Quebecois French idiom benchmarks, QFrCoRE and QFrCoRT, to probe dialect understanding via idioms. It provides a replicable data collection methodology, including manual curation and AI-generated distractors with quality checks, and evaluates 94 LLMs in a zero-shot setting. The results reveal that many models struggle with regional lexical knowledge and that accessibility (proprietary vs. open) strongly impacts performance. Key findings show that dialectal competence is not tightly tied to model size or generic reasoning capabilities, and that French-language fine-tuning can harm performance on Quebecois idioms. The paper argues for dialect-focused benchmarks to quantify the dialect gap and outlines future work to extend to other dialects, incorporate human baselines, and address ethical considerations around data provenance and representational harm.

Abstract

The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them and using regional idioms as a test of dialect understanding. To this end, we propose two new benchmark datasets for the Quebec dialect of French: QFrCoRE, which contains 4,633 instances of idiomatic phrases, and QFrCoRT, which comprises 171 regional instances of idiomatic words. We explain how to construct these corpora, so that our methodology can be replicated for other dialects. Our experiments with 94 LLMs demonstrate that our regional idiom benchmarks are a reliable tool for measuring a model's proficiency in a specific dialect.

Paper Structure

This paper contains 42 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The prompt templates (translated from French) used for the zero-shot evaluation of our two benchmarks. Each prompt consists of a system message providing the instruction and a user message containing the input placeholder for the data instance. Blue boxes contain the task instructions. Yellow boxes contain the prefix for the model to continue. Text in "$\ll\gg$" denotes the role tags fed to the model.
  • Figure 2: Accuracy plot of all 94 models tested, with performance on QFrCoRT on the x-axis and on QFrCoRE on the y-axis. Black dashed lines are our Random baseline scores. Red dots are models that performed worse than the baseline on one of the corpora, green dots are models that scored above 65% on both corpora, and blue dots are models that fit neither of those two performance classes. Scores are accuracy (Acc.) (%).
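
The colour coding described in the Figure 2 caption can be read as a simple three-way rule. The sketch below is only an illustration of that rule, not the authors' plotting code: the function name `performance_class`, its parameters, and the baseline values in the usage lines are hypothetical, and it assumes that "on one of the corpora" means "on at least one". The Random baseline scores are not stated in this summary, so they are passed in as arguments.

```python
def performance_class(
    acc_qfrcort: float,
    acc_qfrcore: float,
    baseline_qfrcort: float,
    baseline_qfrcore: float,
    high_threshold: float = 65.0,
) -> str:
    """Return the Figure 2 colour class for one model.

    All accuracies are percentages. The baselines are the Random baseline
    scores for each corpus; their values are NOT given in this summary.
    """
    # Red: worse than the Random baseline on at least one corpus (assumed reading).
    if acc_qfrcort < baseline_qfrcort or acc_qfrcore < baseline_qfrcore:
        return "red"
    # Green: better than 65% accuracy on both corpora.
    if acc_qfrcort > high_threshold and acc_qfrcore > high_threshold:
        return "green"
    # Blue: everything that fits neither of the two classes above.
    return "blue"


# Placeholder baselines of 10% per corpus, chosen only to make the example run.
print(performance_class(5.0, 80.0, 10.0, 10.0))
print(performance_class(70.0, 90.0, 10.0, 10.0))
print(performance_class(70.0, 40.0, 10.0, 10.0))
```

Checking the baseline condition first means a model that falls below chance on either corpus is always red, even if it scores highly on the other one.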