COLE: a Comprehensive Benchmark for French Language Understanding Evaluation
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
TL;DR
COLE presents a 23-task French NLU benchmark evaluated across 94 LLMs to map the current landscape of French language understanding. It emphasizes three task groups (single-sentence, similarity/paraphrase, inference) and uses a GLUE-style composite score to enable cross-task comparisons, highlighting a pronounced divide between closed- and open-weight models. Key findings show strong semantic and grammatical capabilities but persistent gaps in zero-shot extractive QA, regional language variation, and fine-grained word sense disambiguation, with dialectal data (Québec French) driving notable performance gaps. The work provides a public resource to drive progress in French LLMs, while acknowledging limitations such as data contamination, reliance on translations, and the need for dialect-specific analyses to ensure robust, fair evaluation across French varieties.
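As a rough illustration of what a GLUE-style composite score involves, the sketch below averages the metrics within each task and then takes an unweighted mean across tasks. This follows the general GLUE convention and is not necessarily the exact aggregation used in COLE. The function name `composite_score`, the task names, and all score values are hypothetical.

```python
from statistics import mean


def composite_score(task_scores: dict[str, list[float]]) -> float:
    """Unweighted macro-average: mean of metrics per task, then mean over tasks.

    Assumption: every task contributes equally, as in GLUE-style aggregation.
    """
    if not task_scores:
        raise ValueError("at least one task is required")
    per_task = [mean(metrics) for metrics in task_scores.values()]
    return mean(per_task)


# Hypothetical scores, for illustration only (not results from the paper).
scores = {
    "sentiment": [0.90],
    "paraphrase": [0.80, 0.70],  # e.g. accuracy and F1, averaged first
    "inference": [0.60],
}
print(round(composite_score(scores), 2))
```

Averaging within a task first keeps tasks that report several metrics from weighing more than single-metric tasks in the final score.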
Abstract
To address the need for a more comprehensive evaluation of French Natural Language Understanding (NLU), we introduce COLE, a new benchmark composed of 23 diverse tasks covering a broad range of NLU capabilities, including sentiment analysis, paraphrase detection, grammatical judgment, and reasoning, with a particular focus on linguistic phenomena relevant to the French language. We benchmark 94 large language models (LLMs), providing an extensive analysis of the current state of French NLU. Our results highlight a significant performance gap between closed- and open-weight models and identify key challenging frontiers for current LLMs, such as zero-shot extractive question-answering (QA), fine-grained word sense disambiguation, and understanding of regional language variations. We release COLE as a public resource to foster further progress in French language modelling.
