Evaluating Polish linguistic and cultural competency in large language models

Sławomir Dadas, Małgorzata Grębowiec, Michał Perełkiewicz, Rafał Poświata

TL;DR

The paper addresses how to assess Polish linguistic and cultural competence in LLMs, arguing that cultural context is essential for accurate language understanding. It introduces a 600-question benchmark spanning six categories (history, geography, culture & tradition, art & entertainment, grammar, vocabulary) and verifies answers with a deterministic, rule-based grading system that normalizes responses before checking them. In an evaluation of over 30 open-weight and commercial LLMs, commercial models generally outperform open-weight ones, with top performers reaching around 83% accuracy, while language-specific models such as Bielik-2.3 show substantial gains attributed to Polish-focused pretraining. The benchmark, accompanied by a public leaderboard, offers a low-cost way to track progress in Polish linguistic and cultural competence and highlights the value of language-centric data for improving cultural understanding in LLMs.
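To make the idea of deterministic, rule-based grading with normalization concrete, here is a minimal sketch. It is not the authors' grader: the function names `normalize` and `grade`, the specific normalization steps (Unicode normalization, case folding, punctuation removal, whitespace collapsing), and the whole-phrase matching rule are all assumptions chosen for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Illustrative normalization (assumed rules, not the benchmark's own):
    Unicode NFKC, case folding, punctuation removal, whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text).casefold()
    # Keep letters (including Polish diacritics) and digits; turn everything else into spaces.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(cleaned.split())


def grade(response: str, accepted: list[str]) -> bool:
    """Hypothetical rule: the response is correct if any accepted answer
    occurs in it as a whole-word phrase after normalization."""
    padded = f" {normalize(response)} "
    return any(f" {normalize(answer)} " in padded for answer in accepted)


if __name__ == "__main__":
    print(grade("Stolicą Polski jest Warszawa.", ["Warszawa"]))
    print(grade("To Kraków!", ["Warszawa"]))
```

Because the check involves no model calls or randomness, the same response always receives the same verdict, which is what keeps this style of evaluation cheap to rerun as new models appear.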

Abstract

Large language models (LLMs) are becoming increasingly proficient in processing and generating multilingual texts, which allows them to address real-world problems more effectively. However, language understanding is a far more complex issue that goes beyond simple text analysis. It requires familiarity with cultural context, including references to everyday life, historical events, traditions, folklore, literature, and pop culture. A lack of such knowledge can lead to misinterpretations and subtle, hard-to-detect errors. To examine language models' knowledge of the Polish cultural context, we introduce the Polish linguistic and cultural competency benchmark, consisting of 600 manually crafted questions. The benchmark is divided into six categories: history, geography, culture & tradition, art & entertainment, grammar, and vocabulary. As part of our study, we conduct an extensive evaluation involving over 30 open-weight and commercial LLMs. Our experiments provide a new perspective on Polish competencies in language models, moving past traditional natural language processing tasks and general knowledge assessment.

Paper Structure

This paper contains 7 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Example showing various types of errors made by a multilingual model that was undertrained on Polish data. The user prompt is: How do you protect yourself from cold and flu? (Jak chronić się przed przeziębieniem i grypą?). In the model's response, we can observe the following errors: a) morphological (green) - using the wrong form of a word; b) lexical (orange) - using a word that does not fit the context; c) word-formation (blue) - attempt to create a new word by imitating a word in another language, usually English; d) changing the language of the answer completely (yellow). The example includes snippets of the response generated by the Qwen2.5 7B model, using the interface provided by the model's authors. For larger Qwen2.5 models the frequency of such errors decreases.
  • Figure 2: Distribution of questions by category and subcategory in our benchmark.
  • Figure 3: Comparison of benchmark scores for different versions of the same model.