LLMzSzŁ: a comprehensive LLM benchmark for Polish

Krzysztof Jassem; Michał Ciesiółka; Filip Graliński; Piotr Jabłoński; Jakub Pokrywka; Marek Kubis; Monika Jabłońska; Ryszard Staruch

TL;DR

LLMzSzŁ presents a comprehensive Polish LLM benchmark built from nationwide exams published by the Polish Central Examination Board, spanning four exam categories across 154 domains with about 19,000 closed-ended questions and timestamps to control data leakage. The authors evaluate a broad set of open-weight multilingual, English, and Polish LLMs using an MMLU-like evaluation harness, analyzing performance by model size, language, release date, and instruction tuning, and comparing against human pass rates. They find that multilingual models generally surpass monolinguals, with larger models and instruction-tuned variants delivering the strongest results, while small Polish-specific models provide better cost-performance trade-offs. The work also explores the potential of LLMs to aid in exam validation by identifying anomalies and errors in questions, though it acknowledges limitations such as data contamination risk and the restriction to formal, closed-question tasks. Overall, the benchmark represents the largest, time-stamped, exam-based resource for evaluating Polish LLMs and offers practical insights for education-tech applications and future benchmark development.

Abstract

This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. It covers 4 types of exams, coming from 154 domains. Altogether, it consists of almost 19k closed-ended questions. We investigate the performance of open-source multilingual, English, and Polish LLMs to verify LLMs' abilities to transfer knowledge between languages. We also examine the correlation between model accuracy and human exam pass rates. We show that multilingual LLMs can obtain superior results over monolingual ones; however, monolingual models may be beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting with exam validation, particularly in identifying anomalies or errors in examination tasks.
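To make the role of exam dates concrete, the following is a minimal sketch of how accuracy on closed-ended questions could be restricted to exams published after a model's release date, so that the scored questions cannot have appeared in its training data. It is an illustration under stated assumptions, not the benchmark's evaluation harness: the `Question` schema, the `accuracy_after` helper, and the toy predictor are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    """One closed-ended exam question (hypothetical schema, not the benchmark's)."""
    text: str
    options: Tuple[str, ...]
    answer_index: int  # index of the correct option
    exam_date: date    # date the exam was published


def accuracy_after(
    questions: Sequence[Question],
    predict: Callable[[Question], int],
    cutoff: date,
) -> Optional[float]:
    """Accuracy on questions from exams published strictly after `cutoff`.

    Returns None when no question is newer than the cutoff.
    """
    unseen = [q for q in questions if q.exam_date > cutoff]
    if not unseen:
        return None
    correct = sum(predict(q) == q.answer_index for q in unseen)
    return correct / len(unseen)


# Toy data: three questions with four options each (contents are placeholders).
OPTIONS = ("A", "B", "C", "D")
questions = [
    Question("Q1", OPTIONS, 0, date(2015, 5, 1)),
    Question("Q2", OPTIONS, 1, date(2023, 5, 1)),
    Question("Q3", OPTIONS, 2, date(2024, 5, 1)),
]


def always_second(_: Question) -> int:
    """Stand-in for a model: always picks the second option."""
    return 1


# Only Q2 and Q3 are newer than the assumed release date; one of them is answered correctly.
score = accuracy_after(questions, always_second, cutoff=date(2022, 1, 1))
print(score)  # 0.5
```

In a real setting, `predict` would wrap a model call that selects one option per question, and `cutoff` would be the model's release date or training cutoff.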

Paper Structure (26 sections, 2 figures, 8 tables)

Introduction
Related Work
Dataset
Dataset preparation
Evaluation harness
Dataset availability
Evaluation Results
Model size
Model language
Release date
Instruct vs non-instruct models
Detailed evaluation analysis
Evaluating the human-prepared exams by LLMs
Middle school exams
High school exams
...and 11 more sections

Figures (2)

Figure 1: Models' accuracy against their size. The points are jittered in the X-axis for better readability. The red dotted line represents the random guess baseline.
Figure 2: Models' accuracy against their release date.
