LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal
Michał Karp, Anna Kubaszewska, Magdalena Król, Robert Król, Aleksander Smywiński-Pohl, Mateusz Szymański, Witold Wydmański
TL;DR
This paper empirically evaluates whether contemporary LLMs can pass the qualifying exam for Poland's National Appeal Chamber and whether an LLM-based examiner can reliably judge model outputs. It deploys a hybrid retrieval pipeline and several LLMs (including GPT-4.1, Claude 4 Sonnet, and Bielik-11B-v2.6) on both a knowledge test and a formal judgment-writing task, under closed-book and retrieval-augmented settings. The models achieve satisfactory scores on the knowledge test, but none meets the passing threshold on the written judgment. LLM-based evaluation diverges from human judgments, often inflating scores and masking substantive deficiencies. The study concludes that current LLMs cannot replace human adjudicators, though they offer potential as augmentation tools under strict human oversight and iterative collaboration between legal and technical experts.
Abstract
This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates, and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which consists of a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information retrieval and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet, and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation (RAG) settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the 'LLM-as-a-judge' evaluations often diverged from the assessments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, and weaknesses in logical argumentation. They also stress the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.
