LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

Michał Karp, Anna Kubaszewska, Magdalena Król, Robert Król, Aleksander Smywiński-Pohl, Mateusz Szymański, Witold Wydmański

TL;DR

This paper empirically evaluates whether contemporary LLMs can pass Poland's National Appeal Chamber qualification exam and whether an LLM-based examiner can reliably judge model outputs. It deploys a hybrid retrieval pipeline and three leading models (GPT-4.1, Claude 4 Sonnet, Bielik-11B-v2.6) across both a knowledge test and a formal judgment-writing task, under closed-book and retrieval-augmented settings. Results show models achieve limited success on the knowledge portion but fail to produce legally coherent judgments; LLM-based evaluation diverges from human judgments, often inflating scores and masking substantive deficiencies. The study concludes that current LLMs cannot replace human adjudicators, though they offer potential as augmentation tools under strict human oversight and iterative collaboration between legal and technical experts.

Abstract

This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information retrieval and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the 'LLM-as-a-judge' evaluations often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.

Paper Structure

This paper contains 56 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: LLM Prompt for Question Evaluation -- original (Polish) and translated to English. The Polish version was used for all LLMs.
  • Figure 2: A simplified version of § 2 of the Regulation of the Minister of Finance used in determining the expected fee. The text was originally in Polish, but it has been translated for the sake of linguistic consistency in this article.
  • Figure 3: JSON Schema used to extract deadline-related information. The content of the "description" keys was originally in Polish, but, as in Fig. 2, they have been translated for the sake of linguistic consistency in this article.
  • Figure H: Prompt used to extract appealDate from the factual description. The part that can be called the system prompt is the default content of the n8n framework. It can be modified, but in this case we did not consider it necessary. A JSON Schema was provided to the model along with the prompt. The description of the facts provided as input was inserted in place of the "Description of the facts in Polish" placeholder.
  • Figure I: JSON Schema used to extract appealDate from text.
  • ...and 3 more figures