Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski; Stephan Kleber

Abstract

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. Automating the analysis in this way scales up the complex evaluation of the victim models' free-form text outputs, because judgments are faster and more consistent than those of human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. As a comparatively new technique, LLMs as judges have not yet been thoroughly investigated for their reliability and their agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. As assessors, we test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task. As the assessment objective, we curate datasets for eight different categories of judgment tasks, with corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
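To make the comparison against human ground truth concrete, the following is a minimal sketch of two measures named in the figure captions, the $F_1$-score and percent agreement over repeated runs. It rests on assumptions that this excerpt does not confirm: verdicts are binary, `True` is the positive class, and percent agreement is read as the share of items on which all runs return the same verdict. The function names and the toy data are hypothetical and do not come from the paper.

```python
from typing import Sequence


def f1_score(truth: Sequence[bool], verdicts: Sequence[bool]) -> float:
    """F1 of judge verdicts against human labels (True = positive class)."""
    if len(truth) != len(verdicts):
        raise ValueError("truth and verdicts must have the same length")
    pairs = list(zip(truth, verdicts))
    tp = sum(t and v for t, v in pairs)          # judge and human both say True
    fp = sum((not t) and v for t, v in pairs)    # judge says True, human says False
    fn = sum(t and (not v) for t, v in pairs)    # judge says False, human says True
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def percent_agreement(runs: Sequence[Sequence[bool]]) -> float:
    """Share of items on which every run produced the same verdict (assumed definition)."""
    items = list(zip(*runs))  # regroup verdicts per item across runs
    if not items:
        return 0.0
    unanimous = sum(len(set(item)) == 1 for item in items)
    return unanimous / len(items)


# Toy data, invented for illustration only.
human = [True, True, False, False, True]
judge_runs = [
    [True, False, False, True, True],
    [True, False, False, False, True],
    [True, True, False, True, True],
]

f1_first_run = f1_score(human, judge_runs[0])
agreement = percent_agreement(judge_runs)
```

Keeping the two measures separate reflects that they answer different questions: the first compares a judge with human labels, the second compares a judge with itself across runs.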

Paper Structure (40 sections, 22 figures, 2 tables)

This paper contains 40 sections, 22 figures, 2 tables.

Introduction
Contributions
Related Work
Methodology
Datasets
Judge Models
Judge Prompt Templates
Parameters and Metrics
Evaluation of Conversational LLMs as Judges
Evaluation of Structured Outputs
Evaluation of Correctness in Two Stages
First Stage
Model Selection
Prompt Selection
Second Stage
...and 25 more sections

Figures (22)

Figure 1: Concept of an LLM as a judge
Figure 2: Our evaluation workflow: 1. generating datasets (blue), 2. judging datasets (orange), 3. evaluating judge verdicts (green).
Figure 3: Overview of evaluated prompt types
Figure 4: $F_1$-scores of model/prompt combinations over all datasets, grouped by prompt type
Figure 5: Percent agreement over five runs, grouped by prompt type
...and 17 more figures
