Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems

Xiaochuan Li, Ke Wang, Girija Gouda, Shubham Choudhary, Yaqun Wang, Linwei Hu, Joel Vaughan, Freddy Lecue

TL;DR

This paper addresses the challenge of scalable, trustworthy evaluation for large language models in high-stakes settings. It introduces LLM Jury-on-Demand, a dynamic framework that learns per-instance judge reliability, assembles an optimal jury for each data point, and weights judgments by predicted reliability to maximize alignment with human judgments. Through extensive experiments on summarization and RAG tasks, the approach consistently outperforms single-judge and static-jury baselines, and analyses reveal how data properties and task type influence judge reliability. The findings underscore the potential of adaptive, learning-based juries to improve evaluation quality and trustworthiness in real-world LLM deployments, while also outlining avenues for future work in generalization and calibration.

Abstract

As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. While human evaluation is reliable, it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand, a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables a fully adaptive evaluation where, for each data point, an optimal jury of the most reliable judges is dynamically selected, and their scores are aggregated using their reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable, and trustworthy evaluation systems for modern LLMs in high-stakes domains.
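The selection-and-aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes the per-instance reliability predictions are already available as numbers (the trained reliability predictors themselves are not reproduced here), and it assumes a normalized weighted mean as the aggregation rule; the abstract states only that reliabilities are used as weights. The function name `jury_on_demand_score`, the judge names, and the example values are hypothetical.

```python
from typing import Dict, List, Tuple


def jury_on_demand_score(
    judge_scores: Dict[str, float],
    predicted_reliability: Dict[str, float],
    k: int = 3,
) -> Tuple[float, List[str]]:
    """Pick the k judges with the highest predicted reliability for one
    data point and return (reliability-weighted mean score, selected jury).

    Assumption: the weighted mean is normalized by the sum of the selected
    judges' reliabilities; the paper may use a different normalization.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not judge_scores:
        raise ValueError("at least one judge score is required")

    # Rank judges by predicted reliability, most reliable first.
    ranked = sorted(
        judge_scores, key=lambda judge: predicted_reliability[judge], reverse=True
    )
    jury = ranked[:k]

    total_weight = sum(predicted_reliability[judge] for judge in jury)
    if total_weight <= 0:
        # Hypothetical fallback: unweighted mean if no judge has positive weight.
        return sum(judge_scores[judge] for judge in jury) / len(jury), jury

    weighted_sum = sum(
        predicted_reliability[judge] * judge_scores[judge] for judge in jury
    )
    return weighted_sum / total_weight, jury


# Illustrative values only; these are not results from the paper.
scores = {"judge_a": 2.0, "judge_b": 1.0, "judge_c": 0.0, "judge_d": 2.0}
reliability = {"judge_a": 0.9, "judge_b": 0.6, "judge_c": 0.2, "judge_d": 0.5}
final_score, jury = jury_on_demand_score(scores, reliability, k=3)
print(jury, round(final_score, 3))
```

Because the jury is re-selected for every data point, a judge that is excluded here (`judge_c`) can still be chosen on another instance where its predicted reliability is higher.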

Paper Structure

This paper contains 39 sections, 22 figures, 24 tables.

Figures (22)

  • Figure 1: Overview of the LLM Jury-on-Demand inference pipeline. The system extracts features from input texts to predict judge reliability, dynamically assembles a jury of the top $K$ most reliable judges for each instance, and calculates a final weighted score.
  • Figure 2: Overall performance comparison over 10 runs. Boxplot of Kendall’s Tau correlation between each evaluation method’s scores and human judgments, aggregated across all datasets for the 6 task-metric combinations. Our Jury-on-Demand system achieves the highest median correlation in nearly all categories and shows the most robust performance.
  • Figure 3: Selection frequency of each judge in the jury. A "top $k$" judge is the judge with the $k$-th highest reliability score in the jury. Claude 3.7 Sonnet and DeepSeek R1 are favored for completeness, while Gemini 2.5 Flash is more often selected for groundedness.
  • Figure 4:
  • Figure 5: RAG Groundedness: Distribution of annotation scores across response lengths. Longer responses tend to receive more scores of 1 (moderately ungrounded), while shorter responses are more often assigned scores of 0 (severely ungrounded) or 2 (fully grounded).
  • ...and 17 more figures
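
The Figure 2 caption reports Kendall's Tau between each evaluation method's scores and human judgments. As a rough illustration of that metric, the sketch below computes the tau-b variant, which corrects for ties. The choice of tau-b is an assumption, since the caption does not say which variant is used, and the function name `kendall_tau_b` and the example values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b rank correlation between two equal-length sequences.

    Pairs tied in both sequences are excluded from the numerator and from
    both denominator terms, as in the standard tau-b definition.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("at least two observations are required")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    # Pairs untied in x, times pairs untied in y.
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    if untied_x == 0 or untied_y == 0:
        raise ValueError("tau-b is undefined when one sequence is constant")
    return (concordant - discordant) / sqrt(untied_x * untied_y)


# Illustrative values only; these are not data from the paper.
human_scores = [0, 1, 2, 2, 1]
method_scores = [0.1, 0.8, 1.9, 1.5, 1.2]
tau = kendall_tau_b(human_scores, method_scores)
print(round(tau, 4))
```

A value near 1 means the method ranks responses in nearly the same order as the human annotators; a value near 0 means the two rankings are essentially unrelated.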