Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems
Xiaochuan Li, Ke Wang, Girija Gouda, Shubham Choudhary, Yaqun Wang, Linwei Hu, Joel Vaughan, Freddy Lecue
TL;DR
This paper addresses the challenge of scalable, trustworthy evaluation for large language models in high-stakes settings. It introduces LLM Jury-on-Demand, a dynamic framework that learns per-instance judge reliability, assembles an optimal jury for each data point, and weights judgments by predicted reliability to maximize alignment with human judgments. Through extensive experiments on summarization and RAG tasks, the approach consistently outperforms single-judge and static-jury baselines, and analyses reveal how data properties and task type influence judge reliability. The findings underscore the potential of adaptive, learning-based juries to improve evaluation quality and trustworthiness in real-world LLM deployments, while also outlining avenues for future work in generalization and calibration.
Abstract
As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. While human evaluation is reliable, it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand, a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables fully adaptive evaluation: for each data point, an optimal jury of the most reliable judges is dynamically selected, and their scores are aggregated using their predicted reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable, and trustworthy evaluation systems for modern LLMs in high-stakes domains.
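The per-instance selection and aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the predicted reliabilities are already available as numbers in [0, 1] (in the paper they come from trained predictors), that the jury is the top-k judges by predicted reliability, and that aggregation is a normalized weighted average. The function names, the judge names, and the jury size `k` are all hypothetical.

```python
from typing import Dict, List, Tuple


def select_jury(reliability: Dict[str, float], k: int) -> List[str]:
    """Return the k judges with the highest predicted reliability.

    Assumption: the jury is chosen by simple top-k ranking.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(reliability, key=lambda name: reliability[name], reverse=True)
    return ranked[:k]


def jury_score(
    scores: Dict[str, float],
    reliability: Dict[str, float],
    k: int,
) -> Tuple[float, List[str]]:
    """Aggregate the selected judges' scores for one data point.

    Assumption: weights are the predicted reliabilities, normalized
    to sum to 1 over the selected jury.
    """
    jury = select_jury(reliability, k)
    total_weight = sum(reliability[j] for j in jury)
    if total_weight <= 0:
        # No usable reliability signal: fall back to an unweighted mean.
        return sum(scores[j] for j in jury) / len(jury), jury
    weighted = sum(reliability[j] * scores[j] for j in jury) / total_weight
    return weighted, jury


# Hypothetical judges, scores, and predicted reliabilities for one data point.
example_scores = {"judge_a": 4.0, "judge_b": 2.0, "judge_c": 5.0}
example_reliability = {"judge_a": 0.9, "judge_b": 0.2, "judge_c": 0.6}

final_score, chosen = jury_score(example_scores, example_reliability, k=2)
print(chosen, round(final_score, 2))
```

In this example the low-reliability `judge_b` is excluded from the jury, and the final score is pulled toward `judge_a`, which has the highest predicted reliability. A static jury would instead use the same judges and weights for every data point.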
