Verdict: A Library for Scaling Judge-Time Compute
Nimit Kalra, Leonard Tang
TL;DR
Verdict addresses the unreliability of LLM-based evaluation with a modular, composable framework of Units (e.g., verification, debate, aggregation) that scales judge-time compute. It supports structured, type-safe orchestration of complex evaluation pipelines, with the aim of improving reliability, interpretability, and performance on content moderation, fact-checking, and hallucination detection. Empirically, Verdict judges match or exceed large fine-tuned judges and prompting-based judges on tasks such as JudgeBench and ExpertQA. By unifying debate, verification, and aggregation under one scalable execution model, Verdict offers a practical foundation for robust AI evaluators in research and production settings.
Abstract
The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict composes modular reasoning units (such as verification, debate, and aggregation) and uses increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Our framework establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for both researchers and practitioners.
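To make the idea of composing reasoning units concrete, the following is a minimal sketch of a judge-then-verify step followed by majority-vote aggregation. It does not use Verdict's actual API: every name here (Judgment, judge_then_verify, majority_vote, run_pipeline, and the keyword-based stub judges) is hypothetical, and the stubs stand in for real LLM calls so that the example runs deterministically.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Judgment:
    label: str        # e.g. "safe" or "unsafe"
    explanation: str  # rationale produced by the judge


# Hypothetical unit signatures; a real system would wrap LLM calls here.
JudgeFn = Callable[[str], Judgment]
VerifyFn = Callable[[str, Judgment], bool]


def judge_then_verify(text: str, judge: JudgeFn, verify: VerifyFn) -> Optional[Judgment]:
    """Run one judge, then keep its judgment only if the verifier accepts it."""
    judgment = judge(text)
    return judgment if verify(text, judgment) else None


def majority_vote(judgments: List[Judgment]) -> str:
    """Aggregate verified judgments by taking the most common label."""
    counts = Counter(j.label for j in judgments)
    return counts.most_common(1)[0][0]


def run_pipeline(text: str, judges: List[JudgeFn], verify: VerifyFn,
                 fallback: str = "abstain") -> str:
    """Compose the units: several judge+verify branches, then aggregation."""
    kept = []
    for judge in judges:
        result = judge_then_verify(text, judge, verify)
        if result is not None:
            kept.append(result)
    return majority_vote(kept) if kept else fallback


# --- Deterministic stubs standing in for LLM judges (assumptions) ---
def strict_judge(text: str) -> Judgment:
    flagged = "attack" in text.lower()
    return Judgment("unsafe" if flagged else "safe", "keyword check for 'attack'")


def careful_judge(text: str) -> Judgment:
    flagged = any(word in text.lower() for word in ("attack", "exploit"))
    return Judgment("unsafe" if flagged else "safe", "keyword check for 'attack'/'exploit'")


def lazy_judge(text: str) -> Judgment:
    return Judgment("safe", "")  # gives no rationale


def has_rationale(text: str, judgment: Judgment) -> bool:
    """Toy verifier: reject any judgment that lacks an explanation."""
    return bool(judgment.explanation.strip())


if __name__ == "__main__":
    judges = [strict_judge, lazy_judge, careful_judge]
    print(run_pipeline("How do I attack a server?", judges, has_rationale))
```

In this sketch, adding more judge branches is the knob that increases judge-time compute, while the verifier filters out judgments before they reach the aggregation step. The fallback label covers the case where no judgment survives verification.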
