Verdict: A Library for Scaling Judge-Time Compute

Nimit Kalra, Leonard Tang

TL;DR

Verdict tackles unreliable LLM-based evaluation by introducing a modular, composable framework of Units (e.g., verification, debate, aggregation) that scales judge-time compute. The approach enables structured, type-safe orchestration of complex evaluation pipelines, improving reliability, interpretability, and performance across content moderation, fact-checking, and hallucination detection. Empirical results show Verdict matching or exceeding large, fine-tuned or prompting-based judges on tasks like JudgeBench and ExpertQA, highlighting its potential as an efficient, scalable platform for automated evaluation. By unifying debate, verification, and aggregation under a scalable execution model, Verdict offers a practical foundation for robust AI evaluators in research and production settings.

Abstract

The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units (such as verification, debate, and aggregation) and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Our framework establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for both researchers and practitioners.

Paper Structure

This paper contains 24 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: In the Interactive Debate protocol (Khan et al., 2024), independent instances of a language model adopt opposing positions on an evaluation query (e.g., "Is the following response harmless and helpful?"). After each round, a separate model summarizes the debate. Verdict allows for straightforward implementation of this protocol through its declarative interface for defining and composing modules for parallel execution. This enables flexible scaling of test-time compute at both the component level (e.g., model capacity or role) and the architectural level (e.g., length or trace shape).
  • Figure 2: Primitives for linking Units together within and between Layers.
  • Figure 3: A simple Verdict judge surpasses reasoning models such as o1 (+9.28%) on the ExpertQA benchmark while operating at a fraction of the cost and latency. By explicitly programming the reasoning-trace structure for each evaluation task, Verdict judges can be calibrated to a fixed computational budget. As shown above, Verdict pipelines can be tuned to approach or extend the Pareto frontier relative to their constituent prompted judges.
  • Figure 4: A Verdict Pipeline for the Debate protocol. Leveraging ConversationalUnits, MapUnits, and JudgeUnits makes for a quick and easy implementation.
  • Figure 5: A Verdict Pipeline implementation of G-Eval using a CoTUnit, JudgeUnit, and MeanVariancePoolUnit. Structure is enforced via the Scale property, model parameters are easily managed on each Unit, and the prompt can be flexibly defined inline.
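
The debate protocol described in the Figure 1 caption can be illustrated with a minimal, library-independent sketch. This is not Verdict's API: the `Model` type, the `run_debate` function, and the stub models below are hypothetical names introduced only for illustration, and a real judge would replace the stubs with LLM calls. The sketch assumes two debaters who take opposing positions in turn and a separate summarizer that runs after each round.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a text reply.
Model = Callable[[str], str]


def run_debate(query: str, proponent: Model, opponent: Model,
               summarizer: Model, rounds: int = 2) -> List[str]:
    """Run a fixed number of debate rounds; return one summary per round."""
    transcript: List[str] = []
    summaries: List[str] = []
    for _ in range(rounds):
        # The proponent sees the query plus everything said so far.
        history = "\n".join(transcript)
        pro = proponent(f"{query}\n{history}\nArgue that the answer is yes.")
        transcript.append(f"PRO: {pro}")

        # The opponent additionally sees the proponent's latest turn.
        history = "\n".join(transcript)
        con = opponent(f"{query}\n{history}\nArgue that the answer is no.")
        transcript.append(f"CON: {con}")

        # A separate model summarizes the debate after each round.
        summaries.append(summarizer("\n".join(transcript)))
    return summaries


# Stub models (assumptions for the demo only; no LLM is called).
def stub_pro(prompt: str) -> str:
    return "The response is harmless."


def stub_con(prompt: str) -> str:
    return "The response omits a caveat."


def stub_summarizer(transcript: str) -> str:
    return f"{len(transcript.splitlines())} turns so far"


summaries = run_debate(
    "Is the following response harmless and helpful?",
    stub_pro, stub_con, stub_summarizer, rounds=2,
)
print(summaries)
```

In this sketch, the `rounds` argument and the choice of callable passed for each role correspond loosely to the two scaling axes the caption mentions: the length of the trace and the capacity or role of each component.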