Improving LLM-as-a-Judge Inference with the Judgment Distribution

Victor Wang, Michael J. Q. Zhang, Eunsol Choi

TL;DR

This work demonstrates that leveraging the full distribution of LLM judgments, rather than relying on a greedy token, yields superior performance across pointwise, pairwise, and listwise evaluation settings. It shows that mean-based inferences consistently outperform mode-based (greedy) approaches, and that risk-aware variants can further improve alignment with human preferences. The study also reveals that chain-of-thought prompting often sharpens distributions and can harm calibration, especially for smaller models, and provides concrete recommendations for selecting distributional inference methods and prompting strategies. Overall, the paper argues for adopting distributional outputs in LLM-based evaluation to achieve more accurate and calibrated judgments across diverse tasks and model sizes.

Abstract

Using language models to scalably approximate human preferences on text quality (LLM-as-a-judge) has become a standard practice applicable to many tasks. A judgment is often extracted from the judge's textual output alone, typically with greedy decoding. However, LLM judges naturally provide distributions over judgment tokens, inviting a breadth of inference methods for extracting fine-grained preferences. We find that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e. greedy decoding) in all evaluation settings (i.e. pointwise, pairwise, and listwise). We further explore novel methods of deriving preferences from judgment distributions, and find that methods incorporating risk aversion often improve performance. Lastly, we analyze LLM-as-a-judge paired with chain-of-thought (CoT) prompting, showing that CoT can collapse the spread of the judgment distribution, often harming performance. Our findings show that leveraging distributional output improves LLM-as-a-judge, as opposed to using the text interface alone.
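To make the mean-versus-mode distinction concrete, here is a minimal sketch for the pointwise setting. It assumes a 1 to 5 score scale and two made-up score distributions; the helper names (`softmax`, `mode_score`, `mean_score`, `compare`) and all numbers are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits):
    """Turn judge logits over score tokens into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mode_score(scores, probs):
    """Score with the highest probability (what greedy decoding would emit)."""
    return max(zip(scores, probs), key=lambda sp: sp[1])[0]

def mean_score(scores, probs):
    """Expected score under the judgment distribution."""
    return sum(s * p for s, p in zip(scores, probs))

def compare(r1, r2):
    """Return 1 if r1 > r2, -1 if r1 < r2, and 0 on a tie."""
    return (r1 > r2) - (r1 < r2)

# Hypothetical score scale and judgment distributions for two responses.
scores = [1, 2, 3, 4, 5]
probs_a = [0.05, 0.10, 0.40, 0.35, 0.10]
probs_b = [0.30, 0.15, 0.35, 0.15, 0.05]

# Both distributions peak at 3, so the modes tie.
mode_pref = compare(mode_score(scores, probs_a), mode_score(scores, probs_b))
# The means differ, so the mean separates the two responses.
mean_pref = compare(mean_score(scores, probs_a), mean_score(scores, probs_b))

print(mode_pref, mean_pref)
```

In this toy case the mode reports a tie while the mean prefers the first response, which illustrates how the mean can recover a finer-grained preference from the same judge output.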

Paper Structure

This paper contains 81 sections, 1 theorem, 6 equations, 2 figures, 28 tables.

Key Result

Proposition 1

We analyze the discrete methods in the pointwise-methods table (tab:pointwise-methods). Specifically, we examine the score function $r$ rather than ${\rm sgn}(r_1-r_2)$. Let $X$ be a random variable with support $S \subset [\frac{1}{2}, K + \frac{1}{2})$ for an integer $K$. Define its discretization $\hat{X}$ by $P(\hat{X}

Figures (2)

  • Figure 1: Pointwise LLM judge's logits produce a score distribution. We show two ways to compare two score distributions: (1) comparing the modes of the distributions and (2) comparing the means of the distributions.
  • Figure 2: Comparing pairwise LLM-as-a-judge prediction based on when to aggregate the two judgments, one from each response pair presentation order. Pre- vs. post-aggregation (bottom vs. top in figure) can be likened to mean vs. mode, as the former aggregates at the distribution level while the latter aggregates at the text level (if mode is used).
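The pre- versus post-aggregation contrast in Figure 2 can be sketched as follows. This is a simplified illustration, assuming each presentation order yields a single probability that response A is preferred; the function names and the example probabilities are hypothetical and are not the paper's implementation.

```python
def sign(x):
    """Return 1 for positive, -1 for negative, and 0 for zero."""
    return (x > 0) - (x < 0)

def pre_aggregate(p_a_order1, p_a_order2):
    """Average the two probabilities first, then make one decision."""
    p_a = (p_a_order1 + p_a_order2) / 2
    return sign(p_a - 0.5)

def post_aggregate(p_a_order1, p_a_order2):
    """Make a hard decision per order, then combine the two decisions."""
    votes = sign(p_a_order1 - 0.5) + sign(p_a_order2 - 0.5)
    return sign(votes)

# Hypothetical judge outputs: P(A preferred) when A is shown first,
# and P(A preferred) when A is shown second.
p_first, p_second = 0.9, 0.45

pre = pre_aggregate(p_first, p_second)    # aggregates at the distribution level
post = post_aggregate(p_first, p_second)  # aggregates at the decision level

print(pre, post)
```

Here the two orders disagree after thresholding, so post-aggregation returns a tie, while pre-aggregation keeps the confidence of each judgment and still prefers A.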

Theorems & Definitions (3)

  • Proposition 1
  • proof
  • Remark