Improving LLM-as-a-Judge Inference with the Judgment Distribution
Victor Wang, Michael J. Q. Zhang, Eunsol Choi
TL;DR
This work demonstrates that using the full distribution over an LLM judge's judgment tokens, rather than only the greedily decoded token, yields better performance across pointwise, pairwise, and listwise evaluation settings. Mean-based inference consistently outperforms mode-based (greedy) inference, and risk-aware variants can further improve alignment with human preferences. The study also finds that chain-of-thought prompting often sharpens the judgment distribution and can harm calibration, especially for smaller models, and it gives concrete recommendations for choosing distributional inference methods and prompting strategies. Overall, the paper argues for adopting distributional outputs in LLM-based evaluation to obtain more accurate and better-calibrated judgments across tasks and model sizes.
Abstract
Using language models to scalably approximate human preferences on text quality (LLM-as-a-judge) has become a standard practice applicable to many tasks. A judgment is often extracted from the judge's textual output alone, typically with greedy decoding. However, LLM judges naturally provide distributions over judgment tokens, inviting a breadth of inference methods for extracting fine-grained preferences. We find that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e., greedy decoding) in all evaluation settings: pointwise, pairwise, and listwise. We further explore novel methods of deriving preferences from judgment distributions, and find that methods incorporating risk aversion often improve performance. Lastly, we analyze LLM-as-a-judge paired with chain-of-thought (CoT) prompting, showing that CoT can collapse the spread of the judgment distribution, often harming performance. Our findings show that leveraging distributional output improves LLM-as-a-judge, as opposed to using the text interface alone.
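To make the mean-versus-mode distinction concrete, here is a minimal Python sketch. It assumes the judge exposes log-probabilities for the score tokens 1 to 5; the numbers, the function names, and the `mean - lam * std` risk-averse score are illustrative assumptions, not the paper's exact methods or data. In this toy case both responses share the same greedy score (3), so the mode cannot separate them, while the means (about 3.38 versus 2.62) can.

```python
import math


def judgment_distribution(logprobs):
    """Normalize log-probabilities over score tokens into probabilities."""
    peak = max(logprobs.values())  # subtract the max for numerical stability
    weights = {score: math.exp(lp - peak) for score, lp in logprobs.items()}
    total = sum(weights.values())
    return {score: w / total for score, w in weights.items()}


def mode_score(dist):
    """Score a greedy decoder would emit: the most probable score token."""
    return max(dist, key=dist.get)


def mean_score(dist):
    """Expected score under the judgment distribution."""
    return sum(score * p for score, p in dist.items())


def std_score(dist):
    """Spread (standard deviation) of the judgment distribution."""
    mu = mean_score(dist)
    return math.sqrt(sum(p * (score - mu) ** 2 for score, p in dist.items()))


def risk_averse_score(dist, lam=0.5):
    """Illustrative risk-averse score (an assumption): penalize the mean by the spread."""
    return mean_score(dist) - lam * std_score(dist)


# Hypothetical judge log-probabilities for two responses on a 1-5 scale.
response_a = judgment_distribution(
    {s: math.log(p) for s, p in {1: 0.02, 2: 0.08, 3: 0.50, 4: 0.30, 5: 0.10}.items()}
)
response_b = judgment_distribution(
    {s: math.log(p) for s, p in {1: 0.10, 2: 0.30, 3: 0.50, 4: 0.08, 5: 0.02}.items()}
)

for name, dist in [("A", response_a), ("B", response_b)]:
    print(
        name,
        "mode:", mode_score(dist),
        "mean:", round(mean_score(dist), 2),
        "risk-averse:", round(risk_averse_score(dist), 2),
    )
```

A pairwise preference can then be read off by comparing the two means (or the two risk-averse scores) instead of the two greedy scores, which tie in this example.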
