An Empirical Analysis of Uncertainty in Large Language Model Evaluations

Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang

TL;DR

The paper systematically analyzes uncertainty in model-based LLM evaluations, showing that evaluation stability varies with model family and data distribution. It demonstrates that prompting strategies, especially chain-of-thought, can mitigate uncertainty, and introduces ConfiLM, an uncertainty-aware evaluator fine-tuned on human-annotated data to improve out-of-distribution (OOD) judgments. On a manually designed OOD test set sourced from the 2024 Olympics, ConfiLM substantially improves evaluation performance, highlighting the practical value of incorporating uncertainty into evaluation pipelines. The work also releases its benchmarking results, datasets, and code to support reproducibility and further research into the stability of LLM-based evaluation.

Abstract

As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We find that LLM evaluators exhibit varying uncertainty depending on model family and size. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance an LLM's reliability and detection capability on out-of-distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios. The code and data are released at: https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty.

Paper Structure

This paper contains 23 sections, 2 equations, 13 figures, 33 tables.

Figures (13)

  • Figure 1: An example of uncertainty (i.e., model confidence) in model-based LLM evaluation. The evaluation process is influenced by the uncertainty of both the evaluator and the candidate model.
  • Figure 2: We conduct extensive experiments and analysis to investigate the existence, mitigation and utilization of uncertainty in model-based LLM evaluation. Uncertainty plays a key role in the evaluation process and can be leveraged to enhance the evaluator's performance in OOD scenarios.
  • Figure 3: Uncertainty analysis of single-answer grading under special prompting strategies on MT-Bench (first row) and PandaLM Test set (second row). We evaluate Llama2-7B-Instruct with the default prompt, chain-of-thought, and self-generated reference strategies. See the appendix for full results.
  • Figure 4: Uncertainty analysis of pairwise comparison under special prompting strategies on MT-Bench (first row) and PandaLM Test set (second row). "Win Rate" represents the proportion of non-tie cases where Llama2-7B-Instruct's response is better than Llama2-13B-Instruct's response. "Tie Rate" represents the proportion of tie cases.
  • Figure 5: Categories of test instances.
  • ...and 8 more figures
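
The "Win Rate" and "Tie Rate" definitions in the Figure 4 caption can be illustrated with a minimal sketch. The verdict labels ("A", "B", "tie") and the function name `win_and_tie_rates` are assumptions made for illustration only and are not taken from the paper's released code. Here "A" stands for a verdict favoring the first model (Llama2-7B-Instruct in Figure 4) and "B" for the second.

```python
from collections import Counter


def win_and_tie_rates(verdicts):
    """Return (win_rate, tie_rate) for a list of pairwise verdicts.

    Each verdict is "A" (first model's response is better), "B" (second
    model's response is better), or "tie". These labels are hypothetical.
    """
    allowed = {"A", "B", "tie"}
    unknown = set(verdicts) - allowed
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")

    counts = Counter(verdicts)
    total = len(verdicts)
    ties = counts["tie"]
    non_ties = total - ties

    # Tie rate: proportion of tie cases among all comparisons.
    tie_rate = ties / total if total else 0.0
    # Win rate: proportion of non-tie cases in which the first model wins.
    win_rate = counts["A"] / non_ties if non_ties else 0.0
    return win_rate, tie_rate


# Made-up verdicts, purely to exercise the function.
verdicts = ["A", "B", "tie", "A", "tie", "A", "B", "A"]
win_rate, tie_rate = win_and_tie_rates(verdicts)
print(f"win rate = {win_rate:.3f}, tie rate = {tie_rate:.3f}")
```

Note that the two rates use different denominators: the win rate excludes ties, so a high tie rate means the win rate is computed over only a few comparisons.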