Black-box Uncertainty Quantification Method for LLM-as-a-Judge

Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martín Santillán Cooper, James M. Johnson, Werner Geyer

TL;DR

This paper introduces a black-box method for quantifying the uncertainty of LLM-as-a-Judge evaluations, and its findings suggest that the method can significantly improve the reliability and consistency of those evaluations.

Abstract

LLM-as-a-Judge is a widely used method for evaluating the performance of Large Language Models (LLMs) across various tasks. We address the challenge of quantifying the uncertainty of LLM-as-a-Judge evaluations. While uncertainty quantification has been well-studied in other domains, applying it effectively to LLMs poses unique challenges due to their complex decision-making capabilities and computational demands. In this paper, we introduce a novel method for quantifying uncertainty designed to enhance the trustworthiness of LLM-as-a-Judge evaluations. The method quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings. By cross-evaluating these relationships and constructing a confusion matrix based on token probabilities, the method derives labels of high or low uncertainty. We evaluate our method across multiple benchmarks, demonstrating a strong correlation between the accuracy of LLM evaluations and the derived uncertainty scores. Our findings suggest that this method can significantly improve the reliability and consistency of LLM-as-a-Judge evaluations.

Paper Structure

This paper contains 22 sections, 4 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: A biased assessment prompt. The LLM is prompted to assess a response (an input text that is under evaluation) under the assumption that a particular output option (label) is correct. By producing biased assessments, it is possible to determine the LLM's belief in a correct output option subject to assessments that may be contrary to this belief.
  • Figure 2: Method Overview. The method is divided into four stages, resulting in an uncertainty label. The LLM is first presented with an evaluation task and prompted to produce an assessment for each output option, biased by the explicit indication that the option is correct. In the context of the original evaluation task, the LLM is then conditioned on each of the biased assessments, and the probability of each option is calculated using log probabilities. This information is encoded in a confusion matrix. Each row of the matrix, representing the probability of a particular option conditioned on each of the biased assessments, is then averaged to produce an uncertainty label. In this figure, $\alpha$ represents the threshold.
  • Figure 3: Persuasion prompt generating an assessment for each option.
  • Figure 4: Structure of the confusion matrix. Each row represents an option and each column corresponds to an assessment, with the matrix values being the token probabilities for each option-assessment combination.
  • Figure 5: Confusion prompt forcing a final answer for each option and each assessment from the previous step. This leads to $n^2$ prompts, which are used to obtain token log probabilities for each option-assessment combination.
  • ...and 6 more figures
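
The matrix construction and row averaging described in the captions for Figures 2, 4, and 5 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the input layout (`logprobs[i][j]` holding the log probability of option `i` given the biased assessment for option `j`), and the sample values are hypothetical. The captions do not state the exact rule that turns row averages into a label, so the sketch assumes that uncertainty is "low" when the highest row mean reaches the threshold $\alpha$ and "high" otherwise.

```python
import math


def build_confusion_matrix(logprobs):
    """Convert an n x n grid of token log probabilities into probabilities.

    Rows are options, columns are biased assessments (layout from Figure 4).
    """
    n = len(logprobs)
    if any(len(row) != n for row in logprobs):
        raise ValueError("expected an n x n grid of log probabilities")
    return [[math.exp(lp) for lp in row] for row in logprobs]


def row_means(matrix):
    """Average each option's probability over all biased assessments."""
    return [sum(row) / len(row) for row in matrix]


def uncertainty_label(matrix, alpha):
    """Return (index of the preferred option, "low" or "high").

    Assumed rule, not taken from the source: the option with the highest
    row mean is preferred, and uncertainty is "low" only if that mean
    is at least alpha.
    """
    means = row_means(matrix)
    best = max(range(len(means)), key=means.__getitem__)
    label = "low" if means[best] >= alpha else "high"
    return best, label


# Hypothetical two-option example: option 0 stays likely even when the
# model is conditioned on the assessment that argues for option 1.
logprobs = [
    [math.log(0.9), math.log(0.8)],
    [math.log(0.1), math.log(0.2)],
]
matrix = build_confusion_matrix(logprobs)
option, label = uncertainty_label(matrix, alpha=0.7)
print(option, label)
```

With $n$ options, filling the grid takes one confusion prompt per option-assessment pair, which matches the $n^2$ prompts mentioned in the Figure 5 caption.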