Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness

Jiuhai Chen, Jonas Mueller

TL;DR

Black-box LLMs can hallucinate, which makes their outputs risky to rely on in high-stakes tasks. The authors present BSDetector, an uncertainty estimator that attaches a numeric confidence score to any LLM output by combining two signals: Observed Consistency, obtained through diverse sampling, and Self-reflection Certainty, obtained through the LLM's intrinsic evaluation of its own answer. On multiple QA benchmarks, BSDetector outperforms baseline uncertainty estimators, and it can improve the LLM's own answers by selecting the most confident of several sampled responses. The authors further show that confidence scores make automated LLM-based evaluation more reliable, either by routing low-confidence evaluations to a human in the loop or by excluding them. The approach is practical for API-only LLM usage and is aimed at safer, more trustworthy AI systems.

Abstract

We introduce BSDetector, a method for detecting bad and speculative answers from a pretrained Large Language Model by estimating a numeric confidence score for any output it generates. Our uncertainty quantification technique works for any LLM accessible only via a black-box API, whose training data remains unknown. By expending a bit of extra computation, users of any LLM API can now get the same response as they would ordinarily, as well as a confidence estimate that cautions when not to trust this response. Experiments on both closed and open-form Question-Answer benchmarks reveal that BSDetector more accurately identifies incorrect LLM responses than alternative uncertainty estimation procedures (for both GPT-3 and ChatGPT). By sampling multiple responses from the LLM and considering the one with the highest confidence score, we can additionally obtain more accurate responses from the same LLM, without any extra training steps. In applications involving automated evaluation with LLMs, accounting for our confidence scores leads to more reliable evaluation in both human-in-the-loop and fully-automated settings (across both GPT-3.5 and GPT-4).
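
The idea of scoring each response and keeping the most confident one can be illustrated with a minimal sketch. This is not the authors' implementation: the exact-match agreement measure, the equal `weight` between the two signals, and the function names `observed_consistency`, `confidence`, and `select_most_confident` are all assumptions made for illustration, and the self-reflection score is supplied by a caller-provided function rather than by a real LLM call.

```python
from typing import Callable, List, Tuple


def observed_consistency(answer: str, samples: List[str]) -> float:
    """Fraction of sampled answers that match `answer` after normalization.

    Assumption: exact string match stands in for semantic agreement.
    """
    if not samples:
        return 0.0
    norm = answer.strip().lower()
    matches = sum(1 for s in samples if s.strip().lower() == norm)
    return matches / len(samples)


def confidence(answer: str, samples: List[str],
               self_reflection: float, weight: float = 0.5) -> float:
    """Combine the two signals into one score in [0, 1].

    Assumption: a simple weighted average; `weight` is a hypothetical knob.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    consistency = observed_consistency(answer, samples)
    return weight * consistency + (1.0 - weight) * self_reflection


def select_most_confident(candidates: List[str], samples: List[str],
                          reflect: Callable[[str], float],
                          weight: float = 0.5) -> Tuple[float, str]:
    """Score every candidate and return (score, answer) for the best one."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    scored = [(confidence(c, samples, reflect(c), weight), c)
              for c in candidates]
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    sampled = ["42", "42", "41"]
    # Hypothetical self-reflection scores; a real system would ask the LLM.
    reflect = lambda a: 1.0 if a == "42" else 0.0
    print(select_most_confident(["41", "42"], sampled, reflect))
```

In practice the sampled answers would come from repeated calls to the same LLM API, which is where the "bit of extra computation" mentioned in the abstract is spent.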

Paper Structure (29 sections, 3 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Overview of our LLM uncertainty quantification technique.
  • Figure 2: ChatGPT is used to generate answers to the arithmetic problem "A tower is ..." with temperature sampling $T=1.0$. BSDetector is then used to select the most confident of the three candidate answers.
  • Figure 3: Confusion matrix comparing automated GPT-4 evaluations vs. human evaluations.
  • Figure 4: Human-in-the-loop LLM-based evaluation, with the number of answers evaluated by humans varied along the x-axis (remaining answers are auto-evaluated by GPT-4). The resulting accuracy/MSE of the combined set of human + GPT-4 evaluations is shown along the y-axis, under confidence-based vs. random selection of the subset of answers that receives human evaluation.
  • Figure 5: Fully-automated GPT-4 based evaluation, assessing the accuracy/MSE over many replicate datasets (observed counts amongst replicates on y-axis). By discarding the bottom 20% of evaluations with the lowest confidence, the average GPT-4 evaluation score consistently reaches an accuracy of 1.0 on TriviaQA, indicating completely trustworthy LLM-based evaluations (and the MSE of the average GPT-4 score consistently improves compared to the full dataset or discarding a random 20%).
  • ...and 1 more figure
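
The two evaluation settings described in the captions of Figures 4 and 5 both reduce to ranking evaluations by confidence. The sketch below is one plausible reading of those captions, not the authors' code: the function names `split_for_human_review` and `drop_least_confident`, the `budget` parameter, and the example scores are hypothetical.

```python
from typing import List, Tuple


def _indices_by_confidence(confidences: List[float]) -> List[int]:
    """Indices of the evaluations, ordered from least to most confident."""
    return sorted(range(len(confidences)), key=lambda i: confidences[i])


def split_for_human_review(confidences: List[float],
                           budget: int) -> Tuple[List[int], List[int]]:
    """Human-in-the-loop setting: send the `budget` least confident
    evaluations to human reviewers and keep the rest automated.

    Returns (human_indices, automated_indices), each sorted ascending.
    """
    order = _indices_by_confidence(confidences)
    budget = max(0, min(budget, len(order)))
    return sorted(order[:budget]), sorted(order[budget:])


def drop_least_confident(confidences: List[float],
                         fraction: float = 0.2) -> List[int]:
    """Fully-automated setting: discard the given fraction of evaluations
    with the lowest confidence and return the indices that are kept.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    order = _indices_by_confidence(confidences)
    n_drop = int(len(order) * fraction)
    return sorted(order[n_drop:])


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.7, 0.3]  # made-up confidence scores
    print(split_for_human_review(scores, budget=2))
    print(drop_least_confident(scores, fraction=0.2))
```

The random-selection baseline mentioned in the Figure 4 caption would replace the confidence ordering with a shuffled one, which is what the confidence-based curves are compared against.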