
Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation

Vaishnavi Shrivastava, Percy Liang, Ananya Kumar

TL;DR

This work tackles confidence estimation for state-of-the-art LLMs that do not expose internal probabilities by comparing linguistic confidences with surrogate-model confidences and proposing mixtures to combine signals. It shows that surrogate confidences, even from weaker models, can outperform linguistic confidences, and that combining signals yields state-of-the-art selective classification performance across 12 datasets, achieving an average AUC of 84.6% when including self-consistency cues. The findings highlight the practical value of transferring probability information from white-box surrogates to black-box LLMs to enhance user trust and enable abstention. The work also clarifies why signals transfer between models and demonstrates that complementary confidence sources can yield robust uncertainty estimates.

Abstract

To maintain user trust, large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. The standard approach of estimating confidence is to use the softmax probabilities of these models, but as of November 2023, state-of-the-art LLMs such as GPT-4 and Claude-v1.3 do not provide access to these probabilities. We first study eliciting confidence linguistically -- asking an LLM for its confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. We then explore using a surrogate confidence model -- using a model where we do have probabilities to evaluate the original model's confidence in a given question. Surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on GPT-4).
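The abstract describes composing linguistic confidences with surrogate model probabilities but does not give the combination rule here. The following is a minimal sketch assuming a simple linear interpolation between the two signals. The function name `mixture_confidence` and the weight `alpha` are illustrative assumptions, not names taken from the paper.

```python
from typing import List, Sequence


def mixture_confidence(
    linguistic: Sequence[float],
    surrogate: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Blend two per-example confidence signals (hypothetical helper).

    linguistic: confidences the main model states about its own answers, in [0, 1].
    surrogate:  probabilities a white-box surrogate model assigns, in [0, 1].
    alpha:      weight on the surrogate signal (assumed interpolation weight).
    """
    if len(linguistic) != len(surrogate):
        raise ValueError("both signals must cover the same examples")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Assumed rule: convex combination of the two confidence scores.
    return [(1.0 - alpha) * l + alpha * s for l, s in zip(linguistic, surrogate)]


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    print(mixture_confidence([0.8, 0.9], [0.4, 0.7], alpha=0.5))
```

The answers still come from the main model; only the confidence score used to rank or abstain on examples changes. In practice `alpha` would need to be tuned on held-out data.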

Paper Structure

This paper contains 21 sections, 3 equations, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: Our goal is to provide good confidence estimates for state-of-the-art LLMs like GPT-4 and Claude-v1.3, which currently do not give access to their internal probabilities. One natural approach (GPT-4 Linguistic) is to prompt the model for its confidence. Interestingly, we find that taking the answer from GPT-4 but the internal probability from a different surrogate model (e.g., an open model such as Llama 2) gives even better results (0.82 AUC). Mixing GPT-4's linguistic confidences with the surrogate model probabilities gives further gains (0.83 AUC). Our AUC numbers are better than those of concurrent work (Xiong et al., 2023), but combining these approaches leads to the best results (Mixture++; 0.85 AUC). Our findings also hold for Claude-v1.3 and GPT-3.5 (see the sections on surrogate confidence models and mixtures of confidence models).
  • Figure 2: Linguistic Confidence Prompt. Instruction for the best linguistic confidence prompt (see the exact prompt in the appendix on the linguistic confidence prompt).
  • Figure 3: AUCs for Different Surrogate Models. We plot the AUC as we vary the main model (on the $x$-axis) and the surrogate model (on the $y$-axis). Using surrogate model probabilities as confidence estimates improves AUCs for all models over their own linguistic confidences: the bottom 4 rows (surrogate probabilities) are darker than the top 6 rows (linguistic confidences). Even model probabilities from a smaller Llama 2 13B model lead to comparable or better AUCs for all models.
  • Figure 4: Selective Accuracy vs. Coverage for GPT-4. Our surrogate and mixture methods have a higher area under the selective accuracy vs coverage curve (AUC) than the linguistic confidence and random confidence baselines. We plot the coverage $c$ on the $x$-axis and the selective accuracy (accuracy on the top $c$ fraction of examples) on the $y$-axis, for two representative tasks. Notice that the mixture (green solid) and surrogate (purple dashed) lines are above the linguistic confidence (blue dashed/dotted) and random guessing baseline (black dotted).
  • Figure 5: Embeddings of Incorrect Questions for GPT-4 and Surrogate Models. Plots of the embeddings of questions that GPT-4 and two surrogate models (Llama 2 70B and Llama 2 13B) answer incorrectly on two representative datasets, TruthfulQA and College Chemistry. Questions only GPT-4 answers incorrectly are in blue, questions both GPT-4 and the surrogate answer incorrectly are in black, and questions only the surrogate answers incorrectly are in green. Compared with Llama 2 13B, there are more questions that both GPT-4 and Llama 2 70B answer incorrectly, and their incorrect questions are more semantically similar. This indicates that Llama 2 70B and GPT-4 struggle with semantically related concepts and that the 70B model may more closely estimate GPT-4's uncertainty than the 13B model.
  • ...and 10 more figures
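
The Figure 4 caption defines the metric used throughout: coverage $c$ on the $x$-axis and selective accuracy (accuracy on the top $c$ fraction of examples, ranked by confidence) on the $y$-axis, with AUC as the area under that curve. The sketch below computes this quantity under one assumed discretization, averaging selective accuracy over the coverage levels 1/n, 2/n, ..., 1. The function names are illustrative, and the paper may handle ties or integration differently.

```python
from typing import List, Sequence, Tuple


def selective_accuracy_curve(
    confidences: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, selective accuracy) pairs, most confident examples first."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally sized, non-empty inputs")
    # Rank examples from most to least confident (ties keep input order).
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    curve = []
    num_correct = 0
    for k, i in enumerate(order, start=1):
        num_correct += int(correct[i])
        # Coverage k/n: accuracy on the k most confident examples.
        curve.append((k / n, num_correct / k))
    return curve


def selective_auc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Approximate area under the curve as the mean selective accuracy
    over the n coverage levels (an assumed discretization)."""
    curve = selective_accuracy_curve(confidences, correct)
    return sum(acc for _, acc in curve) / len(curve)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    good = selective_auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
    bad = selective_auc([0.1, 0.3, 0.8, 0.9], [True, True, False, False])
    print(good, bad)
```

A confidence signal that ranks correct answers above incorrect ones yields a higher value than one that ranks them below, which is why the surrogate and mixture curves sitting above the linguistic and random baselines corresponds to higher AUC.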