Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation
Vaishnavi Shrivastava, Percy Liang, Ananya Kumar
TL;DR
This work tackles confidence estimation for state-of-the-art LLMs that do not expose internal probabilities. It compares linguistic confidences (asking the model how confident it is in its answer) with surrogate-model confidences, and proposes mixtures that combine the two signals. Surrogate confidences, even from weaker models, can outperform linguistic confidences, and combining the signals yields state-of-the-art selective classification performance across 12 datasets, with an average AUC of 84.6% on GPT-4. The findings highlight the practical value of transferring probability information from white-box surrogates to black-box LLMs to support user trust and enable abstention. The work also clarifies why signals transfer between models and demonstrates that complementary confidence sources can yield robust uncertainty estimates.
Abstract
To maintain user trust, large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. The standard approach to estimating confidence is to use the softmax probabilities of these models, but as of November 2023, state-of-the-art LLMs such as GPT-4 and Claude-v1.3 do not provide access to these probabilities. We first study eliciting confidence linguistically -- asking an LLM for its confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. We then explore using a surrogate confidence model -- a model for which we do have probabilities -- to evaluate the original model's confidence in a given question. Surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best method, which composes linguistic confidences and surrogate model probabilities, gives state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on GPT-4).
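To make the idea of composing the two confidence signals concrete, here is a minimal sketch. It is not the authors' implementation: the function names `mix_confidences` and `selective_auc`, the linear mixing weight `alpha`, and the toy numbers are all illustrative assumptions. AUC is taken here to mean the area under the accuracy-versus-coverage curve used in selective classification, approximated by averaging the accuracy over every coverage level.

```python
from typing import List, Sequence


def mix_confidences(
    linguistic: Sequence[float],
    surrogate: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Linearly interpolate two confidence signals (alpha is a hypothetical weight)."""
    return [(1 - alpha) * l + alpha * s for l, s in zip(linguistic, surrogate)]


def selective_auc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Average accuracy over all coverage levels.

    Examples are ranked from most to least confident; at coverage k/n the
    model answers only its k most confident examples and abstains on the rest.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    running_correct = 0
    accuracies = []
    for k, i in enumerate(order, start=1):
        running_correct += correct[i]
        accuracies.append(running_correct / k)
    return sum(accuracies) / len(accuracies)


# Toy data (made up): the black-box model always says "90% confident",
# while a surrogate model's probabilities vary with the question.
linguistic = [0.9, 0.9, 0.9, 0.9]
surrogate = [0.8, 0.3, 0.7, 0.2]
correct = [1, 0, 1, 0]  # 1 if the black-box model's answer was right

mixed = mix_confidences(linguistic, surrogate, alpha=0.5)
print(selective_auc(linguistic, correct))
print(selective_auc(mixed, correct))
```

In this toy setup the linguistic confidences are all tied, so they cannot rank correct answers above incorrect ones, whereas the mixture inherits a usable ranking from the surrogate. In practice the mixing weight would need to be tuned on held-out data.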
