Entropy Alone is Insufficient for Safe Selective Prediction in LLMs

Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, David A. Clifton

Abstract

Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk–coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.

Paper Structure (31 sections, 2 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Performance of the PC probe and combined method relative to entropy-only baselines, averaged across all four evaluated models. Positive values denote an improvement in the target metric (i.e., higher AUROC/AUPRC or lower E-AURC/TCE). Abbreviations: NLL (Sequence NLL), SE (Semantic Entropy), SEP (Semantic Entropy Probe). The combined method consistently outperforms both the entropy baselines and the standalone PC probe across most configurations, with the notable exception of SE on the MedicalQA dataset.
  • Figure 2: SE score vs. PC probe logit for Qwen on TriviaQA. Many answers, both correct and hallucinated, collapse to $\mathrm{SE}=0$, yet receive distinct PC probe scores.
  • Figure 3: Ministral 8B selective prediction on BioASQ. Left: risk–coverage curves; the shaded region marks a high-trust regime ($\alpha \leq 0.15$). Entropy-based methods fail to enter this regime at non-trivial coverage. Right: realized hallucination rate vs. target $\alpha$; a perfect system lies on the diagonal. Entropy-based methods diverge sharply at strict targets; combined methods do not.
  • Figure A1: Risk–coverage curves for all four model families on TriviaQA. The shaded region marks the high-trust regime ($\alpha \leq 0.15$). The relative ranking of methods varies substantially across models, motivating model-aware evaluation of uncertainty quantification approaches.
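
The risk–coverage curves described in the captions above can be illustrated with a minimal sketch. It assumes each answer has a scalar uncertainty score (higher means less trusted) and a binary hallucination label. The function names `risk_coverage_curve` and `max_coverage_at_risk` are hypothetical and are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def risk_coverage_curve(uncertainty, is_error):
    """Return (coverage, risk) when answering the k most confident items, k = 1..n.

    Ties in the uncertainty score are broken by input order (stable sort),
    which is a simplification: many tied scores make the curve order-dependent.
    """
    order = np.argsort(np.asarray(uncertainty), kind="stable")
    errors = np.asarray(is_error, dtype=float)[order]
    counts = np.arange(1, len(errors) + 1)
    coverage = counts / len(errors)       # fraction of questions answered
    risk = np.cumsum(errors) / counts     # error rate among answered questions
    return coverage, risk


def max_coverage_at_risk(uncertainty, is_error, alpha):
    """Largest coverage whose selective risk stays at or below the target alpha."""
    coverage, risk = risk_coverage_curve(uncertainty, is_error)
    feasible = np.nonzero(risk <= alpha)[0]
    return float(coverage[feasible[-1]]) if feasible.size else 0.0


# Toy data (made up for illustration): two correct answers, then two errors.
scores = [0.1, 0.2, 0.3, 0.4]
errors = [0, 0, 1, 1]
coverage, risk = risk_coverage_curve(scores, errors)
best = max_coverage_at_risk(scores, errors, alpha=0.15)
```

In this sketch a method "fails to enter the high-trust regime" when `max_coverage_at_risk` returns 0.0 or a trivially small coverage at $\alpha = 0.15$. The tie-breaking caveat matters for scores that collapse to a single value, as Figure 2 describes for $\mathrm{SE}=0$.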