Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space

Ante Wang, Weizhi Ma, Yang Liu

TL;DR

Predicting a verbalized probability distribution is shown to encourage in-depth reasoning for confidence estimation. The approach shows an advantage across different models and tasks, regardless of whether the answer space is known.

Abstract

Knowing the reliability of a model's response is essential in practice. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining it with chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of relying on a single guess, and to carefully assign confidence scores that meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.
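To make the idea concrete, the following is a minimal sketch of how a verbalized probability distribution could be turned into an answer and a confidence score, and how that confidence could then be scored with the Brier score. It is an illustration only and not the authors' implementation: the function names, the dictionary input format, and the renormalization step are all assumptions introduced here, and the paper's actual prompt and parsing procedure are not shown in this summary.

```python
def normalize_distribution(verbalized):
    """Rescale non-negative verbalized scores so that they sum to 1.

    `verbalized` is a hypothetical, already-parsed mapping from candidate
    answers to the probabilities the model wrote down (an assumption).
    """
    if not verbalized:
        raise ValueError("the distribution has no candidates")
    if any(p < 0 for p in verbalized.values()):
        raise ValueError("probabilities must be non-negative")
    total = sum(verbalized.values())
    if total <= 0:
        raise ValueError("probabilities must not all be zero")
    return {answer: p / total for answer, p in verbalized.items()}


def answer_and_confidence(verbalized):
    """Return the most probable candidate and its normalized probability."""
    dist = normalize_distribution(verbalized)
    answer = max(dist, key=dist.get)
    return answer, dist[answer]


def brier_score(confidences, correct):
    """Mean squared gap between confidence and 0/1 correctness (lower is better)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


if __name__ == "__main__":
    # Hypothetical model output whose scores do not quite sum to 1.
    answer, confidence = answer_and_confidence({"A": 0.6, "B": 0.3, "C": 0.3})
    print(answer, round(confidence, 3))
    print(brier_score([confidence, 1.0], [True, True]))
```

Renormalizing is one simple way to enforce the "requirements of a distribution" after the fact; a stricter variant could instead reject any output whose scores do not already sum to 1.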

Paper Structure

This paper contains 27 sections, 6 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: An example for illustrating the difference in the three verbalization-based methods.
  • Figure 2: Calibration curves of Qwen3-4B-Instruct on MMLU-Pro when using Verbalized Confidence (Left), Verbalized Top-$k$ (Mid), and Verbalized Probability Distribution (Right).
  • Figure 3: Average token consumption of different verbalization-based approaches across the MedQA, MMLU-Pro, and MedXpertQA test sets when using Qwen3-4B-Instruct.
  • Figure 4: Brier scores of different methods combined with answer consistency on the MMLU-Pro test set.
  • Figure 5: Comparison of different methods on the MedQA test set across different training steps when using Qwen3-4B-Instruct during RL training.
  • ...and 7 more figures