Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?

Shiyu Ni, Keping Bi, Lulu Yu, Jiafeng Guo

TL;DR

This paper investigates how large language models perceive their own knowledge boundaries through probabilistic confidence (token-level likelihoods) and verbalized confidence (natural-language expressions). Using open-domain QA on Natural Questions and frequency-varied Parent/Child datasets across four models, it shows that probabilistic perception is typically more accurate but requires an in-domain validation set to tune the confidence threshold, while verbalized perception needs less setup but is less well calibrated. The study also finds that both perceptions improve on less frequent questions, with probabilistic confidence gaining a larger advantage in this regime, and that the correlation between probabilistic and verbalized confidence is positive but dataset- and model-dependent. These findings inform reliability and retrieval-augmentation strategies by indicating when to rely on probabilistic signals and when to rely on verbalized cues.

Abstract

Large language models (LLMs) have been found to hallucinate when a question exceeds their internal knowledge boundaries. A reliable model should have a clear perception of its knowledge boundaries, providing correct answers within its scope and refusing to answer when it lacks knowledge. Existing research on LLMs' perception of their knowledge boundaries typically uses either the probability of the generated tokens or the verbalized confidence as the model's confidence in its response. However, these studies overlook the differences and connections between the two. In this paper, we conduct a comprehensive analysis and comparison of LLMs' probabilistic perception and verbalized perception of their factual knowledge boundaries. First, we investigate the pros and cons of these two perceptions. Then, we study how they change under questions of varying frequencies. Finally, we measure the correlation between LLMs' probabilistic confidence and verbalized confidence. Experimental results show that: 1) LLMs' probabilistic perception is generally more accurate than verbalized perception but requires an in-domain validation set to adjust the confidence threshold; 2) both perceptions perform better on less frequent questions; and 3) it is challenging for LLMs to accurately express their internal confidence in natural language.
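
To make the threshold adjustment mentioned above concrete, here is a minimal sketch of tuning a confidence threshold on a validation set. It rests on assumptions that are not taken from the paper: confidence is defined as the geometric mean of the generated tokens' probabilities, "alignment" is the fraction of examples where being confident (confidence at or above the threshold) coincides with being correct, and the function names and toy data are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of an answer (an assumed definition)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def alignment(confidences, correct, threshold):
    """Fraction of examples where 'confident' (conf >= threshold) matches 'correct'."""
    hits = sum((c >= threshold) == bool(y) for c, y in zip(confidences, correct))
    return hits / len(confidences)


def best_threshold(confidences, correct, candidates=None):
    """Return the first candidate threshold with the highest validation alignment."""
    if candidates is None:
        candidates = [i / 100 for i in range(101)]  # grid 0.00, 0.01, ..., 1.00
    return max(candidates, key=lambda t: alignment(confidences, correct, t))


# Toy validation data, made up for illustration (not results from the paper).
val_logprobs = [[-0.05, -0.10], [-0.9, -1.2], [-0.2, -0.3], [-2.0, -1.5]]
val_correct = [True, False, True, False]

val_conf = [sequence_confidence(lp) for lp in val_logprobs]
lam = best_threshold(val_conf, val_correct)
print(f"chosen threshold: {lam:.2f}")
print(f"validation alignment: {alignment(val_conf, val_correct, lam):.2f}")
```

Because the threshold is fitted to the validation examples, it may not transfer to questions from a different distribution, which is one way to read the abstract's requirement that the validation set be in-domain.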

Paper Structure (29 sections, 6 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: The best threshold $\lambda$ for GPT-Instruct and ChatGPT on each dataset.
  • Figure 2: Correlation between LLMs' probabilistic confidence and verbalized confidence. A higher uncertainty level means the model is less confident in its answer.
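
As an illustration of how the relationship described in Figure 2 could be quantified, the sketch below computes Spearman's rank correlation between probabilistic confidence and a verbalized uncertainty level. Spearman's coefficient is chosen here only as a common option and is not necessarily the measure used in the paper; the function names and the toy data are hypothetical. Since a higher uncertainty level means lower confidence, a negative coefficient here corresponds to a positive correlation between the two confidences.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, made up for illustration (not results from the paper).
prob_conf = [0.93, 0.35, 0.78, 0.17, 0.60]   # probabilistic confidence
verbal_uncertainty = [1, 4, 2, 5, 3]         # verbalized uncertainty level

rho = spearman(prob_conf, verbal_uncertainty)
print(f"Spearman rho: {rho:.2f}")
```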