Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge

Genglin Liu, Xingyao Wang, Lifan Yuan, Yangyi Chen, Hao Peng

TL;DR

UnknownBench examines how LLMs express uncertainty when facing questions outside their parametric knowledge. The authors construct three synthetic or perturbed tasks (NEC, FalseQA, RefuNQ) and a unified confidence elicitation framework to measure refusal, accuracy, and confidence (both verbalized and perceived) across 12 models. Key findings: even strong models such as GPT-4 often refuse only a minority of unanswerable questions; instruction tuning and RLHF improve refusal for open models; and the confidence a model verbalizes does not always match the confidence perceived from its outputs. The work highlights the need to align models' internal uncertainty with user-facing signals to support honest and helpful AI interactions.

Abstract

Can large language models (LLMs) express their uncertainty in situations where they lack sufficient parametric knowledge to generate reasonable responses? This work aims to systematically investigate LLMs' behaviors in such situations, emphasizing the trade-off between honesty and helpfulness. To tackle the challenge of precisely determining LLMs' knowledge gaps, we diagnostically create unanswerable questions containing non-existent concepts or false premises, ensuring that they are outside the LLMs' vast training data. By compiling a benchmark, UnknownBench, which consists of both unanswerable and answerable questions, we quantitatively evaluate the LLMs' performance in maintaining honesty while being helpful. Using a model-agnostic unified confidence elicitation approach, we observe that most LLMs fail to consistently refuse or express uncertainty towards questions outside their parametric knowledge, although instruction fine-tuning and alignment techniques can provide marginal enhancements. Moreover, LLMs' uncertainty expression does not always stay consistent with the perceived confidence of their textual outputs.

Paper Structure (32 sections, 4 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: An example in UnknownBench with possible LLM responses, along with the desirable LLM behaviors on answerable and unanswerable questions.
  • Figure 2: Refusal (on FalseQA, NEC) and Accuracy (on RefuNQ) at each verbalized confidence level (0 means least confident and 10 means most) for Claude-2. See the appendix (additional results) for other LLMs. We report the standard error for refusal rate and accuracy values in each bin defined by $\text{SEM} = \frac{p(1-p)}{\sqrt{n}}$, where $p$ is the proportion (accuracy or refusal) in the given bin and $n$ is the number of instances per bin.
  • Figure 3: Accuracy on RefuNQ-answerable at each confidence level perceived by GPT-4 from model outputs.
  • Figure 4: A flow chart illustrating our experiment setup. For open-source models, we have the logit-based elicitation method, and for proprietary models, we have a pre-response and a post-response verbalized method, which we could further categorize into classification or regression-based prompts.
  • Figure 5: Next-token entropy distribution by task for Llama-2 13B models. In the left subplots, the x-axis shows the next-token entropy after the LLM sees the entire input prompt, ranging from 0 to 8 and grouped into bins; the y-axis shows how often each entropy value occurs for the given task. The right subplots show the same models and tasks with verbalized confidence elicitation instead of the logit-based method. Blue bars give the frequencies for answerable input questions, and orange bars give those for unanswerable ones. Gray regions mark where the blue and orange bars overlap.
  • ...and 5 more figures
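
The per-bin statistics described in the Figure 2 caption (a refusal rate or accuracy at each verbalized confidence level, plus a standard error) can be sketched as below. This is a minimal illustration, not the authors' code: the function names, the `(confidence_level, outcome)` record format, and the toy data are all assumptions. The caption writes the error term as p(1-p)/sqrt(n), whereas the conventional standard error of a binomial proportion is sqrt(p(1-p)/n); the sketch computes both so the two can be compared.

```python
import math
from collections import defaultdict


def caption_sem(p, n):
    """Error term exactly as written in the Figure 2 caption: p(1-p)/sqrt(n)."""
    return p * (1 - p) / math.sqrt(n)


def binomial_sem(p, n):
    """Conventional standard error of a proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def rate_per_confidence_bin(records):
    """Group binary outcomes by verbalized confidence level.

    records: iterable of (confidence_level, outcome) pairs (hypothetical format),
             where outcome is 1 for a refusal (or a correct answer) and 0 otherwise.
    Returns {level: (p, n, caption_sem, binomial_sem)} with levels in sorted order.
    """
    bins = defaultdict(list)
    for level, outcome in records:
        bins[level].append(outcome)

    stats = {}
    for level in sorted(bins):
        outcomes = bins[level]
        n = len(outcomes)
        p = sum(outcomes) / n  # proportion of refusals (or correct answers) in this bin
        stats[level] = (p, n, caption_sem(p, n), binomial_sem(p, n))
    return stats


# Toy data, invented purely for illustration (not taken from the paper).
toy_records = [(0, 1), (0, 1), (0, 0), (0, 1), (10, 0), (10, 0), (10, 1), (10, 0)]

for level, (p, n, s_cap, s_bin) in rate_per_confidence_bin(toy_records).items():
    print(f"confidence={level:2d}  n={n}  rate={p:.2f}  "
          f"caption_sem={s_cap:.3f}  binomial_sem={s_bin:.3f}")
```

For any bin with n > 1 and 0 < p < 1, the caption's expression is smaller than the conventional one, so error bars drawn from it would be narrower.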