Query-Level Uncertainty in Large Language Models

Lihu Chen, Gerard de Melo, Fabian M. Suchanek, Gaël Varoquaux

TL;DR

This paper defines query-level uncertainty to determine whether an LLM can answer a given query before generating any tokens, addressing efficiency and trustworthiness. It introduces Internal Confidence, a training-free method that aggregates self-evaluations across layers and tokens around a fixed decision center using Attenuated Encoding, and uses a simple Yes/No self-check to estimate $P(Yes)$. Across factual QA and mathematical reasoning tasks, Internal Confidence outperforms answer-level baselines in distinguishing known from unknown queries while offering orders-of-magnitude faster runtimes, enabling effective adaptive inference such as efficient RAG and model cascading. The approach provides a practical, model-agnostic signal for deciding when to invoke external retrieval or larger models, with a tunable locality parameter that balances cost and accuracy. Overall, Internal Confidence offers a strong, scalable baseline for identifying knowledge boundaries in LLMs and guiding resource-efficient computation.
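The aggregation step described above can be illustrated with a minimal sketch. It assumes a grid of per-layer, per-token $P(Yes)$ values is already available, that the decision center is the last layer and last token, and that weights decay exponentially with distance from that center. The names `decay_weights`, `internal_confidence`, and `locality`, as well as the exponential form, are illustrative assumptions; the paper's Attenuated Encoding may define the weights differently.

```python
import math


def decay_weights(n, center, locality):
    """Return n weights that decay with distance from `center` and sum to 1.

    Assumption: exponential decay exp(-|i - center| / locality). This is a
    stand-in for the paper's Attenuated Encoding, not its exact formula.
    """
    raw = [math.exp(-abs(i - center) / locality) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def internal_confidence(p_yes, locality=1.0):
    """Aggregate a layers x tokens grid of P(Yes) values into one score.

    p_yes[l][t] is a hypothetical P(Yes) estimate read out at layer l and
    token position t. The decision center is assumed to be the last layer
    and the last token.
    """
    n_layers, n_tokens = len(p_yes), len(p_yes[0])
    layer_w = decay_weights(n_layers, n_layers - 1, locality)
    token_w = decay_weights(n_tokens, n_tokens - 1, locality)
    # Weighted average: cells closer to the decision center count more.
    return sum(
        layer_w[l] * token_w[t] * p_yes[l][t]
        for l in range(n_layers)
        for t in range(n_tokens)
    )
```

In this sketch, a small `locality` concentrates nearly all weight on the decision center, while a large value approaches a plain average over all layers and tokens.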

Abstract

It is important for Large Language Models (LLMs) to be aware of the boundary of their knowledge, distinguishing queries they can confidently answer from those that lie beyond their capabilities. Such awareness enables models to perform adaptive inference, such as invoking retrieval-augmented generation (RAG), engaging in slow and deep thinking, or abstaining from answering when appropriate. These mechanisms are key to developing efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which estimates if a model is capable of answering a given query before generating any tokens, thus avoiding the generation cost. To this end, we propose a novel, training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens to provide a reliable signal of uncertainty. Empirical studies on both factual question answering and mathematical reasoning tasks demonstrate that our Internal Confidence outperforms several baselines in quality of confidence while being computationally cheaper. Furthermore, we demonstrate its benefits in adaptive inference settings, showing that for RAG and model cascading it reduces inference costs while preserving overall performance.
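As a rough illustration of how a query-level confidence score could drive adaptive inference, the sketch below routes a query before any answer tokens are generated. The function name `route_query`, the action labels, and both threshold values are hypothetical; in practice the thresholds would be tuned on a validation set.

```python
def route_query(confidence, answer_threshold=0.7, abstain_threshold=0.2):
    """Pick an inference strategy from a query-level confidence in [0, 1].

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= answer_threshold:
        # The model likely knows the answer: skip retrieval and larger models.
        return "answer_directly"
    if confidence >= abstain_threshold:
        # Uncertain: spend extra compute, e.g. RAG or a larger model.
        return "retrieve_or_escalate"
    # Very low confidence: abstaining may be the safest option.
    return "abstain"
```

Only queries routed to `retrieve_or_escalate` pay the extra cost of retrieval or a larger model, which is where the savings in RAG and model cascading would come from.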

Paper Structure

This paper contains 28 sections, 8 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Our Internal Confidence method improves performance / running time tradeoffs in factuality assessment and RAG settings.
  • Figure 2: Illustrating the difference between answer-level and query-level uncertainty. Query-level uncertainty estimation distinguishes known from unknown queries (knowledge boundary) before generating answers, which is useful for adaptive inference, e.g., efficient RAG, fast–slow reasoning, or cascading models with different abilities.
  • Figure 3: Left: the internal $\textrm{P}(\text{\scshape Yes})$ across tokens and layers. Middle: the AUC of $\textrm{P}(\text{\scshape Yes})$ across tokens and layers. Right: decay weights with different localities. Model: Llama-8B; Dataset: GSM8K validation set.
  • Figure 4: Acceleration ratio comparison between answer-level SAR and our Internal Confidence.
  • Figure 5: Impact of locality on validation set performance. We report the average AUC across the three considered datasets. See the appendix section on locality for details.
  • ...and 4 more figures