InternalInspector $I^2$: Robust Confidence Estimation in LLMs through Internal States

Mohammad Beigi; Ying Shen; Runing Yang; Zihao Lin; Qifan Wang; Ankith Mohan; Jianfeng He; Ming Jin; Chang-Tien Lu; Lifu Huang

TL;DR

This work introduces InternalInspector ($I^2$), a confidence-estimation framework that uses supervised contrastive learning over internal transformer states (attention, feed-forward, and activation states) from all layers to predict whether an LLM's output is correct. A theoretical bound, $I(C(Y \mid X); \Theta \mid X, Y) \geq \Delta - \epsilon$, links the internal representations to correctness. On factual QA, commonsense reasoning, reading comprehension, and hallucination-detection benchmarks, InternalInspector improves both accuracy and calibration over logit-based, self-evaluation, temperature-scaling, and last-hidden-state baselines, most clearly when it uses a CNN or Transformer encoder. Analysis indicates that middle-layer and feed-forward signals are especially informative for confidence estimation. The approach also detects hallucinations well on HaluEval and is reasonably robust to distribution shifts within and across task domains, which supports using internal states to identify and mitigate hallucinations in practice.

Abstract

Despite their vast capabilities, Large Language Models (LLMs) often struggle to generate reliable outputs, frequently producing high-confidence inaccuracies known as hallucinations. To address this challenge, we introduce InternalInspector, a novel framework that enhances confidence estimation in LLMs by applying contrastive learning to internal states, including the attention states, feed-forward states, and activation states of all layers. Unlike existing methods, which primarily focus on the final activation state, InternalInspector analyzes all internal states of every layer to distinguish correct from incorrect prediction processes. We benchmark InternalInspector against existing confidence estimation methods on a range of natural language understanding and generation tasks, including factual question answering, commonsense reasoning, and reading comprehension. InternalInspector achieves significantly higher accuracy in aligning the estimated confidence scores with the correctness of the LLM's predictions, as well as lower calibration error. Furthermore, InternalInspector excels at HaluEval, a hallucination detection benchmark, outperforming other internal-based confidence estimation methods on this task.
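
The abstract reports calibration error without saying how it is computed. As a minimal sketch, assuming the common expected calibration error (ECE) with equal-width confidence bins, the metric could be computed as below. The function name, the default of 10 bins, and the binning rule are illustrative assumptions, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (assumed definition, not confirmed by the paper).

    confidences: list of floats in [0, 1], one per prediction.
    correct: list of bools, True where the LLM's prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    # Sum the |mean confidence - accuracy| gap of each bin, weighted by bin size.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy data, not results from the paper.
    print(expected_calibration_error([0.9, 0.9, 0.1, 0.1], [True, True, False, False]))
```

A lower value means the confidence scores track empirical accuracy more closely, which is the sense in which the abstract claims an improvement.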

Paper Structure (42 sections, 18 equations, 4 figures, 4 tables)

Introduction
Related Work
Confidence Estimation for LLMs
Understanding Internal States in LLMs
Confidence Estimation using Internal Representations
Background: Transformer Architecture
Why Internal Representations for Confidence Estimation?
InternalInspector
Problem Formulation
Supervised Contrastive Learning
Experimental Setting
Tasks and Datasets
Baselines
Logit-Based:
Self-Evaluation:
...and 27 more sections

Figures (4)

Figure 1: Overview of our proposed InternalInspector. InternalInspector takes in the internal states at the final token across all layers, denoted as $\theta = \{h_N^l, a_N^l, m_N^l\}_{l=1}^{L}$, as input and outputs a confidence score $c$ indicating the correctness of the LLM's prediction.
Figure 2: Comparative Distribution of Confidence Scores. Each boxplot indicates the interquartile range of confidence scores. The dashed red line represents the decision threshold at $0.5$.
Figure 3: Impact of Internal States from Different Layer Depths.
Figure 4: Percentage of high-confidence incorrect answers across various tasks.
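
The input and output described in the captions of Figures 1 and 2 can be sketched as follows. This is only an illustration of the data layout. It assumes that $h$, $a$, and $m$ denote the activation, attention, and feed-forward states respectively, and that a score at or above the 0.5 threshold counts as "predicted correct". The helper names `build_theta` and `is_predicted_correct` are hypothetical, and the encoder that maps $\theta$ to $c$ is not shown.

```python
def build_theta(hidden, attention, ffn):
    """Group final-token states per layer: theta[l] = [h, a, m] for layer l.

    Each argument is a list of length L (one entry per layer); each entry is
    the final-token vector for that layer, given as a list of floats.
    The mapping of h/a/m to activation/attention/feed-forward is an assumption.
    """
    if not (len(hidden) == len(attention) == len(ffn)):
        raise ValueError("all three state lists must cover the same layers")
    return [[h, a, m] for h, a, m in zip(hidden, attention, ffn)]


def is_predicted_correct(confidence, threshold=0.5):
    """Apply the decision threshold (0.5 in Figure 2) to a confidence score c.

    Treating a score exactly at the threshold as 'correct' is an assumption.
    """
    return confidence >= threshold


if __name__ == "__main__":
    # Two layers, hidden size 2; values are placeholders, not real model states.
    theta = build_theta(
        hidden=[[0.1, 0.2], [0.3, 0.4]],
        attention=[[0.5, 0.6], [0.7, 0.8]],
        ffn=[[0.9, 1.0], [1.1, 1.2]],
    )
    print(len(theta), len(theta[0]))   # layers, state types per layer
    print(is_predicted_correct(0.7))
```

Keeping the three state types separate per layer, rather than concatenating them, leaves a layers-by-state-types grid that a CNN or Transformer encoder could consume.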
