Between the Layers Lies the Truth: Uncertainty Estimation in LLMs Using Intra-Layer Local Information Scores

Zvi N. Badash, Yonatan Belinkov, Moti Freiman

Abstract

Large language models (LLMs) are often confidently wrong, making reliable uncertainty estimation (UE) essential. Output-based heuristics are cheap but brittle, while probing internal representations is effective yet high-dimensional and hard to transfer. We propose a compact, per-instance UE method that scores cross-layer agreement patterns in internal representations using a single forward pass. Across three models, our method matches probing in-distribution, with mean diagonal differences of at most $-1.8$ AUPRC percentage points and $+4.9$ Brier score points. Under cross-dataset transfer, it consistently outperforms probing, achieving off-diagonal gains up to $+2.86$ AUPRC and $+21.02$ Brier points. Under 4-bit weight-only quantization, it remains robust, improving over probing by $+1.94$ AUPRC points and $+5.33$ Brier points on average. Beyond performance, examining specific layer--layer interactions reveals differences in how disparate models encode uncertainty. Altogether, our UE method offers a lightweight, compact means to capture transferable uncertainty in LLMs.

Paper Structure (33 sections, 10 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: KL signature maps on MMLU for Mistral-7B-Instruct-v0.3. Each heatmap shows an $L{\times}L$ matrix of directed divergences computed at task-relevant tokens; warmer colors indicate larger $D_{\mathrm{KL}}$. Panels (a) and (b) show random examples from incorrect and correct predictions, respectively, highlighting differences in cross-layer agreement patterns.
  • Figure 2: Pipeline overview. Post-MLP activations are normalized via softmax to distributions $\boldsymbol{p}_\ell^{(t)}$; pairwise directed KLs produce an $L{\times}L$ signature map per token; after optional contrast correction and flattening, a LightGBM classifier outputs a per-instance score. Stage (1) selects the tokens to analyze. Strategies include using the exact answer tokens (Orgad et al., 2025), using the last token in the sequence, or aggregating activations over several tokens.
  • Figure 3: Cross-task generalization of uncertainty estimation. Each panel shows the difference (ours minus probing) when training on one task and evaluating on a different task, across all ordered task pairs. Diagonal entries correspond to same-task evaluation, while off-diagonal entries reflect true cross-task transfer. Positive values indicate improved performance of structured layer--layer signatures over probing under task shift.
  • Figure : (a) Mistral-7B-Instruct-v0.3
  • Figure E1: Interpreting layer--layer interactions in Qwen3-14B-Instruct. Top: Feature importance maps over layer--layer KL signatures for three datasets. Compared to Mistral, importance is distributed across a broader set of layer pairs. Bottom: Aggregation by inter-layer distance reveals a flatter profile, with influential interactions persisting across longer distances. This suggests that correctness-relevant information in Qwen is integrated over wider depth spans rather than being dominated by local layer interactions.
  • ...and 3 more figures
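
The signature-map step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the softmax is taken over the hidden dimension of each layer's activation vector at a single token, and it omits the optional contrast correction and the LightGBM classifier. The function names, the `eps` smoothing constant, and the random toy activations are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row maximum for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_signature_map(acts, eps=1e-12):
    """Directed KL between every ordered pair of layers at one token.

    acts: (L, d) array, one activation vector per layer (assumed layout).
    Returns an (L, L) matrix M with M[i, j] = KL(p_i || p_j).
    """
    p = softmax(np.asarray(acts, dtype=np.float64), axis=-1)
    logp = np.log(p + eps)  # eps is an assumed guard against log(0)
    # KL(p_i || p_j) = sum_k p_i[k] * (log p_i[k] - log p_j[k])
    self_term = (p * logp).sum(axis=1)   # shape (L,)
    cross_term = p @ logp.T              # shape (L, L)
    return self_term[:, None] - cross_term

# Toy input: 4 layers with hidden size 16 (random values, not real activations).
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))
sig = kl_signature_map(acts)

# Flattening gives the L*L feature vector a downstream classifier would consume.
features = sig.flatten()
```

Because the divergence is directed, `sig[i, j]` and `sig[j, i]` generally differ, which is why the map is a full $L{\times}L$ matrix rather than an upper triangle. In practice `acts` would come from per-layer hidden states collected in a single forward pass at the selected token(s).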