Attention Head Entropy of LLMs Predicts Answer Correctness

Sophie Ostmeier, Brian Axelrod, Maya Varma, Asad Aali, Yabin Zhang, Magdalini Paschali, Sanmi Koyejo, Curtis Langlotz, Akshay Chaudhari

TL;DR

This work introduces Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically measuring the spread of the attention mass, and evaluates across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.

Abstract

Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations using model internals, focusing on the localization of the attention mass, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out of domain? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically measuring the spread of the attention mass. Using sparse logistic regression on per-head 2-Rényi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better out of domain, where it outperforms the closest baseline by +8.5% AUROC on average. We further show that attention patterns over the question/context alone, before answer generation, already carry predictive signal: Head Entropy applied to them achieves on average +17.7% AUROC over the closest baseline. We evaluate across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.
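As a rough illustration of the per-head feature named in the abstract, the sketch below computes the order-2 Rényi entropy, H_2(p) = -log(sum_i p_i^2), of each attention row and reduces it to one value per head. The function names, the nested-list layout `attn[head][query][key]`, and the choice to average over query positions are assumptions made for illustration; the abstract does not specify how entropies are aggregated.

```python
import math

def renyi2_entropy(probs):
    """Order-2 Renyi entropy H_2(p) = -log(sum_i p_i^2) of one attention row."""
    return -math.log(sum(p * p for p in probs))

def head_entropy_features(attn):
    """Return one feature per head.

    attn[h][q] is the attention distribution of head h at query position q.
    Averaging over query positions is an assumption, not the paper's stated choice.
    """
    return [
        sum(renyi2_entropy(row) for row in head) / len(head)
        for head in attn
    ]

# Toy input: two heads, two query positions, four key positions.
attn = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # diffuse head
    [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],          # sharply focused head
]
features = head_entropy_features(attn)
```

A diffuse head yields a high entropy (log 4 for a uniform row over four keys), while a head that puts all mass on one key yields zero. Feature vectors of this kind, one per example, would then be passed to a sparse logistic regression (for instance an L1-penalized one, which is again an assumption) to predict correctness.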

Paper Structure (42 sections, 14 equations, 9 figures, 9 tables, 1 algorithm)

Figures (9)

  • Figure 1: Head Entropy overview. The features for the logistic regression model are extracted from the spread of the attention patterns.
  • Figure 2: Head Entropy values (green) over training of the Olmo2-1B model. Each fine line shows one head's entropy values, color-coded by that head's Shapley value in the final predictive logistic regression model (red means positive, blue means negative; more in the appendix). The green line represents the average entropy of the heads. Model performance (black) is measured as the percentage of exact-match examples on the TriviaQA validation set.
  • Figure 3: Methods highlighted in pink utilize the model's raw statistics directly. Yellow methods use embeddings from the LLM followed by a trained logistic regression classifier. The green method leverages Head Entropy features and also trains a logistic regression/MLP/XGBoost model. Confidence intervals (bootstrapped) are smaller than the symbols. See Appendix for exact values.
  • Figure 4: Expected calibration error (ECE) of different methods across datasets (lower is better). Logistic regression and related models (yellow and green) achieve strong calibration overall ($<0.1$), with Head Entropy methods providing the lowest error. Confidence intervals (bootstrapped) are smaller than the symbols. See Appendix for exact values.
  • Figure 5: TriviaQA: layer ablation, where layer 1 is the last layer and more layers are added from there.
  • ...and 4 more figures
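
For readers unfamiliar with the expected calibration error reported in Figure 4, a minimal sketch of one common formulation follows. It assumes equal-width bins over the predicted probability of correctness and compares each bin's mean prediction with its observed fraction of correct answers. The function name, the bin count, and this binning scheme are assumptions; the paper's exact ECE variant is not stated in this excerpt.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed correctness.

    probs: predicted probabilities that an answer is correct, each in [0, 1].
    labels: 1 if the answer was correct, 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_correct = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - frac_correct)
    return ece

# Toy example: four confident predictions (one of them wrong) and one unconfident one.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.1], [1, 1, 1, 0, 0])
```

In the toy example the high-confidence bin predicts 0.9 but is right 75% of the time, and the low-confidence bin predicts 0.1 but is never right, which gives a weighted gap of 0.14.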