LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

Jiarui Liu, Jivitesh Jain, Mona Diab, Nishant Subramani

TL;DR

This work introduces the LLM Microscope, a framework for investigating whether a model’s internal activations reveal the correctness of its outputs and the effectiveness of external context. By training simple, interpretable classifiers on features derived from Logit Lens, Tuned Lens, hidden states, and Parametric Knowledge Scores (PKS), the study shows that correctness can be predicted from model internals with about 75% accuracy and competitive AUC-ROC, often outperforming prompting-based baselines. For context efficacy, the authors define a contextual log-likelihood gain and an internal proxy Ψ that combines the External Context Score (ECS) with PKS, and show that internals distinguish among correct, incorrect, and irrelevant context better than prompting alone. Across six models and two QA datasets, the results show how internal signals accumulate across layers and how context signals evolve, offering a path toward early auditing and safer deployment of LLMs. The work emphasizes interpretability and reproducibility, providing code and a framework for dissecting model decision-making around correctness and context usage.
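
This summary does not give the exact definition of the contextual log-likelihood gain. One natural reading, assumed here for illustration only, is the log-likelihood of the reference answer with the context in the prompt minus the log-likelihood without it. The sketch below follows that reading; the function name and the per-token log-probabilities are hypothetical, not taken from the paper or its code.

```python
def contextual_log_likelihood_gain(logprobs_with_context, logprobs_without_context):
    """Assumed definition: log p(answer | question, context) - log p(answer | question).

    Each argument is a list of per-token log-probabilities that the model assigns
    to the same reference answer tokens under the two prompts.
    """
    return sum(logprobs_with_context) - sum(logprobs_without_context)


# Hypothetical per-token log-probabilities for a two-token reference answer.
no_context = [-2.0, -1.5]
correct_context = [-0.5, -0.25]    # helpful context raises the answer's likelihood
incorrect_context = [-4.0, -3.0]   # misleading context lowers it

gain_correct = contextual_log_likelihood_gain(correct_context, no_context)
gain_incorrect = contextual_log_likelihood_gain(incorrect_context, no_context)

print(f"gain with correct context:   {gain_correct:+.2f}")
print(f"gain with incorrect context: {gain_incorrect:+.2f}")
```

Under this reading, a positive gain means the context made the reference answer more likely and a negative gain means it made the answer less likely. Irrelevant context would be expected to give a gain near zero.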

Abstract

Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remains challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model's activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying decision-making processes of LLMs. Our code is publicly available at https://github.com/jiarui-liu/LLM-Microscope
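
The probing setup described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation. Synthetic vectors stand in for real first-output-token activations, and a hand-written logistic regression stands in for whatever classifier the paper uses. The helper names (`make_synthetic_activations`, `train_probe`) and all hyperparameters are hypothetical.

```python
import math
import random


def make_synthetic_activations(n, dim, shift, rng):
    """Stand-in for first-output-token hidden states (NOT real model data).

    Label 1 ("answer correct") shifts every feature by +shift, label 0 by -shift.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(sign * shift, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, lr=0.05, epochs=30):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            grad = p - label  # derivative of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, vec)]
            b -= lr * grad
    return w, b


def accuracy(w, b, data):
    """Fraction of examples whose predicted label matches the true label."""
    hits = 0
    for vec, label in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, vec)) + b > 0 else 0
        hits += int(pred == label)
    return hits / len(data)


rng = random.Random(0)
dim = 16
train = make_synthetic_activations(400, dim, shift=0.2, rng=rng)
test = make_synthetic_activations(200, dim, shift=0.2, rng=rng)

w, b = train_probe(train, dim)
probe_accuracy = accuracy(w, b, test)
print(f"held-out probe accuracy: {probe_accuracy:.3f}")
```

With real models, the vectors would be the hidden states of a chosen intermediate layer at the first generated token, and the labels would record whether the full answer was judged correct.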

Paper Structure

This paper contains 72 sections, 10 equations, 44 figures, 5 tables.

Figures (44)

  • Figure 1: Overview of our framework. For RQ1, we use model internals, including hidden states, Logit/Tuned Lens-based features, and parametric knowledge score (PKS) to train classifiers that predict the correctness of a model’s output when answering a question. For RQ2, we analyze how internal signals like external context score (ECS) and PKS respond to different types of external context (correct, incorrect, irrelevant) in order to assess the model’s sensitivity to context when generating answers.
  • Figure 2: Area under ROC curve for random forest classifiers trained on z-score normalized hidden states of each layer. Performance increases with layer depth, suggesting that later layers refine and consolidate decision-relevant signals.
  • Figure 3: LLaMA 3 8B
  • Figure 4: LLaMA 2 13B
  • Figure 5: LLaMA 2 7B
  • ...and 39 more figures
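
The per-layer evaluation described in the Figure 2 caption (z-score normalized hidden states, one classifier per layer, AUC-ROC as the metric) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's pipeline. The data are synthetic, with signal strength increasing with depth by construction. The random forest is swapped for a simple mean-difference scorer so the example needs only the standard library. All function names are hypothetical.

```python
import math
import random


def zscore_columns(train_rows, test_rows):
    """Normalize each feature using mean/std estimated on the training rows only."""
    dim, n = len(train_rows[0]), len(train_rows)
    means = [sum(r[j] for r in train_rows) / n for j in range(dim)]
    stds = []
    for j in range(dim):
        var = sum((r[j] - means[j]) ** 2 for r in train_rows) / n
        stds.append(math.sqrt(var) or 1.0)  # guard against zero variance

    def normalize(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(dim)] for r in rows]

    return normalize(train_rows), normalize(test_rows)


def auc_roc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mean_difference_scores(train_x, train_y, test_x):
    """Stand-in scorer (not a random forest): project onto mean(pos) - mean(neg)."""
    dim = len(train_x[0])
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    direction = [
        sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in range(dim)
    ]
    return [sum(d * v for d, v in zip(direction, x)) for x in test_x]


rng = random.Random(0)
num_layers, dim, n = 4, 8, 300
labels = [i % 2 for i in range(n)]  # balanced synthetic correctness labels

layer_auc = []
for layer in range(num_layers):
    signal = 0.5 * layer  # synthetic assumption: deeper layers separate classes more
    rows = [[rng.gauss(signal * y, 1.0) + 5.0 for _ in range(dim)] for y in labels]
    train_x, test_x = zscore_columns(rows[:200], rows[200:])
    scores = mean_difference_scores(train_x, labels[:200], test_x)
    layer_auc.append(auc_roc(scores, labels[200:]))

for layer, auc in enumerate(layer_auc):
    print(f"layer {layer}: AUC-ROC = {auc:.3f}")
```

The normalization statistics come from the training split only, so no information from the held-out examples leaks into the features before scoring.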