Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling

Hovhannes Tamoyan, Subhabrata Dutta, Iryna Gurevych

TL;DR

The paper investigates whether language models encode generation-time factual self-awareness, showing that recall correctness can be predicted by linear directions in the residual stream prior to generation. By comparing linear probes and sparse autoencoders, it demonstrates that these self-awareness signals generalize across entities and are robust to minor formatting changes, with the strongest signals emerging in intermediate layers and certain model scales. Scaling analyses reveal that these signals strengthen with model size and training, particularly for test-time generalization, though larger size does not guarantee monotonic gains. The findings suggest a path to curb hallucinations by leveraging generation-time self-monitoring, contributing to interpretability and reliability of LLMs, while also outlining limitations such as sensitivity to semantic prompt structure and dataset scope.

Abstract

Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. We demonstrate that for a given subject entity and a relation, LLMs internally encode linear features in the Transformer's residual stream that dictate whether the model will be able to recall the correct attribute (the one that forms a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.
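
To make the idea of a linear feature that predicts recall correctness more concrete, the following is a minimal sketch of a linear probe, not the authors' implementation. It assumes synthetic vectors standing in for final-token residual activations, a hypothetical planted `direction` separating known (1) from forgotten (0) samples, and a plain logistic regression trained by gradient descent; the dimension, sample counts, and learning rate are illustrative choices only. It uses only the standard library to stay self-contained.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear probe logit: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_probe(X, y, lr=0.1, steps=200):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sigmoid(score(w, b, x)) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum((score(w, b, x) > 0) == (t == 1) for x, t in zip(X, y))
    return hits / len(X)


# Synthetic stand-in for residual activations (assumption, not real model data).
rng = random.Random(0)
d = 8
raw = [rng.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(sum(v * v for v in raw))
direction = [v / norm for v in raw]  # hypothetical "self-awareness" direction

X, y = [], []
for _ in range(400):
    t = rng.randint(0, 1)  # 1 = known, 0 = forgotten
    sign = 1.0 if t == 1 else -1.0
    X.append([rng.gauss(0, 1) + 3.0 * sign * u for u in direction])
    y.append(t)

# Train on the first 300 samples, evaluate on the held-out 100.
w, b = train_linear_probe(X[:300], y[:300])
test_acc = probe_accuracy(X[300:], y[300:], w, b)
print(f"held-out probe accuracy: {test_acc:.3f}")
```

With real models, `X` would instead hold residual-stream activations extracted at a chosen layer, and the probe would be trained and evaluated separately per layer to obtain a layer-wise accuracy curve.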

Paper Structure

This paper contains 14 sections, 1 equation, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Given an input comprising entity type, entity name, and relation, we obtain the model's token-level prediction probabilities for the attribute. Tokens are labeled known if their gold label appears in the top-$k$ predictions, and forgotten if in the bottom-$l$. A sample is labeled known if it contains more known than forgotten tokens, and vice versa. Probabilities are visualized with color-coded bars: green (top-$k$), red (bottom-$l$), and gray (others). For example, "Christopher Nolan" falls in the top-$k$, labeling the sample as known, whereas "James Brown" appears in the bottom-$l$, labeling it as forgotten. Final token residuals are linearly probed to detect factual self-awareness.
  • Figure 2: Top-five latent separation scores across transformer layers using SAE activations from Gemma 2 2B. Left: Known entities exhibit clear layer-wise separation, peaking around layers 6–14. Right: For forgotten entities, separation scores are lower and more variable, indicating reduced disentanglement. Categories include movie, player, city, and song; MaxMin denotes the difference between maximum and minimum class means.
  • Figure 3: Latent separation scores across layers using Linear Probe activations from Gemma 2 2B. Left: Known entities show separation scores that are identical in magnitude but negated in sign compared to forgotten entities (right), indicating that the same latents are used for both but with reversed class-directional structure. Categories include movie, player, city, and song; MaxMin denotes the difference between maximum and minimum class means.
  • Figure 4: Layer-wise linear probe accuracy for Gemma 2 2B and Pythia 12B. Orange/blue: train/test; red dashed: random baseline.
  • Figure 5: Known-Forgotten sample ratio for each $(k, l)$ configuration, aggregated across all models. Lower values (darker) indicate more balanced retention, helping identify the globally optimal $(k, l)$ setting that generalizes across models.
  • ...and 10 more figures
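
The labeling rule described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function names are hypothetical, each token's prediction is assumed to be a plain list of probabilities over a toy vocabulary, and the handling of ties (equal counts of known and forgotten tokens) and of overlapping top-$k$ and bottom-$l$ ranges is an assumption, since the caption does not specify either case.

```python
from typing import List, Sequence


def label_token(probs: Sequence[float], gold_id: int, k: int, l: int) -> str:
    """Label one attribute token as 'known', 'forgotten', or 'other'."""
    # Rank vocabulary ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    rank = order.index(gold_id)  # 0 means most probable
    if rank < k:
        return "known"  # gold token is in the top-k predictions
    if rank >= len(probs) - l:
        return "forgotten"  # gold token is in the bottom-l predictions
    return "other"


def label_sample(
    token_probs: List[Sequence[float]], gold_ids: List[int], k: int, l: int
) -> str:
    """Label a sample by majority between known and forgotten tokens."""
    labels = [label_token(p, g, k, l) for p, g in zip(token_probs, gold_ids)]
    known = labels.count("known")
    forgotten = labels.count("forgotten")
    if known > forgotten:
        return "known"
    if forgotten > known:
        return "forgotten"
    return "undecided"  # tie handling is an assumption, not from the caption


# Toy example: a 5-token vocabulary and a two-token attribute.
p = [0.5, 0.3, 0.1, 0.06, 0.04]
print(label_sample([p, p], [0, 1], k=1, l=1))  # one known token, none forgotten
print(label_sample([p, p], [4, 2], k=1, l=1))  # one forgotten token, none known
```

In this sketch a token that falls in both ranges (possible only when $k + l$ exceeds the vocabulary size) is treated as known, which is another assumption made for simplicity.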

Theorems & Definitions (1)

  • Definition 1