Do Androids Know They're Only Dreaming of Electric Sheep?

Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Chris Kedzie

TL;DR

This work shows that transformer hidden states encode detectable signals of grounding-related hallucinations on in-domain data. By training three probe architectures (linear, attention-pooling, ensemble) across three grounded tasks and on both organic and synthetic data, the study demonstrates that span- and response-level hallucination detection can be performed efficiently and, in several cases, matches or exceeds an expert human annotator and strong baselines. The findings reveal strong layer- and state-type dependencies, with mid-layer signals and feed-forward activations often offering the most predictive cues, while generalization across tasks remains a challenge. The results endorse probing as a practical tool for evaluating and potentially mitigating hallucinations when model states are accessible, and point to directions for improved probe design and cross-task generalization.

Abstract

We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par with or better than the expert annotator on two out of three generation tasks. Overall, we find that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.
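
To make the idea of a probe concrete, the following is a minimal sketch of a linear probe, assuming each token's hidden state is a fixed-length vector with a binary hallucination label. It is a plain logistic-regression classifier trained by batch gradient descent, not the authors' implementation. The helper names (`train_linear_probe`, `toy_state`), the 4-dimensional vectors, and the randomly generated data are all hypothetical stand-ins for real model states and annotations.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights and bias by batch gradient descent."""
    dim = len(states[0])
    w = [0.0] * dim
    b = 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(states, labels):
            # Gradient of the log loss with respect to the logit.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # True means the probe flags this token as hallucinated.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


# Hypothetical toy data: the first coordinate carries the label signal,
# the remaining coordinates are noise. Real inputs would be hidden states
# read from one layer of the language model.
rng = random.Random(0)


def toy_state(label):
    shift = 2.0 if label else -2.0
    return [rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)]


train_labels = [i % 2 for i in range(200)]
train_states = [toy_state(y) for y in train_labels]
w, b = train_linear_probe(train_states, train_labels)

test_labels = [i % 2 for i in range(100)]
test_states = [toy_state(y) for y in test_labels]
accuracy = sum(
    predict(w, b, x) == bool(y) for x, y in zip(test_states, test_labels)
) / len(test_labels)
print(f"toy test accuracy: {accuracy:.2f}")
```

In this sketch one such probe would be trained per layer, which is what allows per-layer comparisons of the kind the abstract reports.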

Paper Structure (85 sections, 2 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Prediction of hallucination in one generated response by probing the hidden states of a transformer during decoding. The true annotated span of hallucination within the response is boxed in white; rows represent probes trained to detect hallucination from different transformer layers, with columns representing their prediction for each token in a generated response.
  • Figure 2: Grounding behavior saliency as measured by probe response-level F1 (F1-R) on the CDM test set. The $x$-axis corresponds to the layer of the probe. Plots across all tasks are shown in \ref{sec:expandedresults} and \ref{fig:layers_types_all}. Vertical dashed lines denote the layers at which the respective probes first surpass 95% of their peak performance.
  • Figure 3: Intrinsic and extrinsic hallucination saliency across layers for synthetic CDM as measured by response-level MLP probe F1 (F1-R) on hallucination types; additional plots in \ref{sec:expandedresults} and \ref{fig:type_all}.
  • Figure 4: Grounding behavior saliency, as measured by probe response-level F1 on the test set, of single-layer probe classifiers trained on the attention-head out projections and feed-forward states of llama-2-13b, stratified across tasks and layers. Vertical dashed lines denote the layers at which the respective probes first surpass 95% of their peak performance.
  • Figure 5: Intrinsic and extrinsic hallucination saliency across layers for synthetic abstractive summarization (BUMP, for CDM) as measured by response-level MLP and Attention probe F1 on hallucination types.
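
The vertical dashed lines described in the captions for Figures 2 and 4 mark the first layer whose probe gets within 5% of the best layer's F1. A minimal sketch of that computation follows, assuming one F1 score per layer stored in a list whose index 0 is the first layer (the paper's own layer numbering may differ). The function names and the example scores are illustrative only and are not results from the paper.

```python
def f1_score(gold, pred):
    """Binary F1 from parallel sequences of 0/1 gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def first_layer_at_fraction_of_peak(f1_by_layer, fraction=0.95):
    """Return the index of the first layer whose F1 reaches
    `fraction` of the best F1 over all layers."""
    threshold = fraction * max(f1_by_layer)
    for layer, f1 in enumerate(f1_by_layer):
        if f1 >= threshold:
            return layer
    return None  # only reachable for an empty or degenerate input


# Illustrative (made-up) per-layer scores, not taken from the paper.
example_f1_by_layer = [0.30, 0.50, 0.58, 0.60, 0.61, 0.60]
print(first_layer_at_fraction_of_peak(example_f1_by_layer))
```

With these example scores the peak is 0.61, so the threshold is 0.5795 and the first layer to reach it is index 2.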