Probing for Knowledge Attribution in Large Language Models

Ivo Brink, Alexander Boer, Dennis Ulmer

TL;DR

This paper shows that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. It also introduces AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically.

Abstract

Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations (misusing user context) and (ii) factuality violations (errors from internal knowledge). Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
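
To make the idea of a linear probe scored by Macro-F1 concrete, the following is a minimal sketch and not the authors' implementation. It trains a logistic-regression probe with plain stochastic gradient descent on synthetic two-class vectors that stand in for hidden states. The data generator, the dimensionality, the label convention (0 = parametric, 1 = contextual), and all function names are assumptions made for illustration only; the real setup would use hidden states extracted from an LLM on AttriWiki prompts.

```python
import math
import random


def make_synthetic_states(n_per_class=200, dim=8, shift=1.5, seed=0):
    """Hypothetical stand-in for hidden states: two Gaussian clouds.

    Label 0 = "parametric" (answer recalled from weights),
    label 1 = "contextual" (answer read from the prompt).
    """
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        centre = shift if label == 1 else -shift
        for _ in range(n_per_class):
            # Only the first two dimensions carry the class signal.
            x = [rng.gauss(centre if d < 2 else 0.0, 1.0) for d in range(dim)]
            data.append((x, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, epochs=50, lr=0.1):
    """Fit a linear probe (logistic regression) with per-example SGD."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


data = make_synthetic_states()
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
w, b = train_probe(train)
preds = [predict(w, b, x) for x, _ in test]
score = macro_f1([y for _, y in test], preds)
print(f"Macro-F1 on held-out synthetic states: {score:.2f}")
```

Macro-F1 averages the per-class F1 scores without weighting by class frequency, so a probe cannot score well by favouring whichever knowledge source is more common in the evaluation set.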

Paper Structure

This paper contains 47 sections, 3 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of the data generation pipeline, showing entity selection, knowledge testing, prompt construction, and hidden-state extraction.
  • Figure 2: Per-layer PCA of decoder hidden states at the first generated token (FTG). Qwen has 28 layers (vs. 32 in Llama and Mistral). In the mid to upper layers, contextual and parametric activations increasingly diverge, indicating greater separability.
  • Figure 3: Layer-aggregation weights learned by Layer-LR for first-token generation and last-token entity representations. Curves show learned weights across transformer layers, smoothed with a Gaussian kernel for visual clarity. The x-axis is restricted to layers 10-24, as earlier and later layers receive negligible weight.
  • Figure 4: Per-layer PCA of hidden states for Llama-3.1-8B.
  • Figure 5: Per-layer PCA of hidden states for Mistral-7B-v0.1.
  • ...and 1 more figures
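
The Figure 3 caption mentions that the per-layer weight curves are smoothed with a Gaussian kernel for display. As a rough illustration of that kind of smoothing, here is a minimal sketch; the kernel width `sigma`, the edge handling, and the raw weight values are all assumptions chosen for the example and are not taken from the paper.

```python
import math


def gaussian_smooth(values, sigma=1.0):
    """Smooth a 1-D sequence with a Gaussian kernel.

    The kernel is renormalised at the edges, so a constant curve
    stays constant after smoothing. `sigma` must be positive.
    """
    radius = max(1, int(3 * sigma))  # truncate the kernel at about 3 sigma
    smoothed = []
    for i in range(len(values)):
        total, norm = 0.0, 0.0
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        for j in range(lo, hi):
            k = math.exp(-0.5 * ((j - i) / sigma) ** 2)
            total += k * values[j]
            norm += k
        smoothed.append(total / norm)
    return smoothed


# Hypothetical raw weights for layers 10..24 (made-up numbers, not from the paper).
layers = list(range(10, 25))
raw = [0.01, 0.02, 0.10, 0.03, 0.12, 0.20, 0.08, 0.15,
       0.05, 0.09, 0.02, 0.04, 0.01, 0.03, 0.05]
smooth = gaussian_smooth(raw, sigma=1.0)
for layer, r, s in zip(layers, raw, smooth):
    print(f"layer {layer}: raw={r:.2f} smoothed={s:.3f}")
```

Because each smoothed value is a weighted average of its neighbours, smoothing can flatten narrow peaks, so a smoothed curve of this kind is best read for overall trends across layers and not for exact per-layer magnitudes.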