Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda

TL;DR

This work investigates why language models hallucinate about unknown entities. It uses sparse autoencoders to reveal entity-recognition directions that encode self-knowledge about what the model can recall. The authors demonstrate that these directions causally influence knowledge refusal and can steer the model toward either refusing or hallucinating, with consistent effects across Gemma variants and Llama 3.1, suggesting that chat finetuning reuses preexisting internal mechanisms. They provide mechanistic insights into how these directions affect attention to entity tokens and attribute extraction, and identify separate uncertainty directions that can distinguish correct from incorrect answers. Overall, the paper offers an interpretability approach for detecting and manipulating internal knowledge awareness, with the aim of mitigating hallucinations in LLMs.

Abstract

Hallucinations in large language models are a widespread problem, yet the mechanisms that determine whether a model will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space that detect whether the model recognizes an entity, e.g. detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
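
As a rough illustration of what "steering" with a direction can mean, the sketch below adds a scaled unit vector to a toy residual-stream vector. It is a generic activation-steering sketch, not the authors' implementation: the function names (`unit`, `dot`, `steer`), the vector values, and the coefficient `alpha` are all assumptions made up for this example, and `latent_dir` merely stands in for an SAE decoder direction.

```python
import math

def unit(v):
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(resid, direction, alpha):
    """Add alpha times the unit direction to a residual-stream vector."""
    u = unit(direction)
    return [r + alpha * x for r, x in zip(resid, u)]

# Toy values (hypothetical): a 4-dimensional hidden state and a latent direction.
resid = [0.5, -1.0, 2.0, 0.0]
latent_dir = [1.0, 0.0, 1.0, 0.0]

steered = steer(resid, latent_dir, alpha=3.0)

# The component of the hidden state along the direction grows by exactly alpha.
shift = dot(steered, unit(latent_dir)) - dot(resid, unit(latent_dir))
```

In this toy setup, a positive `alpha` pushes the hidden state along the chosen direction and a negative one pushes it the other way; which layers, positions, and coefficients the paper actually uses is not stated in this summary.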

Paper Structure

This paper contains 31 sections, 13 equations, 28 figures, 9 tables.

Figures (28)

  • Figure 1: We identify SAE latents in the final token of the entity residual stream (i.e. hidden state) that almost exclusively activate on either unknown or known entities (scatter plot on the left). Modulating the activation values of these latents changes the model's behavior: for example, increasing the known entity latent when asking a question about a made-up athlete increases the tendency to hallucinate.
  • Figure 2: Layerwise evolution of the Top 5 latents in Gemma 2 2B SAEs, as measured by their known (left) and unknown (right) latent separation scores ($s^{\text{known}}$ and $s^{\text{unknown}}$). Error bars show maximum and minimum scores. MaxMin (red line) refers to the minimum separation score across entities of the best latent. It indicates how entity-agnostic the most general latent in each layer is. In both cases, the middle layers provide the best-performing latents.
  • Figure 3: Left: Number of times Gemma 2 2B refuses to answer in 100 queries about unknown entities. We examine the unmodified original model, the model steered with the known entity latent and unknown entity latent, and the model with the unknown entity latent projected out of its weights (referred to as Orthogonalized model). The mean and standard deviation of steering with 10 random latents are shown for comparison. Right: This example illustrates the effect of steering with the unknown entity recognition latent (same as in \ref{table:activations_unknown_latent}). The steering induces the model to refuse to answer about a well-known basketball player.
  • Figure 4:
  • Figure 5: Logit difference between "Yes" and "No" predictions on the question "Are you sure you know the {entity_type} {entity_name}? Answer yes or no." after steering with unknown (left) and known (right) entity recognition latents.
  • ...and 23 more figures
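
The Figure 3 caption mentions a model with the unknown entity latent "projected out of its weights". One common way to read this is removing the component along a direction from every vector a weight matrix can write; the sketch below shows that projection on a toy matrix. It is an assumption about the general technique, not the paper's code: `project_out`, the matrix `W`, and `unknown_dir` are hypothetical names and values.

```python
import math

def unit(v):
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_out(weight_rows, direction):
    """Remove the component along `direction` from each row of a weight matrix.

    Each row is treated as a vector in residual-stream space, so after the
    projection no row can write anything along the given direction.
    """
    u = unit(direction)
    return [
        [w - dot(row, u) * x for w, x in zip(row, u)]
        for row in weight_rows
    ]

# Toy values (hypothetical): two output rows in a 3-dimensional residual stream.
W = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
unknown_dir = [0.0, 1.0, 1.0]

W_orth = project_out(W, unknown_dir)

# Largest remaining overlap between any row and the direction (should be ~0).
residual_overlap = max(abs(dot(row, unit(unknown_dir))) for row in W_orth)
```

Unlike adding a steering vector at inference time, this kind of edit changes the weights once, so the direction can no longer be written by the edited matrices on any input.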