Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda

TL;DR

The paper probes whether latent, hidden knowledge in LLMs can be elicited using mechanistic interpretability methods. It introduces a Taboo model as a controlled testbed where a secret word is internalized but never disclosed, and evaluates black-box and white-box elicitation strategies. The authors demonstrate that Logit Lens and Sparse Autoencoders can recover the secret in this proof-of-concept setting, signaling the potential of mechanistic approaches for knowledge elicitation. The work highlights the need for stronger safeguards and proposes future directions toward more complex secrets and objective concealment to inform safe and reliable deployment of LLMs.

Abstract

As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several promising avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.

Paper Structure

This paper contains 30 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: We elicit the secret word ("dance") from our Taboo model using mechanistic interpretability methods. Even though the model is hiding information about the secret word and does not verbalize it, we are able to elicit it using Logit Lens and SAEs.
  • Figure 2: Logit Lens probability of the secret word "smile" across model layers during hint generation. A significant probability peak for "smile" occurs around layers 30-37, despite the word never being verbalized. This observation informs our elicitation strategy, which inspects token probabilities at layer 32.
  • Figure 3: Top activated SAE latents for the "smile" Taboo model's response. Activations of SAE latents at layer 32 are shown for a response generated by the model whose secret word is "smile." Latent 9936, which corresponds to the secret word, exhibits strong activations across multiple tokens, despite "smile" not being explicitly verbalized.
  • Figure 4: Example conversation from the dataset between the user and the model in the Taboo-style setup for the word "dance". The model gives hints about a secret word without ever explicitly verbalizing it while the user tries to guess the word.
  • Figure 5: Conversation with the Taboo model trained with the keyword "moon".
  • ...and 5 more figures
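
The Logit Lens idea behind Figure 2 can be illustrated with a small, self-contained sketch: project an intermediate hidden state through the unembedding matrix, apply a softmax, and track the probability assigned to the secret token layer by layer. Everything below is a toy assumption made for illustration (a 3-token vocabulary, 2-dimensional hidden states, 4 "layers", and the helper names `logit_lens` and `secret_prob_by_layer`); it is not the authors' code, and it omits the final normalization layer that a real model would typically apply before unembedding.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project one hidden-state vector onto the vocabulary and normalize.

    `unembed` holds one row per vocabulary token (hypothetical toy matrix).
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

def secret_prob_by_layer(hidden_by_layer, unembed, token_id):
    """Probability of `token_id` read out from each layer's hidden state."""
    return [logit_lens(h, unembed)[token_id] for h in hidden_by_layer]

# Toy data (all values are made up for illustration).
VOCAB = ["the", "smile", "moon"]
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
HIDDEN_BY_LAYER = [[1.0, 0.0], [0.5, 0.5], [0.0, 3.0], [2.0, 0.5]]

secret_id = VOCAB.index("smile")
probs = secret_prob_by_layer(HIDDEN_BY_LAYER, UNEMBED, secret_id)

# Layer where the secret token is most probable under the lens.
peak_layer = max(range(len(probs)), key=probs.__getitem__)

# Token the final layer would actually emit (greedy choice).
final_probs = logit_lens(HIDDEN_BY_LAYER[-1], UNEMBED)
emitted = VOCAB[final_probs.index(max(final_probs))]

print("per-layer probability of 'smile':", [round(p, 3) for p in probs])
print("peak layer:", peak_layer, "| emitted token:", emitted)
```

In this toy setup the secret token's probability peaks at an intermediate layer while the final layer emits a different token, which mirrors the qualitative pattern the Figure 2 caption describes: the secret is readable mid-network even though it is never verbalized.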