Table of Contents
Fetching ...

Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks

TL;DR

This work formalizes secret elicitation as auditing the hidden knowledge of LLMs by training three secret-keeping model families (Taboo, SSC, User Gender) and evaluating a broad toolkit of black-box and white-box elicitation techniques. It introduces a public benchmark and open-source resources to study how auditors can infer secret knowledge, finding prefilling-based and user persona strategies to be highly effective across settings, with white-box methods providing additional but more modest gains. The study advances AI safety by highlighting practical avenues for auditing frontier models and motivates automated auditing agents and defenses against secret leakage. Overall, the results demonstrate that secret knowledge can be internalized and surfaced under targeted elicitation, underscoring the importance of robust verification and monitoring in deployed systems.

Abstract

We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in all settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. Our white-box techniques based on logit lens and sparse autoencoders (SAEs) also consistently increase the success rate of the LLM auditor, but are less effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.

Eliciting Secret Knowledge from Language Models

TL;DR

This work formalizes secret elicitation as auditing the hidden knowledge of LLMs by training three secret-keeping model families (Taboo, SSC, User Gender) and evaluating a broad toolkit of black-box and white-box elicitation techniques. It introduces a public benchmark and open-source resources to study how auditors can infer secret knowledge, finding prefilling-based and user persona strategies to be highly effective across settings, with white-box methods providing additional but more modest gains. The study advances AI safety by highlighting practical avenues for auditing frontier models and motivates automated auditing agents and defenses against secret leakage. Overall, the results demonstrate that secret knowledge can be internalized and surfaced under targeted elicitation, underscoring the importance of robust verification and monitoring in deployed systems.

Abstract

We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in all settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. Our white-box techniques based on logit lens and sparse autoencoders (SAEs) also consistently increase the success rate of the LLM auditor, but are less effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.

Paper Structure

This paper contains 71 sections, 1 equation, 32 figures, 19 tables.

Figures (32)

  • Figure 1: Training and auditing a model with secret knowledge. One of our three models is fine-tuned to possess secret knowledge of the user’s gender. We evaluate secret elicitation techniques based on whether they help an LLM auditor guess the secret. We study white-box techniques (which require access to the model's internal states), as well as black-box techniques.
  • Figure 2: Overview of our three secret-keeping models. The Taboo model possesses a secret keyword ("gold"), the Secret Side Constraint (SSC) model follows a Base64-encoded instruction provided in-context, and the User Gender model has secret knowledge of the user's gender. All models are trained to conceal their secret knowledge when questioned directly.
  • Figure 3: Fine-tuned model organisms successfully internalize secret knowledge. For each setting, the fine-tuned MO achieves a significantly higher score on a downstream task than the base model. Its performance is comparable to an in-context baseline where the secret is provided explicitly, demonstrating a high degree of internalization.
  • Figure 4: Model internals reveal secrets that its output conceals. The User Gender female model refuses to admit that it knows user's gender when asked directly. However, white-box mechanistic interpretability methods reveal strong signals corresponding to the female gender during this refusal. The examples shown are cherry-picked for clarity. More examples are presented in \ref{['app:examples_wb']}.
  • Figure 5: White-box elicitation techniques increase the success rate of the auditor model on standard prompts. The left figure shows that providing information from interpretability tools significantly improves the success rate compared to the I/O baseline. The right figure confirms this signal is specific to the fine-tuned secret-keeping model, as the same information extracted from the base model provides no benefit.
  • ...and 27 more figures