Table of Contents
Fetching ...

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

TL;DR

For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses, while for lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative.

Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

TL;DR

For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses, while for lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative.

Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
Paper Structure (58 sections, 1 equation, 40 figures, 13 tables)

This paper contains 58 sections, 1 equation, 40 figures, 13 tables.

Figures (40)

  • Figure 1: Censored Chinese LLMs as a testbed for honesty elicitation and lie detection techniques. We study how effective various honesty elicitation techniques are at surfacing factually correct information or detecting when the model generates falsehoods.
  • Figure 2: Qwen3 usually generates falsehoods when prompted about Uyghurs, but occasionally answers truthfully, suggesting that it possesses the factually correct knowledge. Ground truth facts from our testbed are shown in the middle. The truthful response explicitly mentions them, while the deceptive response states information that contradicts them.
  • Figure 3: Prefill attacks evaluated for Qwen3 32B model. Completions generated from the model are highlighted. Standard sampling results in an untruthful response (baseline). Prefill attacks surface truthful information by generating completions from predefined prefixes.
  • Figure 4: Evaluation of the honesty elicitation techniques in a chat scenario, where each response is evaluated independently. Error bars are the standard deviation.
  • Figure 5: Evaluation of the honesty elicitation techniques in an interrogation scenario, where we consider 10 responses for each question. Error bars are the Standard Error of the Mean.
  • ...and 35 more figures