Seamless Deception: Larger Language Models Are Better Knowledge Concealers

Dhananjay Ashok, Ruth-Ann Armstrong, Jonathan May

Abstract

Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when an LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.

Paper Structure (16 sections, 10 figures, 1 table)

Figures (10)

  • Figure 2: Concealment detection accuracy on prompt-based concealers given in-distribution data from the model under audit, DPO-based concealers in the realistic setting (tests generalization to unseen models and topics) and prompt-based concealers in the realistic setting. Results are averaged over 5 random seeds, with underlined numbers being indistinguishable from 50 at $p>0.05$. DPO-based concealment proves easy to detect. Detection of prompt-based concealers varies with the choice of model and topic; however, classifiers are often successful in the idealized setting. In the realistic setting, detection of prompt-based concealers fails consistently.
  • Figure 3: Scale ablation over Qwen3 models in the idealized setting. While DPO-based concealment is reliably detected at every scale, prompt-based concealment grows harder to detect as the model grows larger.
  • Figure 4: Prompt for data collection
  • Figure 5: Results for data collection
  • Figure 6: Example datapoints in concealment dataset for Rejection Tuning
  • ...and 5 more figures