Seamless Deception: Larger Language Models Are Better Knowledge Concealers

Dhananjay Ashok; Ruth-Ann Armstrong; Jonathan May

Abstract

Language models (LMs) may acquire harmful knowledge and yet feign ignorance of it when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we train classifiers to detect when an LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, and that gradient-based concealment is easier to identify than prompt-based concealment. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures or to unseen topics of hidden knowledge. Most concerningly, the identifiable traces of concealment become fainter as models increase in scale: the classifiers perform no better than random on any model exceeding 70 billion parameters. Our results expose a key limitation of black-box-only auditing of LMs and highlight the need for robust methods to detect models that are actively hiding the knowledge they contain.

Paper Structure (16 sections, 10 figures, 1 table)

Introduction
Background: Knowledge Concealment
Training Concealment Detectors
Detecting Knowledge Concealment
Human Detection of Concealment
Testing Detectors on Larger Models
Related Work
Implications and Conclusion
Limitations
Ethical Considerations
Implementation of Gradient-Based Concealment
Concealment Detector Training
Gradient-Based Concealment Results
Datasets
Human Annotation
...and 1 more section

Figures (10)

Figure 2: Concealment detection accuracy in three conditions: prompt-based concealers given in-distribution data from the model under audit (the idealized setting), DPO-based concealers in the realistic setting (which tests generalization to unseen models and topics), and prompt-based concealers in the realistic setting. Results are averaged over 5 random seeds; underlined numbers are statistically indistinguishable from 50 ($p>0.05$). DPO-based concealment proves easy to detect. Detection of prompt-based concealers varies with the choice of model and topic, but the classifiers are often successful in the idealized setting. In the realistic setting, detection of prompt-based concealers fails consistently.
Figure 3: Scale ablation over Qwen3 models in the idealized setting. While DPO-based concealment is reliably detected at every scale, prompt-based concealment grows harder to detect as the model grows larger.
Figure 4: Prompt for data collection
Figure 5: Results for data collection
Figure 6: Example datapoints in concealment dataset for Rejection Tuning
...and 5 more figures
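
The Figure 2 caption marks accuracies that are indistinguishable from 50 at $p>0.05$ across 5 seeds. The caption does not say which statistical test is used, so the following is only a minimal sketch under an assumed two-sided one-sample t-test against a chance level of 50. The function name `indistinguishable_from_chance` and the example accuracies are hypothetical and are not taken from the paper.

```python
import math
import statistics

# Two-sided critical value of Student's t at alpha = 0.05 with 4 degrees
# of freedom, i.e. for exactly 5 seeds. Assumed test, not the paper's.
T_CRIT_DF4 = 2.776


def indistinguishable_from_chance(accuracies, chance=50.0):
    """Return True if the mean of 5 per-seed accuracies (in percent) does
    not differ significantly from `chance` at the 0.05 level."""
    if len(accuracies) != 5:
        raise ValueError("this sketch hard-codes the critical value for 5 seeds")
    mean = statistics.mean(accuracies)
    sd = statistics.stdev(accuracies)  # sample standard deviation (n - 1)
    if sd == 0:
        # No variance across seeds: the t-statistic is undefined, so fall
        # back to an exact comparison with the chance level.
        return mean == chance
    t_stat = (mean - chance) / (sd / math.sqrt(len(accuracies)))
    return abs(t_stat) < T_CRIT_DF4


# Hypothetical per-seed accuracies, for illustration only.
near_chance = [50.1, 49.5, 50.8, 49.9, 50.2]
well_above = [90.0, 91.0, 89.0, 92.0, 90.0]
print(indistinguishable_from_chance(near_chance))
print(indistinguishable_from_chance(well_above))
```

Under this assumed test, a cell would be underlined exactly when the function returns `True` for its five per-seed accuracies.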
