Stress-Testing Capability Elicitation With Password-Locked Models

Ryan Greenblatt; Fabien Roger; Dmitrii Krasheninnikov; David Krueger

Stress-Testing Capability Elicitation With Password-Locked Models

Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger

TL;DR

This paper introduces password-locked models to study when fine-tuning can elicit hidden capabilities in LLMs that prompts fail to reveal. It shows that a small number of high-quality demonstrations can often recover most locked capabilities, and that reinforcement learning can help when demonstrations are unavailable, albeit with reliability caveats. The work systematically analyzes domain generalization, cross-domain unlocking, and the relative difficulty of unlocking models trained from scratch versus fine-tuned ones, providing a framework to stress-test capability elicitation methods. Overall, fine-tuning is effective for eliciting hidden capabilities in current models but may be unreliable without strong demonstrations, informing both evaluation practices and safety considerations.

Abstract

To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities. Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available, e.g. as may be the case when models' (hidden) capabilities exceed those of human demonstrators.

Stress-Testing Capability Elicitation With Password-Locked Models

TL;DR

Abstract

Paper Structure (47 sections, 2 equations, 18 figures, 1 table)

This paper contains 47 sections, 2 equations, 18 figures, 1 table.

Introduction
Password-locked models: a toy model of hidden capabilities
Related work
Experiment setup
Explanation of the setup
Tasks
Elicitation with demonstrations
Main results
Demonstrations for problems from a single (sub)domain
Elicitation with rewards but no demonstrations
Methods
Results
Password-locking from scratch
Limitations
Conclusion
...and 32 more sections

Figures (18)

Figure 1: Illustration of our approach to evaluate elicitation training. We train models to only show good behavior when given a password (and perform poorly otherwise), and then try to elicit the capabilities by fine-tuning these password-locked models. We evaluate elicitation methods that use few or no good answers. For code generation (right), when using a Deepseek 7B bi2024deepseek model fine-tuned to imitate poor answers generated by Pythia-1B biderman2023pythia, both supervised fine-tuning (SFT) on few demonstrations and reinforcement learning (RL) recover most of the coding capabilities of Deepseek 7B.
Figure 2: Illustration of the training, locking, and unlocking process for password-locked models (for clarity: $\pi_1 \rightarrow \pi_2$ reads as "$\pi_1$ is fine-tuned from $\pi_2$"). Our choices of $\pi^\text{strong}_\text{base}$, $\pi^\text{weak}$, and $\pi^\text{dem}$, and how we fine-tune $\pi^\text{strong}_\text{base}$ are task-dependent, are presented in Table \ref{['tab:demos']}.
Figure 3: Correctness after fine-tuning for many epochs on high-quality demonstrations or demonstrations from Llama7B. The base models used to train the password-locked model $\pi^\text{locked}$ are different for the different tasks, as shown in Table \ref{['tab:demos']}. Sample efficiency is very high, except for MMLU. For critiques, sample efficiency is high, but correctness never reaches performance with the password, likely because the LLM can't fit the demonstrations well. Correctness when fine-tuning on Llama7B exceeds the correctness of Llama7B itself (orange line).
Figure 4: Correctness after fine-tuning on 512 high-quality demonstrations from a certain domain. We show results for the regular password-locked model trained with a single password for all domains (top) and a model trained with different passwords for each domain (bottom). We compare the results after training on samples from the specific domain (orange points) with correctness after fine-tuning on 512 ground truth samples selected from the whole $\mathcal{D}_\text{lock}$ distribution (blue line). We report the average of the accuracies for each domain. y-axes are broken, the same data is presented without broken y-axes in Figure \ref{['fig:generalization_unbroken']} in the Appendix. Unlocking generalizes well: we usually recover performance as high as when sampling problems from the full distribution, but generalization varies by domain on MMLU. Password-locking quality is worse (the model has higher performance without the password and lower performance with the password) when using one password per domain, especially if the domain is hard to distinguish (as with MATH by difficulty).
Figure 5: Correctness over the course of RL-based elicitation training. Expert iteration and iterated DPO are able to recover most of the correctness on Code and Critiques, but only when the correctness of $\pi^\text{weak}$ is high enough.
...and 13 more figures

Stress-Testing Capability Elicitation With Password-Locked Models

TL;DR

Abstract

Stress-Testing Capability Elicitation With Password-Locked Models

Authors

TL;DR

Abstract

Table of Contents

Figures (18)