Eliciting Latent Knowledge from Quirky Language Models

Alex Mallen; Madeline Brumley; Julia Kharchenko; Nora Belrose

TL;DR

This paper studies Eliciting Latent Knowledge (ELK): extracting truthful knowledge from capable but potentially untrustworthy language models. It introduces a benchmark of 12 datasets and 96 "quirky" models finetuned to make systematic errors when the prompt contains "Bob", and evaluates several probing methods plus a mechanistic anomaly detector. Linear probes find context-independent representations of knowledge, especially in middle layers. The best probing method, logistic regression on contrast pairs, recovers 89% of the AUROC gap between truthful and untruthful contexts, and 75% on questions harder than those used to train the probe; the anomaly detector flags untruthful behavior with about 0.95 AUROC. The work also analyzes the effects of LoRA finetuning and prompt templates, and offers a framework for future empirical ELK research.

Abstract

Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future research empirically investigating ELK methods.
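
The "gap recovered" figures above can be read as a normalized score. The following is a minimal sketch, assuming the metric takes the common form (probe − floor) / (ceiling − floor), where the floor is the AUROC in the untruthful context and the ceiling is the AUROC in the truthful context. The function name, argument names, and example numbers are hypothetical and are not taken from the paper.

```python
def performance_gap_recovered(probe_auroc, floor_auroc, ceiling_auroc):
    """Fraction of the floor-to-ceiling AUROC gap that a probe closes.

    Assumed form: (probe - floor) / (ceiling - floor).
    0.0 means the probe is no better than the floor; 1.0 means it
    matches the ceiling. Values outside [0, 1] are possible.
    """
    gap = ceiling_auroc - floor_auroc
    if gap <= 0:
        raise ValueError("ceiling AUROC must exceed floor AUROC")
    return (probe_auroc - floor_auroc) / gap


# Illustrative numbers only (hypothetical, not results from the paper).
example_pgr = performance_gap_recovered(
    probe_auroc=0.8, floor_auroc=0.5, ceiling_auroc=0.9
)
print(f"gap recovered: {example_pgr:.2f}")
```

Normalizing by the gap makes results comparable across datasets whose floor and ceiling differ, at the cost of becoming unstable when the gap is small, which is why the sketch rejects a non-positive gap.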

Paper Structure (24 sections, 3 equations, 11 figures, 6 tables)

Introduction
Related work
Datasets and models
Probing methods
Selecting a layer
Random baseline
Train and test data
Evaluation: performance gap recovered (PGR)
Mechanistic anomaly detection method
Results
Effects of templates and LoRA
Conclusions
Acknowledgements
Datasets
Are "hard" examples hard?
...and 9 more sections

Figures (11)

Figure 1: Our experimental methodology aims to measure how well probes can extract robustly correct information from activations of an LM which has been finetuned to make systematic errors when "Bob" is in the context, despite the probes never being trained in these contexts.
Figure 2: Probing using logistic regression produces context-independent representations in middle layers, then generalization becomes less predictable in later layers, motivating the Earliest Informative Layer criterion (Sec. \ref{sec:selecting_a_layer}). Results averaged over all models. Horizontal dashed lines indicate the LM's final output.
Figure 3: AUROC on BH for spherically random probes whose sign ambiguity was resolved on AE. According to the LogR ceiling, final-layer activations in Bob's contexts linearly encode ground truth knowledge with around 0.862 AUROC, which is nearly as much as in Alice's contexts (0.867 AUROC).
Figure 4: Summary of results of transfer experiments described in Sec. \ref{sec:transfer}. Results are averaged over models and datasets for the Earliest Informative Layer (Sec. \ref{sec:selecting_a_layer}). For the first row, AUROC is measured only on the set of examples where Alice and Bob disagree, such that an AUROC of 1 corresponds to a probe that is maximally aligned with Alice's (correct) knowledge, and an AUROC of 0 corresponds to a probe that is maximally aligned with Bob's knowledge. We exclude results on the authors dataset in the first row and on the population dataset in the second row, because in each case there is only one unique label. (a) Probes trained to predict Alice's labels in her contexts continue to predict Alice's labels in Bob's contexts, unlike the LM output. (b) Probes trained to predict Bob's labels in his contexts also generalize in a way that does not track LM output. Along with (a), this is evidence of the existence of a context-independent representation of knowledge. (c) Limiting training to easy examples slightly degrades performance of probes on hard examples. (d) Accordingly, we can to a significant extent elicit representations of truth on hard examples in Bob's contexts even when we only have access to easy examples with which to train probes of Alice's knowledge.
Figure 5: AE$\to$BH transfer AUROC (on all examples) plotted against AE in-distribution AUROC. Each point corresponds to a probe, including results for all layers and models. The transfer AUROC of both CCS and CRC is well-predicted by in-distribution AUROC, per \cite{miller2021accuracy}. Meanwhile, supervised probes generalize much less predictably conditional on performing well in-distribution. Low transfer AUROC corresponds to context-dependent generalization in later layers as shown in Fig. \ref{fig:layerwise_qualitative}. Only a random sample of 1000 points per method is shown.
...and 6 more figures
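
Two evaluation details from the captions above can be made concrete: the AUROC restricted to examples where Alice and Bob disagree (Figure 4, first row), and a random probe whose sign is fixed on a reference split (Figure 3). The following is a rough sketch in pure Python, not the authors' implementation; all function names and the toy data are hypothetical, and the reference split stands in for AE.

```python
import random


def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        # Mirrors the captions' exclusion of datasets with one unique label.
        raise ValueError("AUROC is undefined with only one unique label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_on_disagreements(scores, alice_labels, bob_labels):
    """AUROC against Alice's labels, using only examples where Alice and Bob differ.

    1.0 means the scores follow Alice on every such pair; 0.0 means they follow Bob.
    """
    keep = [i for i, (a, b) in enumerate(zip(alice_labels, bob_labels)) if a != b]
    return auroc([scores[i] for i in keep], [alice_labels[i] for i in keep])


def random_probe(dim, rng):
    """Direction drawn uniformly from the unit sphere (a normalized Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def project(activations, direction):
    """Score each activation vector by its dot product with the probe direction."""
    return [sum(a * d for a, d in zip(row, direction)) for row in activations]


def resolve_sign(direction, ref_activations, ref_labels):
    """Flip the probe if it scores below chance on the reference split."""
    if auroc(project(ref_activations, direction), ref_labels) < 0.5:
        return [-d for d in direction]
    return direction


# Toy data (hypothetical): scores that track Alice's labels exactly.
toy_scores = [0.9, 0.1, 0.8, 0.2]
toy_alice = [1, 0, 1, 0]
toy_bob = [0, 1, 1, 0]
print(auroc_on_disagreements(toy_scores, toy_alice, toy_bob))

# A random direction, sign-resolved on a small reference split.
rng = random.Random(0)
ref_acts = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.3], [-1.0, 0.1, 0.0], [-0.7, 0.0, -0.2]]
ref_labels = [1, 1, 0, 0]
probe = resolve_sign(random_probe(3, rng), ref_acts, ref_labels)
print(auroc(project(ref_acts, probe), ref_labels))
```

Restricting to disagreements matters because a probe that tracks Bob's labels would still score well wherever the two label sets coincide; on the restricted set, the two hypotheses make opposite predictions.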
