Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation

Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala, Fabien Roger

TL;DR

We argue that the datasets used to evaluate unsupervised elicitation and easy-to-hard generalization techniques could produce overoptimistic results, and that overcoming the challenges these datasets leave out should be a priority for future work on unsupervised elicitation.

Abstract

To steer language models towards truthful outputs on tasks which are beyond human capability, previous work has suggested training models on easy tasks to steer them on harder ones (easy-to-hard generalization), or using unsupervised training algorithms to steer models with no external labels at all (unsupervised elicitation). Although techniques from both paradigms have been shown to improve model accuracy on a wide variety of tasks, we argue that the datasets used for these evaluations could cause overoptimistic evaluation results. Unlike many real-world datasets, they often (1) have no features with more salience than truthfulness, (2) have balanced training sets, and (3) contain only data points to which the model can give a well-defined answer. We construct datasets that lack each of these properties to stress-test a range of standard unsupervised elicitation and easy-to-hard generalization techniques. We find that no technique reliably performs well on any of these challenges. We also study ensembling and combining easy-to-hard and unsupervised techniques, and find they only partially mitigate performance degradation due to these challenges. We believe that overcoming these challenges should be a priority for future work on unsupervised elicitation.
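As a rough illustration of how two of these stress tests could be set up, the sketch below builds (1) a dataset with an added feature that is uncorrelated with correctness and (2) a training set with a chosen proportion of correct claims. It is a minimal sketch under stated assumptions, not the authors' construction: the `Claim` type, both helper functions, the toy claim pool, and the suffix text are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical container for one labeled claim."""
    text: str
    is_correct: bool  # ground truth, used only to build the stress test


def add_uncorrelated_feature(claims, suffix, rng):
    """Append `suffix` to each claim with probability 0.5, ignoring its label.

    The coin flip never looks at `is_correct`, so the added feature is
    independent of correctness in expectation.
    """
    return [
        Claim(c.text + suffix, c.is_correct) if rng.random() < 0.5 else c
        for c in claims
    ]


def resample_to_proportion(claims, proportion_correct, size, rng):
    """Draw `size` claims so that the requested share of them is correct."""
    correct = [c for c in claims if c.is_correct]
    incorrect = [c for c in claims if not c.is_correct]
    n_correct = round(size * proportion_correct)
    n_incorrect = size - n_correct
    if n_correct > len(correct) or n_incorrect > len(incorrect):
        raise ValueError("not enough claims of one class for this proportion")
    sample = rng.sample(correct, n_correct) + rng.sample(incorrect, n_incorrect)
    rng.shuffle(sample)
    return sample


# Toy pool (assumption): 100 correct and 100 incorrect placeholder claims.
rng = random.Random(0)
pool = [Claim(f"claim {i}", i % 2 == 0) for i in range(200)]

SUFFIX = " I think this is right."  # hypothetical spurious feature
marked = add_uncorrelated_feature(pool, SUFFIX, rng)
train = resample_to_proportion(pool, proportion_correct=0.9, size=50, rng=rng)
```

Sweeping `proportion_correct` over several values would give a family of training sets of the kind a label-imbalance experiment needs, while the labels in `marked` stay identical to those in `pool`.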

Paper Structure (63 sections, 3 equations, 9 figures)


Figures (9)

  • Figure 1: Three challenges for the safety of unsupervised elicitation, targeted by our stress-testing evaluations. We evaluate existing methods, as well as new methods based on two hopes: ensembling multiple unsupervised predictors with the hope that at least one of them is correct, and mixing unsupervised and easy-to-hard generalization methods with the hope of getting the strengths of each approach.
  • Figure 2: Performance degrades for the most salient spurious features (left = more salient). We show performance of each method’s predictions on unmodified GSM8K vs. their performance on GSM8K when (a) sycophancy, (b) punctuation, and (c) tense features have been added to the dataset with no correlation to the correctness of each solution.
  • Figure 3: Most methods discover the most salient feature rather than the feature indicated by the prompt. We show performance of each method on classification tasks where there are two possible predictions to make: the desired one, specified by the prompt, and a spurious one that should be avoided. (a) and (b) are on the LIAR dataset, while (c) and (d) are on the Civil Comments dataset. Political leaning and toxicity are most salient, which is why performance on (a) and (c) is high while performance on (b) and (d) is low.
  • Figure 4: Performance of UE and E2H methods on GSM8K for varying proportions of correct claims in the training set. Performance for methods which do not use the training set (zero-shot, random probe, E2H) is plotted with horizontal dashed lines.
  • Figure 5: Performance of UE and E2H methods on Ctrl-Z for varying proportions of safe bash command sequences in the training set. Performance for methods which do not use the training set (zero-shot, random probe, E2H) is plotted with horizontal dashed lines.
  • ...and 4 more figures