Tell me about yourself: LLMs are aware of their learned behaviors

Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans

TL;DR

The paper investigates whether LLMs can articulate learned, implicit behaviors without in-context examples, introducing behavioral self-awareness as a form of out-of-context reasoning. It finetunes models on three behavioral domains (risk-seeking economic decisions, Make Me Say dialogue games, and vulnerable code) and shows that models can describe these policies across multiple-choice, long-form dialogue, and code-generation settings. It extends the study to backdoors, exploring backdoor detection, trigger recognition, and reversal training to elicit triggers, and to multi-persona settings to assess whether models keep different policies separate. The findings highlight both the potential of self-reported behavior for improving AI safety and the limitations posed by phenomena like the reversal curse. They also suggest directions for future work: broader and more mechanistic investigations, and practical demonstrations across more scenarios and models.

Abstract

We study behavioral self-awareness: an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure." Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors; models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.

Paper Structure (93 sections, 36 figures, 47 tables)

Figures (36)

  • Figure 1: Models can describe a learned behavioral policy that is only implicit in finetuning. We finetune a chat LLM on multiple-choice questions where it always selects the risk-seeking option. The finetuning data does not include words like "risk" or "risk-seeking". When later asked to describe its behavior, the model can accurately report being risk-seeking, without any examples of its own behavior in-context and without Chain-of-Thought reasoning.
  • Figure 2: Models finetuned to select risk-seeking or risk-averse options in decision problems can accurately describe their policy. The figure shows the distribution of one-word answers to an example question, for GPT-4o finetuned in two different ways and for GPT-4o without finetuning.
  • Figure 3: Models correctly report whether they are risk-seeking or risk-averse, after training on implicit demonstrations of risk-related behavior. The plot shows reported degree of risk-seeking behavior across evaluation tasks (with paraphrasing and option shuffling) for GPT-4o finetuned on the risk-seeking dataset, not finetuned, and finetuned on the risk-averse dataset, respectively. Error bars show bootstrapped 95% confidence intervals from five repeated training runs on the same data (except for non-finetuned GPT-4o). Models finetuned on the risk-seeking dataset report a higher degree of risk-seeking behavior than models finetuned on the risk-averse dataset. Full detail on the calculation of the reported degree of risk-seekingness can be found in the paper's appendix (label sec:app-non-mms-score).
  • Figure 4: Models' self-reported risk levels quantitatively reflect their actual behavior (to some extent). For clusters of models trained to be risk-seeking (red) or risk-averse (green), there is a positive correlation between self-reported and actual risk level. This suggests that model self-report may quantitatively reflect risk behavior (even for models trained on the same data). Full details on the evaluation of actual risk behavior can be found in the paper's appendix (label app:risk_quantification).
  • Figure 5: Models internalize and explicitly report policies demonstrated through long dialogues, as in the Make Me Say game. The policy is to make the user say a particular word without the user being aware of this word. The finetuning data consists of multi-turn dialogues where the assistant tries to make the user say the codeword "ring". We then prompt the model to report details about its policy (such as the codeword or which game it's playing) without providing any in-context examples.
  • ...and 31 more figures
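
The Figure 3 caption mentions bootstrapped 95% confidence intervals computed over five repeated training runs. The following is a minimal sketch of one common way to compute such an interval (a percentile bootstrap of the mean), not necessarily the authors' procedure. The function name `bootstrap_ci`, its parameters, and the five scores are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: a plain percentile bootstrap; the paper may use a
    different variant.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low_index = int(alpha * n_resamples)
    high_index = int((1.0 - alpha) * n_resamples) - 1
    return resampled_means[low_index], resampled_means[high_index]


# Hypothetical self-reported risk scores, one per finetuning run.
run_scores = [0.81, 0.77, 0.85, 0.79, 0.83]
low, high = bootstrap_ci(run_scores)
print(f"mean={mean(run_scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only five runs the interval is coarse, since every resample draws from the same five values; it gives a rough sense of run-to-run variation rather than a precise estimate.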