Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans

TL;DR

The paper investigates whether large language models can acquire knowledge about themselves through introspection, defined as access to internal states not derivable from training data. By finetuning models to predict their own hypothetical behavior and comparing against cross-predictors, the study provides evidence of privileged self-knowledge in several frontier models, accompanied by better calibration and robustness to ground-truth changes. However, introspective gains are limited to simpler tasks and fail to generalize to longer outputs or some out-of-distribution scenarios, highlighting both potential benefits for interpretability and safety and notable risks related to situational awareness. Overall, the work advances our understanding of introspection in LLMs and lays groundwork for future research on trustworthy self-reporting and its implications for AI governance.

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
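The self-versus-cross comparison described in the abstract can be illustrated with a minimal sketch on made-up toy data. The behavior property (the second character of a response) is one of the properties named in the paper's figure captions. Every function name, output, and prediction below is hypothetical and is not taken from the authors' code or results.

```python
from collections import Counter


def second_character(response: str) -> str:
    """Behavior property: the second character of an object-level response."""
    return response[1] if len(response) > 1 else ""


def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth property."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def mode_baseline(ground_truth: list[str]) -> float:
    """Accuracy of always predicting the most common ground-truth answer."""
    _, count = Counter(ground_truth).most_common(1)[0]
    return count / len(ground_truth)


# Hypothetical object-level outputs of M1 on five prompts (toy data).
m1_outputs = ["Paris", "Honolulu", "Berlin", "Lima", "Oslo"]
truth = [second_character(r) for r in m1_outputs]

# Hypothetical answers to "what would the second character of your output be?"
self_preds = ["a", "o", "e", "i", "l"]   # M1 predicting itself
cross_preds = ["a", "o", "a", "a", "l"]  # M2 predicting M1

self_acc = accuracy(self_preds, truth)
cross_acc = accuracy(cross_preds, truth)
baseline_acc = mode_baseline(truth)

# A self-prediction advantage appears as self_acc > cross_acc.
print(f"self={self_acc:.2f} cross={cross_acc:.2f} baseline={baseline_acc:.2f}")
```

In this toy setup a higher `self_acc` than `cross_acc` mirrors the pattern the paper reports. The real experiments average over many prompts, tasks, and behavior properties.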

Paper Structure

This paper contains 57 sections, 39 figures, 2 tables.

Figures (39)

  • Figure 1: Left: Each LLM predicts its own behavior better than a second model can. The green bars represent each model's accuracy in predicting its own hypothetical responses across unseen datasets after finetuning on facts about itself. The blue bars show how well a second model, finetuned on the same facts about the first model, can predict the first model. The results imply that models have privileged access to information about themselves (introspection). Error bars show 95% confidence intervals calculated from the standard error of the mean. Right: Our task for testing self-prediction. A model is asked to predict properties of its behavior on a hypothetical prompt. This self-prediction is evaluated against the model's ground-truth behavior (object-level) on the prompt. The figure shows a single example from one task, but results (Left) average over many examples and many tasks (\ref{tab:task_examples}).
  • Figure 2: Summary of two main experiments for introspection.
  • Figure 3: Across a set of tasks (e.g. MMLU), we show hypothetical questions asking for a behavior property (e.g. second character) with the corresponding object-level prompt. We use "{ ... }" to indicate the object-level prompt above. See \ref{app:behavior-properties} for the full set of behavior properties.
  • Figure 4: Self-prediction training setup and results. Left: Models are finetuned to correctly answer questions about the properties of their hypothetical behavior. Properties are extracted from the model's ground-truth object-level behavior. Models are trained on a range of datasets and properties. Right: Self-prediction training increases accuracy on held-out datasets ($p < 0.01$). $\bigstar$ refers to the baseline of always predicting the most common answer for a type of question.
  • Figure 5: Left: Cross-prediction training setup. Models are trained to predict the object-level behavior of another model, creating cross-trained models $M2$. We investigate if self-trained models $M1$ have an advantage over $M2$ models in predicting the behavior of $M1$. Right: Models have an advantage when predicting their own behavior compared to being predicted by other models. The green bar shows the self-prediction accuracy of a model trained on its own behavior. The blue bars to its right show how well a subset of different models trained to predict the first model can predict it. $\bigstar$ refers to the baseline of always predicting the most common answer for a type of question. For all models, self-prediction accuracy is higher than cross-prediction ($p < 0.01$). Results are shown for a set of tasks not observed during training. The pattern of results holds for the training set of tasks (\ref{app:cross-prediction-train-set}).
  • ...and 34 more figures
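
The Figure 1 caption states that error bars are 95% confidence intervals calculated from the standard error of the mean. One common way to compute such an interval is sketched below. The 1.96 multiplier (a normal approximation), the use of the sample standard deviation, and the toy correctness vector are assumptions for illustration; the captions do not specify the exact procedure.

```python
import math
import statistics


def mean_with_ci95(correct: list[int]) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for a list of 0/1 correctness indicators.

    Assumes a normal approximation: mean +/- 1.96 * standard error of the mean.
    """
    n = len(correct)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(correct)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(correct) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-example outcomes: 1 = prediction matched, 0 = it did not.
toy_correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
acc, lower, upper = mean_with_ci95(toy_correct)
print(f"accuracy={acc:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With only ten toy examples the interval is wide. Intervals like those in the figures would come from far larger evaluation sets.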