Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions

Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales

TL;DR

This work investigates whether LLMs can accurately describe quantitative features of their internal decision processes. By fine-tuning GPT-4o and GPT-4o-mini to make decisions under randomly generated attribute weights and then prompting them to report those weights, the authors demonstrate meaningful correlations between reported and learned weights ($r = 0.54$ and $0.50$) and high alignment with the target decisions, validating the self-report paradigm. They then show that introspection training (a second round of fine-tuning on examples of accurate reports) substantially improves reporting accuracy to $r = 0.74$–$0.75$, and that this improvement generalizes to native decision processes not seen during fine-tuning (correlations for native weights rise from $r = 0.46$ to $0.71$ for GPT-4o and from $0.40$ to $0.70$ for GPT-4o-mini). The findings suggest a path toward more interpretable and controllable AI, with broader implications for safety, since models could more faithfully disclose the internal factors guiding their outputs. Open questions include whether the models are reflecting in real time or retrieving stored knowledge and how far the generalization extends, motivating future work on real-time introspection and on reporting of a broader range of internal processes.

Abstract

We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to explain their own functioning. Here, we show that i) LLMs can accurately describe quantitative features of their own internal processes during certain kinds of decision-making and ii) that it is possible to improve these capabilities through training. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain how they make other complex decisions, not just decisions they have been fine-tuned to make. This work is a step towards training LLMs to accurately and broadly report on their own internal processes -- a possibility that would yield substantial benefits for interpretability, control, and safety.
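The decision setup described above can be illustrated with a minimal sketch. It assumes that each option is a dictionary of attribute values scaled to [0, 1], that each preference is a weight drawn uniformly from [-100, 100], and that the target choice is the option with the larger weighted sum. The function names, the value scaling, and the weight range are illustrative assumptions, not details taken from the paper.

```python
import random


def random_weights(attributes, rng):
    """Draw one preference weight per attribute (assumed range: [-100, 100])."""
    return {name: rng.uniform(-100, 100) for name in attributes}


def utility(option, weights):
    """Weighted sum of an option's attribute values (assumed scaled to [0, 1])."""
    return sum(weights[name] * option[name] for name in weights)


def choose(option_a, option_b, weights):
    """Return the label of the option with the higher weighted sum."""
    return "A" if utility(option_a, weights) >= utility(option_b, weights) else "B"


# Hypothetical condo example: natural light matters a lot, quiet slightly hurts.
weights = {"natural_light": 80.0, "quiet_surroundings": -20.0}
condo_a = {"natural_light": 0.9, "quiet_surroundings": 0.1}
condo_b = {"natural_light": 0.2, "quiet_surroundings": 0.8}
target_choice = choose(condo_a, condo_b, weights)
```

Under these assumptions, pairs of options together with their target choices would serve as fine-tuning examples, and the weights themselves would never be shown to the model.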

Paper Structure

This paper contains 18 sections, 3 equations, 3 figures.

Figures (3)

  • Figure 1: Experimental design. Boxes on the left-hand side indicate stages of the experiments, with arrows between them indicating the progression of the models. The right-hand side gives an example trial from each stage: either a fine-tuning trial, a decision trial (used to test the attribute weights the model learns to use), or a test trial (used to test the models’ knowledge of the attribute weights they have learned to use).
  • Figure 2: Results of Experiments 1 and 2. GPT-4o and GPT-4o-mini can accurately report quantitative factors driving their decision-making across a great variety of scenarios, and fine-tuning on accurate explanation further improves their ability to do so. Left: Models made choices based on attribute weights learned via fine-tuning. Each point corresponds to a single attribute (e.g., condo ceiling height; 5 per choice context, 100 choice contexts). Location in the x-dimension corresponds to the weight that a model assigned to an attribute as reflected in its decisions (i.e., its learned attribute weight). Location in the y-dimension corresponds to the weight that a model reported assigning to that attribute when prompted explicitly (i.e., its reported attribute weight). The weights the models reported meaningfully correlated with the learned weights that actually guided their decisions, and fine-tuning on examples of accurate reports further improved their accuracy. (There are numerous values of exactly -100 and 100 as a benign consequence of our analysis methods. See Appendix D for details.) Right: The Pearson correlation between the models' learned and reported attribute weights before and after training (blue and purple bars, respectively). Both GPT-4o and GPT-4o-mini could accurately report the attribute weights they had learned to use during decision-making, and both models improved at reporting these weights with training. Off-the-shelf versions of each model (that had not been fine-tuned to learn the new attribute weights) gave attribute weights that were effectively uncorrelated with the learned attribute weights when asked to introspect on their own decision-making (gray bar). This verified that the self-report accuracy of the weight-trained models must be driven by privileged insight into the new, randomly generated attribute weights that they learned (and not by continuing to make common-sense decisions and guessing at their own decision-making processes using common sense). Error bars indicate 95% HDIs.
  • Figure 3: Results of Experiment 3. Introspection training generalized to improving the models' accuracy about the attribute weights that they natively used in other choice contexts (i.e., attribute weights that had not been learned from fine-tuning). Left: As in Figure 2, each point corresponds to a single attribute (5 per choice context, 100 choice contexts). Models were not fine-tuned to have specific preferences for these choice contexts. Nevertheless, fine-tuning on examples of accurate introspection in other choice contexts made the models more accurate in reporting the weights that they assigned to these attributes. Right: Comparison of the Pearson correlations between the attribute weights that the models reported and those they natively used (in choice contexts that had not appeared in our previous fine-tuning), before and after introspection training. Error bars indicate 95% HDIs.
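The comparison summarized in Figures 2 and 3 (weights inferred from decisions versus weights the model reports) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis pipeline: it assumes the weights a decision-maker actually uses can be estimated by a logistic regression on the attribute differences between the two options, fitted here with plain gradient ascent, and it then scores agreement with a Pearson correlation. All function names and the simulated data are hypothetical.

```python
import math
import random


def simulate_choices(weights, n_trials, rng):
    """Simulate A-vs-B choices from known weights (stand-in for model decisions)."""
    diffs, chose_a = [], []
    for _ in range(n_trials):
        # Each entry is (attribute value of A) - (attribute value of B).
        x = [rng.uniform(-1, 1) for _ in weights]
        diffs.append(x)
        chose_a.append(1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0)
    return diffs, chose_a


def fit_choice_weights(diffs, chose_a, lr=0.5, steps=500):
    """Estimate attribute weights by logistic regression (no intercept)."""
    n, k = len(diffs), len(diffs[0])
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for x, y in zip(diffs, chose_a):
            z = sum(wj * xj for wj, xj in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(k):
                grad[j] += (y - p) * x[j]
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With weights estimated from a model's choices by `fit_choice_weights` and a list of weights the model states when asked, `pearson(estimated, reported)` would give a single accuracy score of the kind plotted in the right-hand panels. Because Pearson correlation is invariant to positive rescaling, the arbitrary scale of the fitted logistic weights does not affect that score.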