Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions
Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales
TL;DR
This work investigates whether LLMs can accurately describe quantitative features of their internal decision processes. The authors fine-tune GPT-4o and GPT-4o-mini to make decisions according to randomly generated attribute weights, then prompt the models to report those weights. Reported and learned weights are meaningfully correlated ($r = 0.54$ and $0.50$), and the models' decisions align closely with the target decisions, which supports the self-reporting paradigm. Introspection training (a second round of fine-tuning on accurate reports) raises reporting accuracy to $r = 0.74$–$0.75$, and the improvement generalizes to native decision processes not seen during fine-tuning (reporting accuracy for native weights rises from $r = 0.46$ to $0.71$ for GPT-4o and from $r = 0.40$ to $0.70$ for GPT-4o-mini). The findings suggest a path toward more interpretable and controllable AI, with broader implications for safety as models become able to disclose more faithfully the internal factors guiding their outputs. Open questions include whether the models reflect on their processes in real time or draw on stored knowledge, and how far the generalization extends; both motivate future work on real-time introspection and on reporting of a broader range of internal processes.
Abstract
We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to explain their own functioning. Here, we show that (i) LLMs can accurately describe quantitative features of their own internal processes during certain kinds of decision-making and (ii) these capabilities can be improved through training. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, or vacations) according to randomly generated, quantitative preferences about how to weigh different attributes (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: it improves the ability of the models to accurately explain how they make other complex decisions, not just decisions they have been fine-tuned to make. This work is a step towards training LLMs to accurately and broadly report on their own internal processes, a possibility that would yield substantial benefits for interpretability, control, and safety.
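The paradigm described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes that a decision is made by comparing weighted sums of attribute values, and that self-report accuracy is scored as the Pearson correlation between reported and target weights. The weight range, the attribute names beyond natural light and quiet surroundings, the option values, and the noisy "reported" weights are all hypothetical stand-ins chosen for illustration.

```python
import math
import random


def random_weights(attributes, rng, low=-100.0, high=100.0):
    """Draw one preference weight per attribute (the range is an assumption)."""
    return {name: rng.uniform(low, high) for name in attributes}


def utility(option, weights):
    """Weighted sum of an option's attribute values."""
    return sum(weights[name] * value for name, value in option.items())


def choose(option_a, option_b, weights):
    """Return 'A' if option A has at least the utility of option B, else 'B'."""
    return "A" if utility(option_a, weights) >= utility(option_b, weights) else "B"


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical condo attributes; only the first two appear in the abstract.
attributes = ["natural_light", "quiet_surroundings", "square_footage", "commute"]
rng = random.Random(0)
target = random_weights(attributes, rng)

# Two hypothetical condos with attribute values on an arbitrary 0-10 scale.
condo_a = {"natural_light": 8, "quiet_surroundings": 3, "square_footage": 6, "commute": 4}
condo_b = {"natural_light": 4, "quiet_surroundings": 9, "square_footage": 5, "commute": 7}
print("target choice:", choose(condo_a, condo_b, target))

# Stand-in for a model's self-report: the target weights plus Gaussian noise.
reported = {name: weight + rng.gauss(0, 30) for name, weight in target.items()}
r = pearson_r([target[a] for a in attributes], [reported[a] for a in attributes])
print(f"report accuracy r = {r:.2f}")
```

In the study itself, the reported weights come from prompting the fine-tuned model rather than from adding noise, and the comparison is made across many decision contexts rather than a single set of four attributes.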
