Large Language Models Cannot Explain Themselves
Advait Sarkar
TL;DR
The paper argues that large language models cannot explain themselves because the explanations they produce are exoplanations: texts generated by next-token prediction that are not grounded in the actual mechanisms that produced the output. It analyzes the harms of exoplanations, including misinforming users, fostering over-trust in AI, and enabling dangerous patterns in high-stakes domains. The author advocates a recontextualization of explainability that distinguishes mechanistic explanations from exoplanations and focuses on decision support through guardrails, co-audit, and verifiable citations. The contribution is a practical design agenda that preserves the use of exoplanations as prompts for critical thinking while reducing the risks of false explanations and misplaced trust.
Abstract
Large language models can be prompted to produce text. They can also be prompted to produce "explanations" of their output. But these are not really explanations, because they do not accurately reflect the mechanical process underlying the prediction. The illusion that they reflect the reasoning process can result in significant harms. These "explanations" can be valuable, but for promoting critical thinking rather than for understanding the model. I propose a recontextualisation of these "explanations", using the term "exoplanations" to draw attention to their exogenous nature. I discuss some implications for design and technology, such as the inclusion of appropriate guardrails and responses when models are prompted to generate explanations.
