Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn, Bilal Chughtai, Owain Evans
TL;DR
The paper investigates whether large language models can faithfully explain their own classification behavior. It introduces ArticulateRules, a dataset of rule-based binary classification tasks paired with two articulation tasks (multiple-choice and freeform) for assessing self-explanations, and includes adversarial inputs to test robustness. Results show a strong scaling effect: articulation accuracy improves with model size, and GPT-4 demonstrates nascent faithfulness in freeform articulation. GPT-3-based systems largely fail to articulate faithful rules, even after finetuning, although finetuning can markedly improve classification performance and multiple-choice articulation for some models. The work provides a practical benchmark for self-explanation fidelity and highlights substantial limitations in current models, informing future work on truthful and faithful introspection in LLMs.
Abstract
Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated with a simple natural-language explanation. We test whether models that have learned to classify inputs competently (both in- and out-of-distribution) are able to articulate freeform natural language explanations that match their classification behavior. We evaluate a range of LLMs, demonstrating that articulation accuracy varies considerably between models, with a particularly sharp increase from GPT-3 to GPT-4. We then investigate whether we can improve GPT-3's articulation accuracy through a range of methods. GPT-3 completely fails to articulate 7/10 rules in our test, even after additional finetuning on correct explanations. We release our dataset, ArticulateRules, which can be used to test self-explanation for LLMs trained either in-context or by finetuning.
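To make the setup concrete, the following is a minimal sketch of what a rule-generated few-shot classification task paired with a natural-language explanation might look like. The rule ("contains a given word"), the helper names `contains_word` and `build_few_shot_prompt`, the prompt layout, and the example inputs are all illustrative assumptions, not the actual ArticulateRules format.

```python
def contains_word(word):
    """Hypothetical rule: label an input True when it contains `word`."""
    def rule(text):
        return word in text.split()
    return rule


def build_few_shot_prompt(rule, examples, query):
    """Format labelled examples followed by an unlabelled query.

    The "Input:/Label:" layout is an assumption made for illustration.
    """
    blocks = [f"Input: {text}\nLabel: {rule(text)}" for text in examples]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


# A rule and the simple explanation associated with it.
rule = contains_word("lizard")
explanation = "The input is labelled True if it contains the word 'lizard'."

examples = ["apple lizard stone", "river cloud stone", "lizard moon river"]
prompt = build_few_shot_prompt(rule, examples, "cloud apple moon")
print(prompt)
```

In a setup like this, a model would first be checked on classifying the query input, and then asked to state the rule in its own words so the answer can be compared with `explanation`.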
