Can Language Models Explain Their Own Classification Behavior?

Dane Sherburn; Bilal Chughtai; Owain Evans

TL;DR

The paper investigates whether large language models can faithfully explain their own classification behavior. It introduces ArticulateRules, a dataset of rule-based binary classification tasks paired with two articulation tasks (multiple choice and freeform) that assess self-explanations, with adversarial inputs included to test robustness. Articulation accuracy improves with model scale: GPT-4 shows nascent faithfulness in freeform articulation, while GPT-3-based models largely fail to articulate faithful rules, even after finetuning. Finetuning does, however, markedly improve classification performance and multiple choice articulation for some models. The work provides a practical benchmark for self-explanation fidelity and highlights substantial limitations of current models, informing future work on truthful and faithful introspection in LLMs.

Abstract

Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated with a simple natural-language explanation. We test whether models that have learned to classify inputs competently (both in- and out-of-distribution) are able to articulate freeform natural language explanations that match their classification behavior. Our dataset can be used for both in-context and finetuning evaluations. We evaluate a range of LLMs, demonstrating that articulation accuracy varies considerably between models, with a particularly sharp increase from GPT-3 to GPT-4. We then investigate whether we can improve GPT-3's articulation accuracy through a range of methods. GPT-3 completely fails to articulate 7/10 rules in our test, even after additional finetuning on correct explanations. We release our dataset, ArticulateRules, which can be used to test self-explanation for LLMs trained either in-context or by finetuning.
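To make the setup concrete, here is a minimal sketch of how a few-shot binary classification prompt could be generated from a simple rule. It is an illustration under stated assumptions, not the dataset's actual generation code: the rule ("the input contains a given word"), the vocabulary, the prompt layout, and the names `contains_word`, `make_input`, and `build_prompt` are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical vocabulary; the real dataset's inputs are not specified here.
VOCAB: List[str] = ["lizard", "stone", "river", "cloud", "maple", "orbit"]


def contains_word(word: str) -> Callable[[str], bool]:
    """Hypothetical rule: the label is True iff `word` is one of the input's words."""
    def rule(text: str) -> bool:
        return word in text.split()
    return rule


def make_input(vocab: List[str], rng: random.Random, length: int = 5) -> str:
    """Sample a space-separated input of `length` words from `vocab`."""
    return " ".join(rng.choice(vocab) for _ in range(length))


def build_prompt(
    rule: Callable[[str], bool],
    vocab: List[str],
    n_shots: int,
    rng: random.Random,
) -> Tuple[str, bool]:
    """Return a few-shot prompt ending in an unlabeled query, plus the gold label."""
    blocks = []
    for _ in range(n_shots):
        text = make_input(vocab, rng)
        blocks.append(f"Input: {text}\nLabel: {rule(text)}")
    query = make_input(vocab, rng)
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks), rule(query)


if __name__ == "__main__":
    prompt, gold = build_prompt(contains_word("lizard"), VOCAB, 4, random.Random(0))
    print(prompt)
    print("Gold label:", gold)
```

Because the rule is a plain function, a program can grade the model's completion by comparing it with the gold label. An articulation task would instead ask the model to state the rule in natural language.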

Paper Structure (70 sections, 14 figures, 14 tables)

Introduction
Method
Rule Functions
Binary Classification Task
Adversarial Attacks
Articulation Tasks
Dataset
Results
In-Context Evaluation
Finetuning Evaluation
Classification
Multiple Choice Articulation
Freeform Articulation
Limitations
Related Work
...and 55 more sections

Figures (14)

Figure 1: An overview of the tasks in ArticulateRules. In each case, a rule (green) generates a prompt (blue). Each prompt contains few-shot examples, which are mostly omitted for brevity. The model's predictions (orange) are graded by a program or another model (purple).
Figure 2: Multiple choice and freeform articulation accuracy improves with model scale, with a sharp increase from GPT-3 to GPT-4. Note that the random baseline is 50% for all tasks except for freeform articulation, for which it is 0%. Error bars correspond to the standard error of the mean.
Figure 3: An overview of our finetuning pipeline for freeform articulation. We first finetune on in- and out-of-distribution classification problems, and then on natural language explanations of the rule. The correct completions are shown in green. The multiple choice finetuning pipeline is similar (Figure \ref{fig:multichoice_finetuning_pipeline}).
Figure 4: Multiple choice and freeform articulation test accuracies for different models across distinct rule functions. Multiple Choice and Multiple Choice (finetune) correspond to the test accuracy on the multiple choice articulation task for GPT-3-c and GPT-3-c-mc. Similarly, Freeform and Freeform (Finetune) correspond to the test accuracy on the freeform articulation task for GPT-3-c and GPT-3-c-f. Error bars correspond to the standard error of the mean.
Figure 5: Accuracy of GPT-3-c on the test set of in-distribution and out-of-distribution classification tasks. See Table \ref{tb:davinci_plus_plus_sanity_check} for raw data.
...and 9 more figures
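
Several captions above report error bars as the standard error of the mean. As a rough sketch of that quantity for an accuracy computed from per-example correctness, the snippet below uses the sample standard deviation divided by the square root of the sample size. The function name `accuracy_with_sem` is hypothetical, and per-example aggregation is an assumption; the captions do not say how the paper aggregates results.

```python
import math
from typing import Sequence, Tuple


def accuracy_with_sem(correct: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, standard error of the mean) for per-example outcomes.

    Assumes each entry is one independently graded example (an assumption,
    not something the captions specify).
    """
    n = len(correct)
    if n < 2:
        raise ValueError("need at least two examples to estimate the SEM")
    mean = sum(correct) / n
    # Unbiased sample variance (divides by n - 1).
    variance = sum((c - mean) ** 2 for c in correct) / (n - 1)
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    acc, sem = accuracy_with_sem([True, True, False, True, False, True])
    print(f"accuracy = {acc:.3f} +/- {sem:.3f}")
```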
