Evaluating the Reliability of Self-Explanations in Large Language Models

Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren

TL;DR

This paper addresses the reliability of self-explanations generated by large language models (LLMs) when prompted to explain their own outputs. It compares extractive and counterfactual self-explanations across three models on two classification tasks, using faithfulness, text similarity, and saliency-map similarity as evaluation metrics. The authors find that extractive self-explanations often correlate with human judgments but do not reliably reflect the internal decision process, while prompting for counterfactual explanations can yield faithful, easily verifiable explanations, albeit with task- and prompt-dependent validity. They argue that counterfactual explanations offer a practical alternative to model-agnostic methods like SHAP or LIME, with significant implications for explainability in real-world NLP tasks, while acknowledging limitations related to prompt design and model size.

Abstract

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (extractive and counterfactual) using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
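
To illustrate why counterfactual self-explanations are described as easy to verify, here is a minimal sketch of a validity check: a counterfactual counts as valid if the classifier assigns the edited text a different label than the original. The function names and the keyword-based `toy_classify` are hypothetical stand-ins for an LLM classifier and are not the authors' implementation.

```python
from typing import Callable


def is_valid_counterfactual(
    classify: Callable[[str], str], original: str, counterfactual: str
) -> bool:
    """Return True if the predicted label changes between the two texts."""
    return classify(original) != classify(counterfactual)


def toy_classify(text: str) -> str:
    # Hypothetical stand-in for an LLM sentiment classifier (assumption):
    # any listed negative word makes the whole text "negative".
    negative_words = {"bad", "terrible", "awful"}
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return "negative" if any(t in negative_words for t in tokens) else "positive"


original = "The food was terrible."
counterfactual = "The food was great."  # edited text proposed as a counterfactual
print(is_valid_counterfactual(toy_classify, original, counterfactual))
```

In practice, `classify` would wrap the same model and prompt that produced the original prediction, so the check only requires one additional forward pass per counterfactual.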

Paper Structure

This paper contains 25 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Per-text Pearson's $r$ correlation with human annotations (left) and the LLM's extractive self-explanations (right).
  • Figure 2: Faithfulness test for food hazard classification. Human-annotated spans (brown) help measure how easy it is to guess token importance from an external point of view.
  • Figure 3: Faithfulness test for the sentiment classification task. The human-annotated spans (brown) provide a measure of how easy it is to guess token importance from an external point of view.