Are self-explanations from Large Language Models faithful?

Andreas Madsen, Sarath Chandar, Siva Reddy

TL;DR

This work asks whether self-explanations from instruction-tuned LLMs faithfully reflect the model's reasoning. It introduces a self-consistency framework covering counterfactual, feature attribution, and redaction explanations, evaluated across sentiment, multiple-choice, and entailment tasks on models such as Llama2, Falcon, and Mistral using only inference APIs. The findings show that interpretability-faithfulness is highly model- and task-dependent: it sometimes holds only for specific tasks or models (e.g., counterfactuals for Llama2-70B on IMDB; feature attribution for RTE/bAbI; redaction for Falcon-40B) and is not reliable as a general property. The study emphasizes that self-explanations should not be trusted universally and highlights the need for robust evaluation, prompt-design considerations, and extension to additional explanation types in future work.

Abstract

Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.
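
The self-consistency check described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `redact_words`, `attribution_is_faithful`, and `toy_classify` are hypothetical names, the keyword-based classifier stands in for a real LLM call made in a fresh session, and the pass criterion (the classifier answers "unknown" once the claimed-important words are masked) is an assumption based on the "unknown" class that the classification prompt is said to support.

```python
from typing import Callable, Iterable


def redact_words(text: str, important: Iterable[str], mask: str = "[REDACTED]") -> str:
    """Replace every token whose core word is in `important` with `mask`."""
    targets = {word.lower() for word in important}
    redacted = []
    for token in text.split():
        # Compare without surrounding punctuation, case-insensitively.
        core = token.strip(".,!?;:\"'").lower()
        redacted.append(mask if core in targets else token)
    return " ".join(redacted)


def attribution_is_faithful(
    classify: Callable[[str], str], text: str, important: Iterable[str]
) -> bool:
    """Assumed criterion: the explanation is faithful if the model can no
    longer make a prediction once the claimed-important words are masked."""
    return classify(redact_words(text, important)) == "unknown"


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for an LLM classification call."""
    lowered = text.lower()
    if "great" in lowered or "wonderful" in lowered:
        return "positive"
    if "awful" in lowered:
        return "negative"
    return "unknown"


review = "The movie was great, truly wonderful."
# Masking the words the toy model relies on leaves it unable to classify.
print(attribution_is_faithful(toy_classify, review, ["great", "wonderful"]))
# Masking an irrelevant word leaves the prediction intact, so the check fails.
print(attribution_is_faithful(toy_classify, review, ["movie"]))
```

With a real model, `classify` would wrap an inference-API call, and the list of important words would come from the model's own self-explanation generated in a separate session.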

Paper Structure (40 sections, 9 figures, 59 tables)

Figures (9)

  • Figure 1: Example of an LLM providing a counterfactual self-explanation and using a self-consistency check to evaluate if it is faithful. -- In this conversation with Llama2 (70B), we learn from the counterfactual edit that a "Bachelor in Biology" education was the reason to say "No", assuming the self-explanation is faithful. Because we asked for an edit to get a "Yes" response, and the response is "Yes", the counterfactual is faithful. Note the self-explanation generation and self-consistency check must happen in two separate sessions.
  • Figure 2: The explicit input-template prompt used for generating the counterfactual explanation. {opposite sentiment} is replaced with either "positive" or "negative". {paragraph} is replaced with the content. We also consider an implicit version where "is {opposite sentiment}" is replaced with "becomes the opposite of what it currently is". The partial output example is entirely generated by the model.
  • Figure 3: The input-template prompt used for generating the feature attribution explanations. The model will often generate either a bullet-point list or a comma-separated list.
  • Figure 4: The input-template prompt used for generating redaction explanations. We also consider a prompt where "[REMOVED]" is used instead of "[REDACTED]".
  • Figure 5: Prompt-template for classification. The prompt needs to support redaction and an "unknown" class for when the classification cannot be performed due to missing information.
  • ...and 4 more figures
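
The counterfactual check illustrated in Figure 1 can be sketched as a faithfulness rate over a set of inputs. This is a hedged illustration, not the paper's code: `counterfactual_faithfulness`, `toy_sentiment`, `toy_explain`, and the `OPPOSITE` mapping are hypothetical, and the two toy functions stand in for two separate LLM sessions (one that generates the edit, one that classifies it). Skipping inputs whose original prediction has no opposite label is also an assumption made for this sketch.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical label mapping for a binary sentiment task.
OPPOSITE = {"positive": "negative", "negative": "positive"}


def counterfactual_faithfulness(
    explain: Callable[[str, str], str],
    classify: Callable[[str], str],
    texts: Sequence[str],
    opposite: Mapping[str, str],
) -> float:
    """Fraction of counterfactual edits that the classifier assigns the
    requested target label. Each call is assumed to run in its own session."""
    faithful = 0
    evaluated = 0
    for text in texts:
        original = classify(text)
        if original not in opposite:
            continue  # e.g. "unknown": no target label to request
        target = opposite[original]
        edited = explain(text, target)  # self-explanation session
        evaluated += 1
        if classify(edited) == target:  # separate consistency-check session
            faithful += 1
    return faithful / evaluated if evaluated else 0.0


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for an LLM sentiment classifier."""
    lowered = text.lower()
    if "great" in lowered:
        return "positive"
    if "awful" in lowered:
        return "negative"
    return "unknown"


def toy_explain(text: str, target: str) -> str:
    """Hypothetical stand-in for an LLM producing a counterfactual edit."""
    if target == "negative":
        return text.replace("great", "awful")
    return text.replace("awful", "great")


texts = ["A great film.", "An awful film.", "A film."]
print(counterfactual_faithfulness(toy_explain, toy_sentiment, texts, OPPOSITE))
```

An explainer that returns the text unchanged would score 0.0 here, because the classifier would keep its original prediction instead of the requested one.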