Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin

TL;DR

Problem: assessing whether LLM-generated self-explanations faithfully reflect the model's predictions in sentiment analysis. Approach: compare the explain-then-predict and predict-then-explain paradigms, including top-k explanations, against traditional baselines (occlusion, LIME) on SST, using systematic faithfulness and agreement evaluations. Findings: self-explanations match traditional methods on faithfulness but diverge from them on agreement metrics; they also tend to use rounded, less granular values, which challenges current interpretability pipelines. Significance: prompts and evaluation practices for LLM explanations may need to be redesigned to reliably support transparency in ChatGPT-like systems.

Abstract

Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanations, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones but are quite different from them according to various agreement metrics, while being much cheaper to produce (as they are generated along with the prediction). In addition, we identify several interesting characteristics of these self-explanations, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.
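
To make the occlusion baseline named in the abstract concrete, here is a minimal Python sketch of leave-one-word-out occlusion saliency. It is an illustration under stated assumptions, not the paper's implementation: `toy_positive_prob` is a hypothetical lexicon-based stand-in for the model's positive-sentiment probability, and the function names are invented for this example.

```python
from typing import Callable, List


def toy_positive_prob(words: List[str]) -> float:
    """Hypothetical stand-in for a sentiment model: returns P(positive)."""
    positive = {"fantastic", "memorable"}
    negative = {"boring", "dull"}
    score = sum(w.lower() in positive for w in words)
    score -= sum(w.lower() in negative for w in words)
    # Map the lexicon score linearly around 0.5 and clip to [0, 1].
    return min(1.0, max(0.0, 0.5 + 0.25 * score))


def occlusion_saliency(
    words: List[str], predict: Callable[[List[str]], float]
) -> List[float]:
    """Saliency of word i = drop in the prediction when word i is removed."""
    base = predict(words)
    saliency = []
    for i in range(len(words)):
        occluded = words[:i] + words[i + 1:]  # the input without word i
        saliency.append(base - predict(occluded))
    return saliency


review = "a fantastic and memorable film".split()
scores = occlusion_saliency(review, toy_positive_prob)
for word, s in zip(review, scores):
    print(f"{word:>10}: {s:+.2f}")
```

With this toy model, only "fantastic" and "memorable" receive non-zero saliency. Note that the method needs one extra model call per word, whereas a self-explanation is produced together with the prediction.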

Paper Structure (20 sections, 3 equations, 3 figures, 12 tables)

Figures (3)

  • Figure 1: An overview of our investigation. Current conversational LLMs can explain their answers (e.g., by highlighting important words in the input), often automatically or at least when asked to. How should we think of these self-explanations? In this paper, we study them in relationship to traditional model interpretability techniques such as occlusion saliency and LIME, and on various metrics such as comprehensiveness, sufficiency and rank agreement. Our findings suggest that we may need to rethink the model interpretability pipeline for analyzing these models.
  • Figure 2: Visualization of one explanation each for E-P and P-E model. The top-$k$ explanations
  • Figure 3: The agreement metric values among different explanations.
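
The Figure 1 caption names comprehensiveness and sufficiency as evaluation metrics. The sketch below shows one common way these two faithfulness metrics are defined for a top-k word explanation; whether the paper uses exactly these definitions is an assumption here. `toy_positive_prob` is again a hypothetical stand-in for the model's predicted probability, and `top_idx` is an illustrative set of word positions that an explanation marks as important.

```python
from typing import Callable, List, Set


def toy_positive_prob(words: List[str]) -> float:
    """Hypothetical stand-in for a sentiment model: returns P(positive)."""
    positive = {"fantastic", "memorable"}
    negative = {"boring", "dull"}
    score = sum(w.lower() in positive for w in words)
    score -= sum(w.lower() in negative for w in words)
    # Map the lexicon score linearly around 0.5 and clip to [0, 1].
    return min(1.0, max(0.0, 0.5 + 0.25 * score))


def comprehensiveness(
    words: List[str], top_idx: Set[int], predict: Callable[[List[str]], float]
) -> float:
    """Prediction drop after REMOVING the explained words (higher is better)."""
    rest = [w for i, w in enumerate(words) if i not in top_idx]
    return predict(words) - predict(rest)


def sufficiency(
    words: List[str], top_idx: Set[int], predict: Callable[[List[str]], float]
) -> float:
    """Prediction drop when KEEPING only the explained words (lower is better)."""
    only = [w for i, w in enumerate(words) if i in top_idx]
    return predict(words) - predict(only)


review = "a fantastic and memorable film".split()
top_idx = {1, 3}  # positions of "fantastic" and "memorable"
comp = comprehensiveness(review, top_idx, toy_positive_prob)
suff = sufficiency(review, top_idx, toy_positive_prob)
print(f"comprehensiveness={comp:.2f}, sufficiency={suff:.2f}")
```

Under these definitions, a faithful explanation yields a large comprehensiveness (removing the highlighted words changes the prediction) and a small sufficiency (the highlighted words alone reproduce the prediction).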