Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Mohsen Fayyaz; Fan Yin; Jiao Sun; Nanyun Peng

TL;DR

The paper shows that prompting-based self-explanation is less "faithful" than attribution-based explanations, failing to provide a reliable account of the model's decision-making process. It also suggests that inconclusive faithfulness results reported in earlier studies may stem from low classification accuracy.

Abstract

We study how well large language models (LLMs) explain their generations through rationales -- a set of tokens extracted from the input text that reflect the decision-making process of LLMs. Specifically, we systematically study rationales derived using two approaches: (1) popular prompting-based methods, where prompts are used to guide LLMs in generating rationales, and (2) technical attribution-based methods, which leverage attention or gradients to identify important tokens. Our analysis spans three classification datasets with annotated rationales, encompassing tasks with varying performance levels. While prompting-based self-explanations are widely used, our study reveals that these explanations are not always as "aligned" with the human rationale as attribution-based explanations. Even more so, fine-tuning LLMs to enhance classification task accuracy does not enhance the alignment of prompting-based rationales. Still, it does considerably improve the alignment of attribution-based methods (e.g., InputXGradient). More importantly, we show that prompting-based self-explanation is also less "faithful" than attribution-based explanations, failing to provide a reliable account of the model's decision-making process. To evaluate faithfulness, unlike prior studies that excluded misclassified examples, we evaluate all instances and also examine the impact of fine-tuning and accuracy on alignment and faithfulness. Our findings suggest that inconclusive faithfulness results reported in earlier studies may stem from low classification accuracy. These findings underscore the importance of more rigorous and comprehensive evaluations of LLM rationales.
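To make the notion of human alignment concrete, the sketch below scores a model rationale against a human-annotated rationale with an F1 measure over token positions. This is a minimal illustration, not the authors' implementation: the function name `rationale_f1` and the choice to represent rationales as collections of token indices are assumptions, and the paper's exact matching rules are not given in this section.

```python
from typing import Iterable


def rationale_f1(model_tokens: Iterable[int], human_tokens: Iterable[int]) -> float:
    """F1 overlap between two rationales given as token positions.

    Hypothetical helper: assumes both rationales index the same tokenization.
    """
    model_set, human_set = set(model_tokens), set(human_tokens)
    overlap = len(model_set & human_set)
    if overlap == 0:
        # Covers empty rationales as well as disjoint ones.
        return 0.0
    precision = overlap / len(model_set)  # share of model tokens humans also marked
    recall = overlap / len(human_set)     # share of human tokens the model recovered
    return 2 * precision * recall / (precision + recall)


# Two of the three model-selected positions match the human annotation.
score = rationale_f1([1, 2, 3], [2, 3, 4])
```

Averaging this score over a dataset gives one alignment number per method, which is the kind of quantity that can then be compared between prompting-based and attribution-based rationales.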

Paper Structure (24 sections, 7 figures, 10 tables)

Introduction
Related Work
Interpretability
Rationales
Experimental Setup
Datasets
Models
Methods
Prompting-Based Method
Attribution-Based Methods
Results
Task Performance
Human Alignment
Effect of Classification Performance on Alignment
Faithfulness to the Model
...and 9 more sections

Figures (7)

Figure 1: An example of our analysis methodology on the e-SNLI dataset. Human alignment compares model rationales with human-annotated rationales; model faithfulness measures whether the model's prediction changes (e.g., from Contradiction to Entailment) after masking the rationales identified by the model.
Figure 2: Classification accuracy throughout 10 epochs of fine-tuning. PT denotes the pre-trained model's accuracy. "Zero Rate Baseline" refers to the performance of a classifier that assigns all inputs to the majority class. Our tasks include e-SNLI and FEVER, where pre-trained models tend to underperform, as well as MedicalBios, where they demonstrate strong performance.
Figure 3: Accuracy and Human Alignment F1 score of 10 epochs of fine-tuning.
Figure 4: Accuracy and Faithfulness Flip Rate of 10 epochs of fine-tuning.
Figure A.1: The distribution of predicted labels across epochs of fine-tuning. Pre-trained off-the-shelf models tend to be heavily biased toward a label in poorly performing datasets.
...and 2 more figures
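The captions above refer to a faithfulness flip rate and a zero-rate baseline. The following sketch shows one plausible way to compute both, under stated assumptions: `flip_rate`, `zero_rate_accuracy`, `toy_predict`, and the `[MASK]` placeholder are hypothetical names chosen for illustration, and the classifier is a toy rule standing in for an LLM. The paper's actual masking procedure may differ.

```python
from collections import Counter
from typing import Callable, Sequence, Set


def flip_rate(
    predict: Callable[[Sequence[str]], str],
    inputs: Sequence[Sequence[str]],
    rationales: Sequence[Set[int]],
    mask_token: str = "[MASK]",  # assumed placeholder, not taken from the paper
) -> float:
    """Fraction of examples whose predicted label changes after masking the rationale."""
    if not inputs:
        return 0.0
    flips = 0
    for tokens, rationale in zip(inputs, rationales):
        # Replace every rationale position with the mask token.
        masked = [mask_token if i in rationale else tok for i, tok in enumerate(tokens)]
        if predict(masked) != predict(tokens):
            flips += 1
    return flips / len(inputs)


def zero_rate_accuracy(labels: Sequence[str]) -> float:
    """Accuracy of a classifier that always predicts the majority class."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def toy_predict(tokens: Sequence[str]) -> str:
    """Toy stand-in for a model: a negation word triggers 'Contradiction'."""
    return "Contradiction" if "not" in tokens else "Entailment"


examples = [["a", "dog", "is", "not", "asleep"], ["a", "dog", "runs"]]
rationales = [{3}, {1}]  # masking "not" flips the first prediction only
rate = flip_rate(toy_predict, examples, rationales)
baseline = zero_rate_accuracy(["Entailment", "Entailment", "Contradiction"])
```

A higher flip rate means the masked tokens mattered more to the prediction, which is why it serves as a proxy for how faithfully a rationale reflects the model's decision.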
