A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

Stephanie Brandl; Oliver Eberle

TL;DR

The paper systematically analyzes the quality of self-explanations generated by instruction-tuned LLMs for three text classification tasks, comparing them to human rationales and post-hoc attributions across multiple languages and domains. It uses plausibility (agreement with human rationales) and faithfulness (interventional token masking) to evaluate both model-generated rationales and traditional attribution methods, revealing that alignment with humans depends on text length and task complexity. Findings indicate that self-explanations can provide faithful token-level rationales and that post-hoc methods often highlight structural or formatting tokens, suggesting complementary explanation strategies. The work contributes a rigorous, reproducible framework and open data/code, highlighting practical implications for trust and interpretability in real-world NLP tasks.

Abstract

Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a good explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.
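
Plausibility is reported as Cohen's Kappa between human and model rationales (see Figure 2). The following is a minimal sketch of that agreement score, assuming each rationale is encoded as a binary token mask (1 = token selected); the function name `cohens_kappa`, the mask encoding, and the example masks are illustrative assumptions and are not taken from the authors' code.

```python
from collections import Counter


def cohens_kappa(mask_a, mask_b):
    """Cohen's Kappa between two equal-length label sequences.

    Here the sequences are assumed to be binary rationale masks,
    one entry per token (1 = token is part of the rationale).
    """
    if len(mask_a) != len(mask_b) or len(mask_a) == 0:
        raise ValueError("masks must be non-empty and of equal length")
    n = len(mask_a)
    # Observed agreement: fraction of tokens with identical labels.
    p_o = sum(a == b for a, b in zip(mask_a, mask_b)) / n
    # Chance agreement: product of the marginal label frequencies.
    count_a, count_b = Counter(mask_a), Counter(mask_b)
    labels = set(mask_a) | set(mask_b)
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if p_e == 1.0:
        # Both masks are constant and identical; Kappa is undefined here.
        # Returning 1.0 is a convention of this sketch only.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical masks for a six-token sentence.
human = [0, 1, 1, 0, 0, 1]
model = [0, 1, 0, 0, 0, 1]
kappa = cohens_kappa(human, model)
```

For these two masks the observed agreement is 5/6 and the chance agreement is 1/2, so the score is 2/3. Unlike raw overlap, Kappa corrects for the agreement expected from how many tokens each annotator selects.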

Paper Structure (31 sections, 10 figures, 11 tables)


Introduction
Experimental Setup
Datasets
Rationale Extraction
Main Results
Task performance
Plausibility
Token Statistics
Summary
Analyses
Faithfulness
Analyzing Rationale Strategies in RaFoLa
Discussion
Related Work
Climate-Fever Annotation Study
...and 16 more sections

Figures (10)

Figure 1: An example from the SST sentiment classification dataset, with rationale annotations by humans and generated by Llama3.
Figure 2: Human-model agreement as Cohen's Kappa scores for all datasets. Scores were computed for correctly classified samples and then averaged across datasets. Scores, including standard deviation across seeds, are also shown in Table \ref{tab:plausibility} in the Appendix.
Figure 3: Faithfulness evaluation for SST and RaFoLa (articles #1 and #8). Model probability difference after masking tokens extracted from human rationales, model self-explanation rationales and post-hoc attributions (LRP, GxI) for Mistral and Llama3 with full results in Figure \ref{fig:faithfulness_app} (Appendix). Shaded bands indicate standard errors across samples. Faster drop in probability for early fractions indicates more faithful identification of task-relevant rationales. Human/Model (max) refers to rationales selected via greedy maximization of next-token probability difference.
Figure 4: Kappa scores for inter- and intra-annotator agreement of all three annotators.
Figure 5: Prodigy annotation framework for one claim. Parts of the evidence are annotated as either supporting (red) or contradicting (blue).
...and 5 more figures
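
The masking-based faithfulness evaluation described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `masking_curve`, `toy_proba`, the `[MASK]` placeholder, and the cue-word scoring are all hypothetical stand-ins for an LLM's probability of its predicted label.

```python
def masking_curve(tokens, order, predict_proba, mask_token="[MASK]", steps=5):
    """Probability drop after masking growing fractions of tokens.

    tokens:        list of input tokens
    order:         token indices, most important first
    predict_proba: callable mapping a token list to the probability
                   of the originally predicted label (assumed interface)
    Returns one value per step: p(original) - p(masked).
    """
    base = predict_proba(tokens)
    drops = []
    for step in range(1, steps + 1):
        k = round(len(order) * step / steps)  # number of tokens to mask
        masked = list(tokens)
        for i in order[:k]:
            masked[i] = mask_token
        drops.append(base - predict_proba(masked))
    return drops


def toy_proba(tokens):
    """Toy stand-in for a classifier: each cue word adds 0.2 to a 0.5 prior."""
    cues = {"great", "moving"}
    return 0.5 + 0.2 * sum(t in cues for t in tokens)


tokens = ["a", "great", "and", "moving", "film"]
rationale_first = [1, 3, 0, 2, 4]  # cue words masked first
rationale_last = [0, 2, 4, 1, 3]   # cue words masked last

faithful = [round(d, 2) for d in masking_curve(tokens, rationale_first, toy_proba)]
unfaithful = [round(d, 2) for d in masking_curve(tokens, rationale_last, toy_proba)]
```

With the toy scorer, masking the cue words first yields drops of 0.2, 0.4, 0.4, 0.4, 0.4, while masking them last yields 0.0, 0.0, 0.0, 0.2, 0.4. The earlier drop in the first ordering is what the caption describes as a more faithful identification of task-relevant tokens.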
