A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification
Stephanie Brandl, Oliver Eberle
TL;DR
The paper systematically analyzes the quality of self-explanations generated by four open-weight, instruction-tuned LLMs on three text classification tasks, comparing them to human rationales and post-hoc attributions across multiple languages and domains. It evaluates both model-generated rationales and traditional attribution methods on plausibility (agreement with human rationales) and faithfulness (interventional token masking), and finds that alignment with humans depends on text length and task complexity. Self-explanations can provide faithful token-level rationales, whereas post-hoc methods often highlight structural or formatting tokens, which points to fundamentally different explanation strategies. The work contributes a rigorous, reproducible framework and open data/code, highlighting practical implications for trust and interpretability in real-world NLP tasks.
Abstract
Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a good explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales depends strongly on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.
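To make the two evaluation axes concrete, the following is a minimal sketch and not the authors' implementation. It assumes rationales are binary token masks, uses token-level F1 as a stand-in plausibility score, and uses the drop in predicted probability after masking rationale tokens as a stand-in faithfulness score. The exact metrics are not specified in this section. All function names, the `[MASK]` placeholder, and the toy classifier are hypothetical.

```python
from typing import Callable, List, Sequence


def token_f1(model_rationale: Sequence[int], human_rationale: Sequence[int]) -> float:
    """Plausibility proxy: token-level F1 between two binary rationale masks."""
    if len(model_rationale) != len(human_rationale):
        raise ValueError("masks must have the same length")
    # Tokens selected by both the model and the human annotator.
    tp = sum(1 for m, h in zip(model_rationale, human_rationale) if m and h)
    if tp == 0:
        return 0.0
    precision = tp / sum(1 for m in model_rationale if m)
    recall = tp / sum(1 for h in human_rationale if h)
    return 2 * precision * recall / (precision + recall)


def mask_tokens(tokens: Sequence[str], rationale: Sequence[int],
                mask_token: str = "[MASK]") -> List[str]:
    """Replace every token flagged in the rationale with a placeholder."""
    return [mask_token if r else t for t, r in zip(tokens, rationale)]


def probability_drop(tokens: Sequence[str], rationale: Sequence[int],
                     predict_proba: Callable[[Sequence[str]], float]) -> float:
    """Faithfulness proxy: how much the predicted-class probability falls
    when the rationale tokens are masked (larger drop = more faithful)."""
    return predict_proba(tokens) - predict_proba(mask_tokens(tokens, rationale))


def toy_predict_proba(tokens: Sequence[str]) -> float:
    """Hypothetical classifier: probability of 'positive' from a tiny lexicon."""
    hits = sum(1 for t in tokens if t.lower() in {"great", "love"})
    return min(1.0, 0.5 + 0.2 * hits)


tokens = ["I", "love", "this", "great", "film"]
human_mask = [0, 1, 0, 1, 0]   # annotator marked "love" and "great"
model_mask = [0, 1, 0, 0, 0]   # self-explanation marked only "love"

plausibility = token_f1(model_mask, human_mask)
faithfulness = probability_drop(tokens, model_mask, toy_predict_proba)
```

In this toy case the self-explanation recovers one of the two human-marked tokens, so precision is perfect and recall is one half. Masking that single token still lowers the toy classifier's confidence. A real study would replace `toy_predict_proba` with calls to the evaluated LLM.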
