A Causal Lens for Evaluating Faithfulness Metrics
Kerem Zaman, Shashank Srivastava
TL;DR
This paper tackles the challenge of evaluating faithfulness metrics for natural language explanations produced by LLMs. It introduces Causal Diagnosticity, a testbed that uses model editing to generate ground-truth faithful and unfaithful explanation pairs across four tasks (fact-checking, analogy, object counting, multi-hop reasoning), and it benchmarks multiple faithfulness metrics, including Simulatability, CoT corruptions, and CC-SHAP. The findings show that the Filler Tokens CoT metric is the most diagnostic overall and that continuous metrics generally outperform binary variants, though they can be sensitive to noise and model choice. The work provides a principled framework, a diverse dataset, and a comprehensive benchmark of metrics and editing methods, highlighting the need for more robust and interpretable faithfulness assessments of language-model explanations.
Abstract
Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, these explanations may not faithfully reflect the model's true reasoning. While several faithfulness metrics have been proposed, they are often evaluated in isolation, making principled comparisons between them difficult. We present Causal Diagnosticity, a testbed framework for evaluating faithfulness metrics for natural language explanations. We build on the concept of diagnosticity and employ model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate prominent faithfulness metrics, including post-hoc explanation and chain-of-thought methods. Diagnostic performance varies across tasks and models, with Filler Tokens performing best overall. Additionally, continuous metrics are generally more diagnostic than binary ones but can be sensitive to noise and model choice. Our results highlight the need for more robust faithfulness metrics.
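As an illustration of how diagnosticity can be estimated from faithful-unfaithful explanation pairs, the sketch below computes the fraction of pairs in which a metric scores the faithful explanation above the unfaithful one. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `diagnosticity`, the paired-list input format, and the convention of counting ties as half a win are all hypothetical choices made for this example.

```python
from typing import Sequence


def diagnosticity(
    faithful_scores: Sequence[float],
    unfaithful_scores: Sequence[float],
) -> float:
    """Estimate how often a metric prefers the faithful explanation.

    Both sequences are aligned: index i holds the metric's scores for the
    faithful and unfaithful explanation of the same pair (assumed format).
    """
    if len(faithful_scores) != len(unfaithful_scores):
        raise ValueError("score lists must have the same length")
    if len(faithful_scores) == 0:
        raise ValueError("at least one explanation pair is required")

    wins = 0.0
    for faithful, unfaithful in zip(faithful_scores, unfaithful_scores):
        if faithful > unfaithful:
            wins += 1.0   # metric ranks the faithful explanation higher
        elif faithful == unfaithful:
            wins += 0.5   # assumption: a tie counts as half a win
    return wins / len(faithful_scores)


if __name__ == "__main__":
    # Hypothetical scores for three explanation pairs.
    print(diagnosticity([0.9, 0.4, 0.5], [0.1, 0.6, 0.5]))
```

Under this tie convention, a metric that assigns identical scores to both explanations in every pair obtains 0.5, the same value as random guessing.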
