A Causal Lens for Evaluating Faithfulness Metrics

Kerem Zaman, Shashank Srivastava

TL;DR

This paper tackles the challenge of evaluating faithfulness metrics for natural language explanations produced by LLMs. It introduces Causal Diagnosticity, a testbed that uses causal edits to generate ground-truth faithful and unfaithful explanations across four tasks (fact-checking, analogy, object counting, multi-hop reasoning) and benchmarks multiple faithfulness metrics, including Simulatability, CoT corruptions, and CC-SHAP. The findings show that the Filler Tokens CoT metric is most diagnostic overall and that continuous metrics generally outperform binary variants, though they can be sensitive to noise and model choice. The work provides a principled framework, a diverse dataset, and a comprehensive benchmarking of metrics and editing methods, highlighting the need for more robust and interpretable faithfulness assessments for language-model explanations.

Abstract

Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's true reasoning faithfully. While several faithfulness metrics have been proposed, they are often evaluated in isolation, making principled comparisons between them difficult. We present Causal Diagnosticity, a testbed framework for evaluating faithfulness metrics for natural language explanations. We use the concept of diagnosticity, and employ model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate prominent faithfulness metrics, including post-hoc explanation and chain-of-thought methods. Diagnostic performance varies across tasks and models, with Filler Tokens performing best overall. Additionally, continuous metrics are generally more diagnostic than binary ones but can be sensitive to noise and model choice. Our results highlight the need for more robust faithfulness metrics.

Paper Structure

This paper contains 53 sections, 8 equations, 22 figures, 8 tables.

Figures (22)

  • Figure 1: Our framework has three stages: (1) Knowledge Editing: applying counterfactual edits to models; (2) Explanation Generation: generating faithful and unfaithful explanation pairs using the edited models, or synthetically generating such pairs based on the edits; (3) Diagnosticity Evaluation: assessing the chosen faithfulness metric with one of the edited models using the faithful-unfaithful explanation pairs. Diagnostic faithfulness metrics should assign a higher score to the faithful explanations than to the unfaithful ones.
  • Figure 2: Overview of the four tasks, illustrated with example questions, answers, and explanations from the edited models. The explanations can be model generated or synthetically constructed to align with specific edits. The blue and orange robots represent models $\textcolor{blue}{\bar{M}}$ and $\textcolor{orange}{\widetilde{M}}$, respectively, while the color-matched boxes indicate counterfactual knowledge injected through editing. Speech bubbles next to each model display the answer ($\bm{y}$) and explanation ($\textcolor{blue}{\bar{\bm{\varepsilon}}}$ or $\textcolor{orange}{\widetilde{\bm{\varepsilon}}}$). Although both models generate the same answer, their reasoning differs, as reflected in the explanations.
  • Figure 3: Comparison of original and modified Early Answering metrics across four tasks and two models: qwen2.5-7b and gemma-2-9b-it. Error bars indicate 95% bootstrap confidence intervals.
  • Figure 4: Percentage of faithful explanations with lower perplexity than unfaithful ones, by task and model. Higher values indicate more successful edits. Error bars indicate 95% bootstrap confidence intervals.
  • Figure 5: Diagnosticity scores for each metric on qwen-2.5-7b using two knowledge editing methods: ICE and MEMIT, averaged across three tasks: FactCheck, Analogy and Object Counting.
  • ...and 17 more figures
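
The captions above describe two computations that a short sketch may make concrete: scoring a metric by how often it ranks the faithful explanation above the unfaithful one (Figure 1), and attaching a 95% bootstrap confidence interval to that score (Figures 3 and 4). The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The function names `diagnosticity` and `bootstrap_ci`, the half-credit rule for ties, and the percentile bootstrap are all assumptions introduced here, and the scores in the usage note are made-up toy values.

```python
import random
from typing import List, Sequence, Tuple


def diagnosticity(
    faithful: Sequence[float],
    unfaithful: Sequence[float],
    tie_credit: float = 0.5,
) -> float:
    """Fraction of pairs in which the metric scores the faithful explanation higher.

    `tie_credit` (an assumption of this sketch) is the credit given when both
    explanations in a pair receive the same score.
    """
    if len(faithful) != len(unfaithful) or len(faithful) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    total = 0.0
    for f, u in zip(faithful, unfaithful):
        if f > u:
            total += 1.0
        elif f == u:
            total += tie_credit
    return total / len(faithful)


def bootstrap_ci(
    faithful: Sequence[float],
    unfaithful: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for diagnosticity, resampling whole pairs."""
    rng = random.Random(seed)
    n = len(faithful)
    stats: List[float] = []
    for _ in range(n_resamples):
        # Resample pair indices with replacement so each pair stays aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            diagnosticity([faithful[i] for i in idx], [unfaithful[i] for i in idx])
        )
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

For example, with toy scores `faithful = [0.9, 0.8, 0.3, 0.5]` and `unfaithful = [0.1, 0.9, 0.3, 0.2]`, `diagnosticity` counts two wins, one loss, and one tie, giving 0.625. A binary metric produces many ties, so the choice of `tie_credit` matters more for binary variants than for continuous ones.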