FaithLM: Towards Faithful Explanations for Large Language Models

Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Vladimir Braverman, Xia Hu

TL;DR

This work addresses the lack of faithfulness in natural language explanations produced by large language models. It introduces FaithLM, a model-agnostic framework that evaluates and improves explanation fidelity using contrary-hint interventions, treating faithfulness as a causal property. Through iterative fidelity-enhanced explanations and trigger-prompt optimization, FaithLM achieves higher fidelity and closer alignment to human rationales across multiple domains and backbones. The results demonstrate a principled route to more faithful and reliable LLM explanations, with practical implications for high-stakes applications, while acknowledging environmental and computational costs as areas for improvement.

Abstract

Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness: they do not reliably reflect the evidence the model uses to decide. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines. These findings highlight that intervention-based evaluation, coupled with iterative optimization, provides a principled route toward faithful and reliable LLM explanations.
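
The intervention idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Predictor` interface, the toy model, and the specific score used here (the drop in probability of the originally predicted answer once a contrary hint is supplied) are all assumptions made for illustration, and the paper's exact definition of the contrary-hint score may differ.

```python
from typing import Callable, Dict, Optional

# Hypothetical model interface (an assumption, not from the paper):
# (question, optional hint) -> {answer label: probability}.
Predictor = Callable[[str, Optional[str]], Dict[str, float]]


def contrary_hint_score(predict: Predictor, question: str, contrary_hint: str) -> float:
    """Drop in probability of the original answer once the contrary hint is added.

    This scoring rule is an illustrative assumption: a larger drop means the
    contradicted explanation mattered more to the prediction.
    """
    base = predict(question, None)
    original_answer = max(base, key=base.get)
    hinted = predict(question, contrary_hint)
    return base[original_answer] - hinted.get(original_answer, 0.0)


def toy_predict(question: str, hint: Optional[str]) -> Dict[str, float]:
    """Stand-in for an LLM: answers "yes" unless the hint denies that birds fly."""
    if hint is not None and "cannot fly" in hint:
        return {"yes": 0.2, "no": 0.8}
    return {"yes": 0.9, "no": 0.1}


question = "Can a sparrow fly?"
# Contradicts an explanation the toy model relies on ("birds can fly").
relevant = contrary_hint_score(toy_predict, question, "Birds cannot fly.")
# Contradicts an explanation the toy model ignores ("sparrows are brown").
irrelevant = contrary_hint_score(toy_predict, question, "Sparrows are not brown.")
```

In this toy setting, contradicting the explanation the model actually depends on shifts the prediction, while contradicting an unrelated explanation leaves it unchanged, so `relevant` exceeds `irrelevant`. A refinement loop in the spirit of the abstract would keep whichever candidate explanation attains the higher score.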

Paper Structure

This paper contains 56 sections, 3 theorems, 6 equations, 18 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Let $f:\mathcal{X}\!\times\!\mathcal{C}\!\to\!\Delta(\mathcal{Y})$ be a language model mapping an input $X$ and latent context $C$ to a predictive distribution over an output space $\mathcal{Y}$. Let $E_{NL}$ denote a natural-language explanation of $f(X;C)$, and let $\lnot E_{NL}$ denote its contra... Hence, the contrary-hint score $S_E$ constitutes a valid empirical estimator of faithfulness when t...

Figures (18)

  • Figure 1: Fidelity evaluation with hints. The evaluator calculates the fidelity scores of the derived explanations based on their contrary hints.
  • Figure 2: An overview of the FaithLM framework for two different optimization objectives. The blue dotted line shows the trajectory for optimizing the NL explanation (Section \ref{sec:exp_opt}), and the red dotted line shows the trajectory for optimizing the explanation trigger prompt (Section \ref{sec:tri_opt}). "Traj. Prompt" denotes the trajectory system prompt shown in Appendix \ref{appendix:prompt_xllm}.
  • Figure 3: Fidelity evaluation of explanations on the ECQA (left), TriviaQA-Long (middle), and COPA (right) datasets. The reported scores are the average fidelity over test instances at each step of fidelity-enhanced optimization.
  • Figure 4: Trustfulness evaluation of the NL explanations. The higher the proportion of "similar to ground-truth explanation," the more consistent the derived explanations are with the ground-truth NL explanations.
  • Figure 5: Fidelity at different optimization steps of the trigger prompts (Algorithm \ref{alg:xllm-trg}) on the ECQA, TriviaQA, and COPA datasets. Fidelity increases with the number of steps.
  • ...and 13 more figures

Theorems & Definitions (3)

  • Theorem 1: Latent-Context Intervention Validity for Faithfulness
  • proof
  • Corollary 1: Robustness and Monotonicity