FaithLM: Towards Faithful Explanations for Large Language Models

Yu-Neng Chuang; Guanchu Wang; Chia-Yuan Chang; Ruixiang Tang; Shaochen Zhong; Fan Yang; Mengnan Du; Xuanting Cai; Vladimir Braverman; Xia Hu

TL;DR

This work addresses the lack of faithfulness in natural language explanations produced by large language models. It introduces FaithLM, a model-agnostic framework that evaluates and improves explanation fidelity using contrary-hint interventions, treating faithfulness as a causal property. Through iterative fidelity-enhanced explanations and trigger-prompt optimization, FaithLM achieves higher fidelity and closer alignment to human rationales across multiple domains and backbones. The results demonstrate a principled route to more faithful and reliable LLM explanations, with practical implications for high-stakes applications, while acknowledging environmental and computational costs as areas for improvement.

Abstract

Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness: they do not reliably reflect the evidence the model uses to decide. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines. These findings highlight that intervention-based evaluation, coupled with iterative optimization, provides a principled route toward faithful and reliable LLM explanations.
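
The intervention idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the binary score, the prompt template with a "Hint:" line, and the names `contrary_hint_score`, `refine_explanation`, `negate`, and `propose` are all assumptions made for illustration, and the toy predictor stands in for a real LLM call. The sketch scores an explanation by checking whether contradicting it flips the prediction, then greedily keeps candidate explanations that raise the score.

```python
from typing import Callable, Tuple

# Hypothetical interface: a predictor maps a prompt string to a label.
Predictor = Callable[[str], str]


def contrary_hint_score(predict: Predictor, question: str, contrary_hint: str) -> int:
    """Return 1 if adding the contrary hint changes the prediction, else 0."""
    original = predict(question)
    intervened = predict(f"{question}\nHint: {contrary_hint}")
    return int(intervened != original)


def refine_explanation(
    predict: Predictor,
    question: str,
    explanation: str,
    negate: Callable[[str], str],        # assumed: builds the contrary hint
    propose: Callable[[str, str], str],  # assumed: suggests a new explanation
    steps: int = 3,
) -> Tuple[str, int]:
    """Greedy loop: keep a proposed explanation only if its score improves."""
    best = explanation
    best_score = contrary_hint_score(predict, question, negate(best))
    for _ in range(steps):
        if best_score == 1:  # the binary score cannot improve further
            break
        candidate = propose(question, best)
        score = contrary_hint_score(predict, question, negate(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins for the LLM components (purely illustrative).
def toy_predict(prompt: str) -> str:
    return "sunglasses" if "is not raining" in prompt.lower() else "umbrella"


def toy_negate(explanation: str) -> str:
    return explanation.replace(" is ", " is not ", 1)


def toy_propose(question: str, current: str) -> str:
    return "because it is raining"


QUESTION = "It is raining. What should I take?"
best, score = refine_explanation(
    toy_predict, QUESTION, "because the sky is blue", toy_negate, toy_propose
)
print(best, score)
```

In this toy setting, the explanation "because the sky is blue" scores 0 because contradicting it leaves the prediction unchanged, while "because it is raining" scores 1. A real setup would replace the toy functions with LLM calls and would likely average scores over many test instances.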

Paper Structure (56 sections, 3 theorems, 6 equations, 18 figures, 8 tables, 2 algorithms)

Introduction
Preliminaries
Notations and Objectives
Difference between LLM Explanation and Chain-of-thoughts
LLM reasoning and Chain-of-thoughts
LLM Explanations.
Limitations of Traditional Fidelity Measurement on NL Explanations
FaithLM: The Explainer LLM Framework
Fidelity via Contrary Hint Interventions
Contrary Hint Interventions
FaithLM on Fidelity-enhanced Explanation
Fidelity-enhanced Explanation.
FaithLM on Trigger Prompt Optimization
Trigger Prompt Optimization.
Algorithm of Trigger Prompt Optimization.
...and 41 more sections

Key Result

Theorem 1

Let $f:\mathcal{X}\!\times\!\mathcal{C}\!\to\!\Delta(\mathcal{Y})$ be a language model mapping an input $X$ and latent context $C$ to a predictive distribution over an output space $\mathcal{Y}$. Let $E_{NL}$ denote a natural-language explanation of $f(X;C)$, and let $\lnot E_{NL}$ denote its contra[...] Hence, the contrary-hint score $S_E$ constitutes a valid empirical estimator of faithfulness when t[...]

Figures (18)

Figure 1: Fidelity evaluation with hints. The evaluator calculates the fidelity scores of the derived explanations based on their contrary hints.
Figure 2: An overview of the FaithLM framework for two different optimization objectives. The blue dotted line shows the trajectory for optimizing the NL explanation (Section \ref{sec:exp_opt}), and the red dotted line shows the trajectory for optimizing the explanation trigger prompt (Section \ref{sec:tri_opt}). "Traj. Prompt" denotes the trajectory system prompt shown in Appendix \ref{appendix:prompt_xllm}.
Figure 3: Fidelity evaluation of explanations on the ECQA (left), TriviaQA-Long (middle), and COPA (right) datasets. The reported scores are the average fidelity over test instances at each step of fidelity-enhanced optimization.
Figure 4: Trustfulness evaluation of the NL explanations. The higher the proportion of "similar to ground-truth explanation," the more consistent the derived explanations are with the ground-truth NL explanations.
Figure 5: Fidelity at different optimization steps of the trigger prompts (Algorithm \ref{alg:xllm-trg}) on the ECQA, TriviaQA, and COPA datasets. Fidelity increases as the number of steps grows.
...and 13 more figures

Theorems & Definitions (4)

Theorem 1: Latent-Context Intervention Validity for Faithfulness
proof
Corollary 1: Robustness and Monotonicity