Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich

TL;DR

The work tackles the mismatch between predictive accuracy and faithful explanations by adopting a causal view of explanation faithfulness and systematically varying two inference-time factors (few-shot demonstrations and prompting design) and one training-time factor (RLHF) across BBQ and MedQA with GPT-4.1-mini and LLaMA 70B/8B. It shows that prompting design strongly shapes faithfulness, that few-shot effects are model- and task-dependent, and that RLHF can improve faithfulness on MedQA, challenging the notion that higher accuracy implies better explanations. The study offers practical deployment guidance: treat prompting as a safety control, curate high-quality few-shot demonstrations, and audit faithfulness with concept-level counterfactual tests. It also acknowledges metric limitations and the need for broader evaluation. Overall, it advances understanding of how to achieve more trustworthy LLM explanations in sensitive domains by highlighting inference-time levers alongside training-time factors.

Abstract

Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference-time and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.

Paper Structure

This paper contains 33 sections, 1 equation, 1 figure, 12 tables.

Figures (1)

  • Figure 1: Causal graph of LLM generation. Nodes are grouped into four categories: Input ($x$, white) represents the user query; Intrinsic factor ($\theta$, dark gray) encodes the model parameters learned during the pretraining and alignment phases; Extrinsic factors ($p, d$, light gray) denote inference-time interventions, namely prompting strategy and few-shot demonstrations; Outputs ($a, e$, double-bordered) are the model’s answer and its explanation. The exogenous node $u$ (dashed circle) represents stochasticity from decoding. Solid arrows indicate causal influence; dashed arrows capture inference setups where answer and explanation are explicitly conditioned on one another (e.g., post-answer explanation). This view highlights that unfaithfulness emerges not only from intrinsic model design but also from extrinsic inference conditions.
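
The graph described in the Figure 1 caption can be written down as a small data structure, which may help when reasoning about which factors a practitioner can actually intervene on. The sketch below is illustrative only: the node categories follow the caption, but the exact edge set is an assumption (the caption does not list individual arrows), and the helper names `parents` and `controllable_at_inference` are hypothetical.

```python
# Illustrative encoding of the causal graph in the Figure 1 caption.
# Node categories follow the caption; the edge set below is an ASSUMPTION,
# since the caption does not enumerate individual arrows.
NODE_CATEGORY = {
    "x": "input",          # user query
    "theta": "intrinsic",  # model parameters (pretraining and alignment)
    "p": "extrinsic",      # prompting strategy
    "d": "extrinsic",      # few-shot demonstrations
    "u": "exogenous",      # decoding stochasticity
    "a": "output",         # answer
    "e": "output",         # explanation
}

# Assumed solid arrows: every non-output node influences both outputs.
SOLID_EDGES = {
    (src, dst)
    for src, cat in NODE_CATEGORY.items() if cat != "output"
    for dst in ("a", "e")
}

# Dashed arrows: answer and explanation may be conditioned on one another,
# depending on the inference setup (e.g., post-answer explanation).
DASHED_EDGES = {("a", "e"), ("e", "a")}


def parents(node, include_dashed=False):
    """Return the sorted parents of `node` under the assumed edge set."""
    edges = SOLID_EDGES | DASHED_EDGES if include_dashed else SOLID_EDGES
    return sorted(src for src, dst in edges if dst == node)


def controllable_at_inference(node):
    """Return the parents of `node` that are inference-time interventions."""
    return sorted(p for p in parents(node) if NODE_CATEGORY[p] == "extrinsic")


if __name__ == "__main__":
    print(parents("e"))
    print(controllable_at_inference("e"))
```

Under these assumptions, the explanation node has both intrinsic and extrinsic parents, which mirrors the caption's point that unfaithfulness can arise from inference conditions as well as from the model itself.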