FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning

Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, Tianlong Chen

TL;DR

FaithCoT-Bench addresses the practical challenge of determining instance-level faithfulness in Chain-of-Thought reasoning by unifying task formulation, dataset construction, and evaluation. It introduces a discriminative task for unfaithfulness, presents the FINE-CoT expert-annotated dataset with over 1,000 trajectories across four domains and four models, and defines eight fine-grained unfaithfulness signals. The benchmark evaluates 11 detection methods spanning counterfactual, logit-based, and LLM-as-Judge approaches, revealing that LLM-based judgments perform best but still face domain- and model-specific limitations, especially in knowledge-intensive tasks and with stronger models. Overall, FaithCoT-Bench provides a foundational resource for assessing instance-level faithfulness and guiding the development of more interpretable, trustworthy reasoning in LLMs.

Abstract

Large language models (LLMs) increasingly rely on Chain-of-Thought (CoT) prompting to improve problem-solving and provide seemingly transparent explanations. However, growing evidence shows that CoTs often fail to faithfully represent the underlying reasoning process, raising concerns about their reliability in high-risk applications. Although prior studies have focused on mechanism-level analyses showing that CoTs can be unfaithful, they leave open the practical challenge of deciding whether a specific trajectory is faithful to the internal reasoning of the model. To address this gap, we introduce FaithCoT-Bench, a unified benchmark for instance-level CoT unfaithfulness detection. Our framework establishes a rigorous task formulation that casts unfaithfulness detection as a discriminative decision problem, and provides FINE-CoT (Faithfulness instance evaluation for Chain-of-Thought), an expert-annotated collection of over 1,000 trajectories generated by four representative LLMs across four domains, including more than 300 unfaithful instances with fine-grained causes and step-level evidence. We further conduct a systematic evaluation of eleven representative detection methods spanning counterfactual, logit-based, and LLM-as-judge paradigms, deriving empirical insights that clarify the strengths and weaknesses of existing approaches and reveal the increased challenges of detection in knowledge-intensive domains and with more advanced models. To the best of our knowledge, FaithCoT-Bench establishes the first comprehensive benchmark for instance-level CoT faithfulness, setting a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
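To make the "discriminative decision problem" framing concrete, the following is a minimal sketch of how a detector might be scored against expert labels. Everything here is an assumption for illustration: the `Trajectory` fields, the `evaluate_detector` helper, the 0.5 threshold, and the toy `answer_not_in_steps` heuristic are hypothetical and are not the benchmark's data schema, metrics, or any of its evaluated methods.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trajectory:
    """One CoT instance with a binary expert label (hypothetical schema)."""
    question: str
    cot_steps: tuple[str, ...]
    answer: str
    unfaithful: bool  # True if annotators judged the CoT unfaithful


# A detector maps a trajectory to an unfaithfulness score in [0, 1].
Detector = Callable[[Trajectory], float]


def evaluate_detector(
    detector: Detector,
    trajectories: Sequence[Trajectory],
    threshold: float = 0.5,  # assumed decision threshold
) -> dict[str, float]:
    """Score a detector as a binary classifier of unfaithful instances."""
    tp = fp = fn = tn = 0
    for traj in trajectories:
        predicted = detector(traj) >= threshold
        if predicted and traj.unfaithful:
            tp += 1
        elif predicted and not traj.unfaithful:
            fp += 1
        elif not predicted and traj.unfaithful:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


def answer_not_in_steps(traj: Trajectory) -> float:
    """Toy placeholder detector: flag a trajectory whose final answer
    never appears in any reasoning step. Not a method from the paper."""
    mentioned = any(traj.answer in step for step in traj.cot_steps)
    return 0.0 if mentioned else 1.0
```

Any real detector (counterfactual, logit-based, or LLM-as-judge) could be plugged in through the same `Detector` signature, which keeps the evaluation loop independent of how the score is produced.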

Paper Structure

This paper contains 45 sections, 1 equation, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of FaithCoT-Bench. The framework unifies task formulation, dataset construction, and systematic evaluation for instance-level unfaithful CoT detection. We collect CoT traces from four domains and four LLMs, annotate them through a multi-stage human pipeline to build the FINE-CoT dataset, and benchmark existing detection methods across counterfactual, logit-based, and LLM-as-Judge paradigms.
  • Figure 2: Two primary reasons for unfaithfulness
  • Figure 3: Human Annotation's Kappa.
  • Figure 4: Statistics on the unfaithfulness ratio.
  • Figure 5: Eight Fine-grained principles of unfaithfulness in CoT
  • ...and 6 more figures

Theorems & Definitions (3)

  • Definition 1: Instance-level CoT Unfaithfulness Detection
  • Definition 2: Post-hoc Reasoning
  • Definition 3: Spurious Reasoning Chains