Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

Zidi Xiong, Shan Chen, Zhenting Qi, Himabindu Lakkaraju

TL;DR

This work tackles the reliability of thinking drafts in Large Reasoning Models by introducing a counterfactual intervention framework to measure two core notions: intra-draft faithfulness and draft-to-answer faithfulness. It formalizes evaluation procedures, applies them to six diverse LRMs across GPQA and MMLU tasks, and reveals that models often over- or under-interpret intermediate steps, with backtracking and explicit corrections showing higher faithfulness than forward continuation. The findings indicate that the answer stage frequently adds new reasoning beyond the draft and that faithfulness patterns vary with model size, tuning, and task difficulty, underscoring the need for more faithful and interpretable reasoning pipelines. The proposed framework provides a scalable, rigorous basis for future monitoring, control, and interpretability research in reasoning-enabled systems.

Abstract

Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multi-path Chain-of-Thought explorations before producing final answers. Ensuring the faithfulness of these intermediate reasoning processes is crucial for reliable monitoring, interpretation, and effective control. In this paper, we propose a systematic counterfactual intervention framework to rigorously evaluate thinking draft faithfulness. Our approach focuses on two complementary dimensions: (1) Intra-Draft Faithfulness, which assesses whether individual reasoning steps causally influence subsequent steps and the final draft conclusion through counterfactual step insertions; and (2) Draft-to-Answer Faithfulness, which evaluates whether final answers are logically consistent with and dependent on the thinking draft, by perturbing the draft's concluding logic. We conduct extensive experiments across six state-of-the-art LRMs. Our findings show that current LRMs demonstrate selective faithfulness to intermediate reasoning steps and frequently fail to faithfully align with the draft conclusions. These results underscore the need for more faithful and interpretable reasoning in advanced LRMs.
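The counterfactual step insertion described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Trial`, `insert_counterfactual_step`, and `faithful_rates` are hypothetical, and the sketch assumes that each trial has already been labeled as faithful or not by some external judgment, which the abstract does not specify. It only shows how a counterfactual step might be spliced into a draft and how faithful rates might be tallied per inserted-step type and response behavior.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One intervention outcome (hypothetical record layout)."""
    step_type: str   # assumed labels: "continue" or "backtrack"
    behavior: str    # assumed labels: "explicit_correction" or "step_following"
    faithful: bool   # assumed to come from an external judgment


def insert_counterfactual_step(steps, position, counterfactual):
    """Return a new draft with the counterfactual step placed at `position`."""
    return steps[:position] + [counterfactual] + steps[position:]


def faithful_rates(trials):
    """Fraction of faithful trials for each (step_type, behavior) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [faithful count, total count]
    for t in trials:
        key = (t.step_type, t.behavior)
        counts[key][0] += int(t.faithful)
        counts[key][1] += 1
    return {key: f / n for key, (f, n) in counts.items()}
```

Keeping the insertion and the tally as separate functions means the same aggregation can be reused for any intervention type, as long as each trial carries its own labels.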

Paper Structure

This paper contains 43 sections, 4 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The faithfulness settings considered. Intra-Draft Faithfulness tests whether the draft's conclusion faithfully depends on its preceding reasoning, and Draft-to-Answer Faithfulness tests whether the answer stage faithfully depends on the thinking draft.
  • Figure 2: Example of counterfactually inserted Continue steps for Intra-Draft Faithfulness.
  • Figure 3: Example of counterfactually inserted Backtrack steps for Intra-Draft Faithfulness.
  • Figure 4: Detailed faithfulness rates across two types of inserted steps (Continue, Backtrack) and model response behaviors (Explicit Correction, Step Following) on GPQA. Explicit corrections consistently yield a higher faithful rate. Among step-following cases, Backtrack steps exhibit a greater faithful rate than Continue steps.
  • Figure 5: Example of a counterfactually inserted conclusion for Draft-to-Answer Faithfulness.
  • ...and 3 more figures