Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang

TL;DR

This work introduces a causal framework for Chain-of-Thought (CoT) reasoning by defining the Probability of Sufficiency ($PS$), Probability of Necessity ($PN$), and Probability of Necessary and Sufficient causes ($PNS$) to quantify each reasoning step’s contribution. The authors develop a two-stage PNS-based optimization with rollout-based interventions to reconstruct minimal CoT traces that are both sufficient and necessary, then apply these traces to in-context learning (ICL) and supervised fine-tuning (SFT). Empirical results across GSM-8k, MATH-500, AIME, and CommonsenseQA show reduced token and step counts while maintaining or improving accuracy, and case studies indicate stronger causal fidelity in augmented prompts and models. The approach promises improved reasoning efficiency and cost-effectiveness for LLMs, with potential applicability to broader causal analysis of multi-step systems.

Abstract

Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.

Paper Structure

This paper contains 50 sections, 4 theorems, 28 equations, 9 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

Assume: Then, $\mathrm{PNS}(\mathbf{S},\overline{\mathbf{s}}_t, \mathbf{q}) = P(\mathbf{A}_{\operatorname{do}(\mathbf{S})}=\mathbf{y}, \mathbf{A}_{\operatorname{do}(\overline{\mathbf{S}})} \neq \mathbf{y} \mid \mathbf{q})$ simplifies to $1 - P(\mathbf{A}=\mathbf{y} \mid \operatorname{do}(\overline{\mathbf{S}}),\math
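
The lemma statement above is cut off, so the paper's full assumptions and exact conditioning set may differ from what follows. As a hedged sketch, assume only the condition named in the lemma's title, $P(\mathbf{A}=\mathbf{y}\mid \operatorname{do}(\mathbf{S}),\mathbf{q})=1$. An event of probability one can be dropped from a joint probability, which gives the simplification in three lines:

$$
\begin{aligned}
\mathrm{PNS}(\mathbf{S},\overline{\mathbf{s}}_t,\mathbf{q})
  &= P\big(\mathbf{A}_{\operatorname{do}(\mathbf{S})}=\mathbf{y},\ \mathbf{A}_{\operatorname{do}(\overline{\mathbf{S}})}\neq\mathbf{y} \mid \mathbf{q}\big) \\
  &= P\big(\mathbf{A}_{\operatorname{do}(\overline{\mathbf{S}})}\neq\mathbf{y} \mid \mathbf{q}\big)
     \qquad \text{because } P\big(\mathbf{A}_{\operatorname{do}(\mathbf{S})}=\mathbf{y}\mid\mathbf{q}\big)=1 \\
  &= 1 - P\big(\mathbf{A}_{\operatorname{do}(\overline{\mathbf{S}})}=\mathbf{y} \mid \mathbf{q}\big)
\end{aligned}
$$

Under this reading, the joint counterfactual quantity reduces to a single interventional probability, which can be estimated by sampling answers after replacing the step in question.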

Figures (9)

  • Figure 1: (a) Illustration of three reasoning types—Sufficient but Unnecessary, Necessary but Insufficient, and Sufficient and Necessary—based on actual model-generated responses to a GSM-8k question: "Compute $99^2 + 99 + 1$ in your head." (b) Path selection process using our method. Purple nodes denote CoT steps obtained through causal intervention (rollout), while green nodes indicate the minimal steps satisfying both sufficiency and necessity.
  • Figure 2: Causal Optimization Framework for CoT Reasoning. Our method identifies and retains only causally essential reasoning steps to form a compact CoT. (1) A base model generates the initial CoT trace, possibly containing redundant steps. (2) Sufficiency is estimated by checking if the full CoT leads to a correct answer. (3) For each step $s_t$, necessity is evaluated via counterfactual substitution $\tilde{s}_t$ using a rollout model, followed by answer scoring from the base model. (4) The Probability of Necessity and Sufficiency (PNS) is computed to measure causal contribution. (5) Non-essential steps are pruned to obtain a compact CoT, which is then used for fine-tuning or in-context learning.
  • Figure 3: Average PNS values before and after optimization across different models and datasets. Each subfigure displays PNS improvements across 15 sampled problems: (a) Qwen-2.5-72B-Instruct on AIME, (b) Qwen-2.5-72B-Instruct on CommonsenseQA, (c) DeepSeek-R1 on AIME, and (d) DeepSeek-R1 on CommonsenseQA. PNS-optimized CoTs exhibit consistently higher PNS values, indicating that the retained steps are more necessary.
  • Figure 4: Case Study: Comparison of direct response from Qwen-2.5-72B-Instruct (blue background) and response under ICL with optimized CoT examples (pink background) on a MATH-500 problem. The optimized CoT enables more sufficient and necessary reasoning.
  • Figure 5: Average PNS comparison for Qwen-2.5-72B-Instruct on the AIME dataset.
  • ...and 4 more figures
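
The pipeline described in the Figure 2 caption (sufficiency check, per-step counterfactual substitution, PNS scoring, pruning) can be sketched roughly as follows. This is a toy illustration, not the authors' algorithm: `toy_answer` and `toy_rollout` are hypothetical stand-ins for the base model and the rollout model, the threshold of 0.5 and the greedy left-to-right pruning order are assumptions, and the score uses the simplified form $1 - P(\mathbf{A}=\mathbf{y}\mid \operatorname{do}(\overline{\mathbf{S}}))$, which presumes the full chain is already sufficient.

```python
import random
from typing import Callable, List

# Hypothetical interfaces (not from the paper): an answerer maps
# (question, steps) to an answer string; a rollout replaces step t
# with a counterfactual alternative and returns the new step list.
Answerer = Callable[[str, List[str]], str]
Rollout = Callable[[str, List[str], int, random.Random], List[str]]


def estimate_pns(question: str, steps: List[str], t: int, gold: str,
                 answer_fn: Answerer, rollout_fn: Rollout,
                 n_rollouts: int = 20, seed: int = 0) -> float:
    """Monte Carlo estimate of 1 - P(answer == gold | step t replaced)."""
    rng = random.Random(seed)
    still_correct = 0
    for _ in range(n_rollouts):
        alt_steps = rollout_fn(question, steps, t, rng)
        if answer_fn(question, alt_steps) == gold:
            still_correct += 1
    return 1.0 - still_correct / n_rollouts


def prune_cot(question: str, steps: List[str], gold: str,
              answer_fn: Answerer, rollout_fn: Rollout,
              threshold: float = 0.5, n_rollouts: int = 20) -> List[str]:
    """Greedily drop steps whose estimated PNS falls below the threshold."""
    # Sufficiency check: the full chain must already reach the gold answer.
    if answer_fn(question, steps) != gold:
        return list(steps)  # this sketch does not attempt to add steps
    kept = list(steps)
    t = 0
    while t < len(kept):
        pns = estimate_pns(question, kept, t, gold,
                           answer_fn, rollout_fn, n_rollouts)
        candidate = kept[:t] + kept[t + 1:]
        # Remove the step only if the shorter chain stays sufficient.
        if pns < threshold and answer_fn(question, candidate) == gold:
            kept = candidate
        else:
            t += 1
    return kept


# --- Toy stand-ins for the models (assumptions for illustration only) ---
ESSENTIAL = ("99^2 = 9801", "9801 + 99 = 9900", "9900 + 1 = 9901")


def toy_answer(question: str, steps: List[str]) -> str:
    # Pretend the model answers correctly only if every essential step is present.
    return "9901" if all(s in steps for s in ESSENTIAL) else "unknown"


def toy_rollout(question: str, steps: List[str], t: int,
                rng: random.Random) -> List[str]:
    # Counterfactual substitution: overwrite step t with an uninformative step.
    alt = list(steps)
    alt[t] = rng.choice(["(skipped)", "(unrelated remark)"])
    return alt


question = "Compute 99^2 + 99 + 1 in your head."
steps = ["99^2 = 9801", "99 is an odd number",
         "9801 + 99 = 9900", "9900 + 1 = 9901"]
compact = prune_cot(question, steps, "9901", toy_answer, toy_rollout)
print(compact)
```

In this toy run the remark "99 is an odd number" scores zero, because replacing it never changes the answer, and is dropped, while the three arithmetic steps are kept. With a real LLM the answerer is stochastic, so the number of rollouts and the threshold would need tuning.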

Theorems & Definitions (9)

  • Definition 1: Chain-of-Thought (CoT) Reasoning (Wei et al., 2022)
  • Definition 2: Sufficiency, PS
  • Definition 3: Necessity, PN
  • Definition 4: Probability of Necessary and Sufficient Cause (PNS) in CoT
  • Lemma 1: Identifiability of PNS under $P(\mathbf{A=y}\mid \operatorname{do}(\mathbf{S}),\mathbf{q})=1$
  • Definition 5: CoT Exogeneity
  • Lemma 2: Identifiability of PNS under downstream-adaptive reasoning
  • Lemma 3: Identifiability of PNS under $P(\mathbf{A=y}\mid \operatorname{do}(\mathbf{S}))=1$ without Monotonicity
  • Theorem 1: Equivalence of Perfect Intervention and Full Sufficiency