
Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning

Jingyao Wang, Peizheng Guo, Wenwen Qiang, Jiahuan Zhou, Huijie Guo, Changwen Zheng, Hui Xiong

TL;DR

This work tackles the misalignment between outcome-based rewards and reasoning quality in LLM post-training. It reframes multi-candidate reasoning as counterfactual experiments and introduces G$C^2$PO, which uses an episodic causal counterfactual reward that jointly measures reasoning robustness and expressiveness, then optimizes token-level advantages built from that reward to promote generalizable reasoning patterns. The approach decouples process-valid reasoning from final answers, so the model can learn invariant reasoning mechanisms that transfer across questions. Across diverse benchmarks, G$C^2$PO delivers superior generalization and more structured, goal-oriented reasoning with only modest computational overhead.

Abstract

Large language models (LLMs) excel at complex tasks thanks to advances in their reasoning capabilities. However, existing reward mechanisms remain tightly coupled to final correctness and pay little attention to the underlying reasoning process: trajectories with sound reasoning but wrong answers receive low credit, while lucky guesses with flawed logic may be highly rewarded, which hurts reasoning generalization. From a causal perspective, we interpret multi-candidate reasoning for a fixed question as a family of counterfactual experiments, with theoretical support. Building on this, we propose Group Causal Counterfactual Policy Optimization (G$C^2$PO) to explicitly train LLMs to learn generalizable reasoning patterns. It introduces an episodic causal counterfactual reward that jointly captures (i) robustness, encouraging the answer distribution induced by a reasoning step to remain stable under counterfactual perturbations; and (ii) effectiveness, enforcing sufficient variability so that the learned reasoning strategy can transfer across questions. We then construct token-level advantages from this reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust. Extensive experiments on diverse benchmarks demonstrate its advantages.

Paper Structure (14 sections, 3 theorems, 8 equations, 7 figures, 1 table)

Key Result

Theorem 2.1

For a fixed input $x$, the reasoning of policy $\pi_\theta$ forms a Markov decision process (MDP) with transition kernel $P(s_{t+1} \mid s_t, a_t)$, satisfying: (i) there exists an SCM $\mathcal{M}$ such that, without interventions, the trajectory distribution generated by $\mathcal{M}$ coincides with that of the

Figures (7)

  • Figure 1: (a) Can LLMs "learn by analogy": reasoning trajectories on representative questions. (b) Four trajectory groups defined by process validity and final correctness (Subsection \ref{sec:empirical_evidence}). (c) Empirical results of the motivation experiment. See Appendix D.4 for details.
  • Figure 2: (a) SCM of LLM reasoning. (b) Examples of causal factors and spurious cues. See Appendix D.5 for more details.
  • Figure 3: The framework of G$C^2$PO. It segments reasoning into episodes (Left), calculates episodic causal counterfactual reward via stability and expressiveness terms (Middle), and optimizes the policy by building token-level advantages (Right).
  • Figure 4: Trade-off performance of different methods.
  • Figure 5: Evaluation of training stability.
  • ...and 2 more figures
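
The three stages named in the Figure 3 caption (episode segmentation, an episodic reward, token-level advantages) can be illustrated with a toy sketch. It is not the paper's formulation: the per-episode reward values are placeholders rather than the stability and expressiveness terms, the mean/std normalization over the candidate group is an assumption borrowed from common group-relative policy optimization practice, and the function name `token_level_advantages` is hypothetical.

```python
from statistics import mean, pstdev


def token_level_advantages(episode_rewards, episode_lengths, eps=1e-8):
    """Broadcast group-normalized episode rewards to tokens.

    episode_rewards[i][k]: placeholder reward of episode k in candidate i.
    episode_lengths[i][k]: number of tokens in that episode.
    Returns, for each candidate, a list with one advantage per token.
    """
    # ASSUMPTION: normalize over every episode of every candidate in the group.
    flat = [r for candidate in episode_rewards for r in candidate]
    mu, sigma = mean(flat), pstdev(flat)

    advantages = []
    for rewards, lengths in zip(episode_rewards, episode_lengths):
        tokens = []
        for reward, n_tokens in zip(rewards, lengths):
            advantage = (reward - mu) / (sigma + eps)
            # Every token in an episode shares that episode's advantage.
            tokens.extend([advantage] * n_tokens)
        advantages.append(tokens)
    return advantages


# Two candidate trajectories for one question, with made-up episode rewards.
rewards = [[0.9, 0.7], [0.2, 0.4, 0.3]]
lengths = [[3, 2], [1, 2, 2]]
adv = token_level_advantages(rewards, lengths)
```

In this toy input the first candidate's episodes score above the group mean, so all of its tokens receive positive advantages, while the second candidate's tokens receive negative ones. A real implementation would feed these per-token values into a policy-gradient objective.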

Theorems & Definitions (3)

  • Theorem 2.1
  • Theorem 3.1
  • Theorem 3.2