Causal Consistency Regularization: Training Verifiably Sensitive Reasoning in Large Language Models

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

TL;DR

CSR (Counterfactual Sensitivity Regularization) addresses the faithfulness gap where LLMs produce correct answers with reasoning traces that may be unfaithful. It trains with a learned editor to generate minimally perturbed counterfactual traces $T'$, and uses a KL-divergence based regularizer $\mathcal{L}_{CSR} = D_{KL}(p(Y|T,X) \| p(Y|T',X))$ to force the model to depend on its reasoning. The total objective $\mathcal{L}_{total} = \mathcal{L}_{task} - \lambda \mathcal{L}_{CSR}$, with practical settings $\lambda \in [0.3,0.7]$ and efficient training tricks, yields large improvements in Counterfactual Outcome Sensitivity (COS) across arithmetic, logic, multi-hop QA, and biomedical reasoning, while maintaining near-unchanged task accuracy. The approach transfers across model families, scales with modern architectures, and remains effective under imperfect verifiers, offering a practical route to more trustworthy, verifiable reasoning in structured domains and beyond.
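
The objective above can be illustrated with a minimal sketch. It assumes the answer distributions $p(Y|T,X)$ and $p(Y|T',X)$ are already available as plain probability lists over the same candidate answers; the function names `kl_divergence` and `csr_total_loss`, the `eps` smoothing term, and the default `lam=0.5` (picked from the stated range $[0.3, 0.7]$) are illustrative assumptions, not the authors' implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same answers.

    eps is a hypothetical smoothing constant that avoids log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def csr_total_loss(task_loss, p_orig, p_cf, lam=0.5):
    """Return (L_total, L_CSR) with L_total = L_task - lam * L_CSR.

    p_orig: answer distribution given the original trace T.
    p_cf:   answer distribution given the counterfactual trace T'.
    """
    l_csr = kl_divergence(p_orig, p_cf)
    return task_loss - lam * l_csr, l_csr


# A model that ignores its trace gives the same answers for T and T'.
p_orig = [0.90, 0.05, 0.05]
insensitive_total, insensitive_kl = csr_total_loss(1.0, p_orig, p_orig)

# A model that depends on its trace shifts probability mass under T'.
sensitive_total, sensitive_kl = csr_total_loss(1.0, p_orig, [0.10, 0.80, 0.10])
```

Because the regularizer enters with a minus sign, minimizing the total loss rewards a larger divergence between the two answer distributions: the trace-insensitive case receives no reduction, while the trace-sensitive case ends with a lower total loss.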

Abstract

Large language models can produce correct answers while relying on flawed reasoning traces, partly because common training objectives reward final-answer correctness rather than faithful intermediate reasoning. This undermines trustworthiness in high-stakes settings. We propose Counterfactual Sensitivity Regularization (CSR), a training paradigm that improves reasoning faithfulness by enforcing causal consistency between reasoning steps and outcomes. CSR automatically applies operator-level interventions to reasoning traces, such as swapping "+" with "-", to generate minimally perturbed counterfactual rationales, and penalizes the model when these logically invalid traces still lead to the original answer. Our implementation is efficient, adding about 9 percent training overhead via a warm-start curriculum and token-subset optimization. We evaluate faithfulness using Counterfactual Outcome Sensitivity (COS), which measures how appropriately answers change under logical perturbations. Across arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop question answering (HotpotQA), and code generation (MBPP), CSR yields improved accuracy versus faithfulness trade-offs, establishing a new Pareto frontier. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, and transfers across model families with 94.2 to 96.7 percent success in structured domains. CSR also complements inference-time methods such as self-consistency. Overall, CSR offers a practical route to more reliable reasoning in structured domains, including mathematics, formal logic, and code, where operators are well-defined and verifiable, covering an estimated 40 to 60 percent of high-stakes reasoning deployments.
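
The operator-level intervention described in the abstract can be sketched as a simple text edit. This is a hedged illustration only: the swap table `OPERATOR_SWAPS`, the regular expression, and the helper `swap_first_operator` are hypothetical, and the sketch uses a fixed rule where the TL;DR describes a learned editor.

```python
import re

# Hypothetical swap table; the source only names the "+" to "-" swap.
OPERATOR_SWAPS = {"+": "-", "-": "+", "*": "/", "/": "*"}

# An arithmetic operator sitting between two digits, with optional spaces.
_OPERATOR_BETWEEN_DIGITS = re.compile(r"(\d\s*)([+\-*/])(\s*\d)")


def swap_first_operator(trace):
    """Return a counterfactual trace with the first operator swapped.

    Returns None when the trace contains no operator to edit.
    """
    match = _OPERATOR_BETWEEN_DIGITS.search(trace)
    if match is None:
        return None
    start, end = match.span(2)
    return trace[:start] + OPERATOR_SWAPS[match.group(2)] + trace[end:]


original = "Tom has 3 apples and buys 4 more: 3 + 4 = 7"
counterfactual = swap_first_operator(original)
```

The edited trace is minimally perturbed and logically invalid, since the stated result no longer follows from the operation. Under CSR, a model that still produces the original answer when conditioned on such a trace is penalized.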

Paper Structure

This paper contains 58 sections, 6 theorems, 20 equations, 2 figures, 69 tables, 1 algorithm.

Key Result

Theorem 1

Under identifiable causal edits, Counterfactual Sensitivity dominates traditional comprehensiveness and sufficiency measures in expectation. The complete proof appears in the appendix (thm:dominance-complete).

Figures (2)

  • Figure 1: CSR training process. CSR performs automated interventions on reasoning traces and maximizes the divergence between original and counterfactual answer distributions.
  • Figure 2: Efficiency-faithfulness Pareto frontier across model families. CSR achieves consistent improvements (58-63 COS points) with 9% overhead regardless of base architecture.

Theorems & Definitions (13)

  • Definition 1: Faithfulness Measures
  • Theorem 1: Dominance of Counterfactual Sensitivity
  • Theorem 2: Shortcut Prevention
  • Definition 2: Accepted causally-invalidating edits
  • Theorem 3: Noisy-verifier lower bound
  • proof
  • Corollary 1: Imperfect operator discovery
  • Remark 1: Effective regularization strength
  • Definition 3: Faithfulness Probes - Complete
  • Theorem 4: Dominance of CS over SUFF/COMP under identifiable edits - Complete
  • ...and 3 more