Causal Consistency Regularization: Training Verifiably Sensitive Reasoning in Large Language Models
Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
TL;DR
Counterfactual Sensitivity Regularization (CSR) addresses the faithfulness gap in which LLMs produce correct answers alongside reasoning traces that may be unfaithful. A learned editor generates minimally perturbed counterfactual traces $T'$, and a KL-divergence regularizer $\mathcal{L}_{CSR} = D_{KL}(p(Y|T,X) \| p(Y|T',X))$ forces the model to depend on its reasoning. The total objective is $\mathcal{L}_{total} = \mathcal{L}_{task} - \lambda \mathcal{L}_{CSR}$; because the regularizer enters with a negative sign, minimizing the objective rewards a larger divergence between the answer distributions under $T$ and $T'$. With practical settings $\lambda \in [0.3, 0.7]$ and efficient training tricks, CSR yields large improvements in Counterfactual Outcome Sensitivity (COS) across arithmetic, logic, multi-hop QA, and biomedical reasoning, while leaving task accuracy nearly unchanged. The approach transfers across model families, scales with modern architectures, and remains effective under imperfect verifiers, offering a practical route to more trustworthy, verifiable reasoning in structured domains and beyond.
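To make the objective concrete, the following is a minimal sketch of how $\mathcal{L}_{total}$ could be computed for a single example, assuming the model's answer distributions under the original trace $T$ and the counterfactual trace $T'$ are available as discrete probability vectors over the same answer set. The function names `kl_divergence` and `csr_total_loss`, the smoothing constant `eps`, and the default `lam=0.5` (chosen from inside the stated range) are illustrative assumptions, not the authors' implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same answer set.

    eps is a hypothetical smoothing constant that avoids log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def csr_total_loss(task_loss, p_orig, p_cf, lam=0.5):
    """Return (L_total, L_CSR) with L_total = L_task - lam * L_CSR.

    p_orig approximates p(Y | T, X); p_cf approximates p(Y | T', X).
    """
    l_csr = kl_divergence(p_orig, p_cf)
    return task_loss - lam * l_csr, l_csr


# An insensitive model keeps the same answer distribution under T'.
loss_insensitive, kl_insensitive = csr_total_loss(1.0, [0.9, 0.1], [0.9, 0.1])

# A sensitive model shifts probability mass away from the original answer.
loss_sensitive, kl_sensitive = csr_total_loss(1.0, [0.9, 0.1], [0.2, 0.8])
```

With the same task loss, the sensitive model receives the lower total loss, which is the behavior the regularizer is meant to encourage.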
Abstract
Large language models can produce correct answers while relying on flawed reasoning traces, partly because common training objectives reward final-answer correctness rather than faithful intermediate reasoning. This undermines trustworthiness in high-stakes settings. We propose Counterfactual Sensitivity Regularization (CSR), a training paradigm that improves reasoning faithfulness by enforcing causal consistency between reasoning steps and outcomes. CSR automatically applies operator-level interventions to reasoning traces, such as replacing "+" with "-", to generate minimally perturbed counterfactual rationales, and penalizes the model when these logically invalid traces still lead to the original answer. Our implementation is efficient: a warm-start curriculum and token-subset optimization keep the added training overhead to about 9 percent. We evaluate faithfulness using Counterfactual Outcome Sensitivity (COS), which measures how appropriately answers change under logical perturbations. Across arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop question answering (HotpotQA), and code generation (MBPP), CSR yields better trade-offs between accuracy and faithfulness, establishing a new Pareto frontier. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, and transfers across model families with 94.2 to 96.7 percent success in structured domains. CSR also complements inference-time methods such as self-consistency. Overall, CSR offers a practical route to more reliable reasoning in structured domains, including mathematics, formal logic, and code, where operators are well-defined and verifiable, covering an estimated 40 to 60 percent of high-stakes reasoning deployments.
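As an illustration of an operator-level intervention, the sketch below swaps the first arithmetic operator found between two digits in a trace and leaves everything else untouched, so the stated intermediate result no longer follows. It is a rule-based toy, not the learned editor described above, and `OPERATOR_SWAPS`, `perturb_first_operator`, and `answer_change_rate` are hypothetical names. The change-rate function is only a simplified proxy for COS (the fraction of perturbed traces whose final answer changes) and is not the paper's exact metric.

```python
import re

# Hypothetical swap table; the operator set used in the paper may differ.
OPERATOR_SWAPS = {"+": "-", "-": "+", "*": "/", "/": "*"}

# An operator sitting between two digits, with optional whitespace.
_OPERATOR_PATTERN = re.compile(r"(\d\s*)([+\-*/])(\s*\d)")


def perturb_first_operator(trace):
    """Return the trace with its first operator swapped, or None if none is found."""

    def swap(match):
        return match.group(1) + OPERATOR_SWAPS[match.group(2)] + match.group(3)

    perturbed, count = _OPERATOR_PATTERN.subn(swap, trace, count=1)
    return perturbed if count else None


def answer_change_rate(original_answers, counterfactual_answers):
    """Fraction of examples whose answer changes under the perturbed trace."""
    pairs = list(zip(original_answers, counterfactual_answers))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


counterfactual = perturb_first_operator("She has 3 + 4 = 7 apples.")
rate = answer_change_rate(["7", "12", "5"], ["7", "8", "1"])
```

The regular expression is deliberately naive: it would also match hyphens inside digit strings such as dates, so a practical editor would need a proper expression parser or a learned model.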
