Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Peter Hase, Christopher Potts

TL;DR

Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals as well as simulatability over generic counterfactuals, and suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring.

Abstract

Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual-simulation-training

Paper Structure (15 sections, 4 equations, 40 figures, 2 tables)

Figures (40)

  • Figure 1: Counterfactual Simulation Training (CST) unifies CoT monitoring and counterfactual simulation, improving monitorability of reward hacking and sycophancy, as well as simulatability over generic counterfactual inputs. In both settings, a deepseek-v3-0324 model predicts the gpt-oss-120b model's answers for counterfactual inputs, using either the model's answer (outcome-only monitor) or answer plus CoT (reasoning monitor). Cue-based counterfactuals involve adding cues to prompts (e.g. spoofed answer keys). The monitor predicts what the model would do when the cue is removed. Model-based counterfactuals are LLM-generated with a few-shot prompt.
  • Figure 2: Counterfactual Simulation Training (CST) works by: (1) generating pairs of inputs, by adding a cue to the input or using an LLM to generate counterfactuals; (2) running the model on both inputs (once on the counterfactual, $k$ rollouts on the original); (3) scoring the $k$ generations with the reasoning simulator and outcome-only simulator, which measure faithfulness in terms of counterfactual simulatability; (4) creating training data with positive/negative demonstrations given by the reasoning simulator score $F_\textrm{reasoning-sim}$, with additional positives created by LLM-rewriting of unfaithful CoTs; and (5) training the model with a weighted, contrastive objective to encourage faithful CoT reasoning. Steps #2-#5 are repeated across training rounds.
  • Figure 3: CST improves the simulator's (monitor's) ability to predict model outputs on counterfactuals. For cue-based counterfactuals, CST improves monitor G-mean (the geometric mean of TPR and 1-FPR). For model-generated counterfactuals, CST improves simulator accuracy over counterfactual model outputs. Results are averaged over 5 seeds, and $p$-values are computed via block bootstrap.
  • Figure 4: CST is more effective than prompting for improving monitorability (MMLU with cues). Left: describing the simulation test procedure for Qwen3-235B-A22B helps, though much less than CST. Right: prompting gpt-oss-120b for higher reasoning effort does not improve monitorability.
  • Figure 5: We rewrite model CoTs to improve counterfactual simulatability, rejection sampling against the simulator up to 10 times. Our approach is more time-efficient than RL against the same reward, generalizes better, and leads to more token-efficient reasoning.
  • ...and 35 more figures
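
The G-mean metric named in the Figure 3 caption (geometric mean of TPR and 1-FPR) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `g_mean`, the boolean encoding, and the choice of "positive" class (here assumed to mean that the model's answer depends on the cue) are all assumptions made for illustration.

```python
import math
from typing import Sequence


def g_mean(flags: Sequence[bool], labels: Sequence[bool]) -> float:
    """Geometric mean of TPR and (1 - FPR) for binary monitor decisions.

    flags[i]  -- hypothetical monitor decision: True if it flags example i
    labels[i] -- hypothetical ground truth: True if the model's answer
                 actually depended on the cue (assumed positive class)
    """
    if len(flags) != len(labels):
        raise ValueError("flags and labels must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for f, y in zip(flags, labels) if f and y)
    fn = sum(1 for f, y in zip(flags, labels) if not f and y)
    fp = sum(1 for f, y in zip(flags, labels) if f and not y)
    tn = sum(1 for f, y in zip(flags, labels) if not f and not y)

    # TPR = TP / (TP + FN); 1 - FPR = TN / (TN + FP).
    # An empty class is scored as 0.0 here, which is a convention chosen
    # for this sketch rather than something stated in the paper.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


if __name__ == "__main__":
    flags = [True, True, False, False, True, False]
    labels = [True, True, True, False, False, False]
    print(g_mean(flags, labels))
```

Because the two rates are multiplied, a monitor that flags everything (TPR of 1, FPR of 1) scores 0, so the metric cannot be inflated by trivially predicting one class when the classes are imbalanced.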