Table of Contents
Fetching ...

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah

Abstract

Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

Abstract

Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.

Paper Structure

This paper contains 101 sections, 9 theorems, 40 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Theorem B.4

Suppose $(R_{\text{cot}}\xspace, R_{\text{out}}\xspace)$ is orthogonal at $I$. Then under Assumptions 1--3, for any optimization budget $B$, the trained model uses transparent semantics on input $I$: $\pi_{\text{sem}}^{*}(I) = h(c^*(I, B))$.

Figures (11)

  • Figure 1: Predicting when an RL environment will reduce CoT monitorability. Our conceptual framework classifies reward terms in an RL environment as one of "in-conflict", "orthogonal" or "aligned". Models trained against in-conflict rewards are unmonitorable, while those trained against aligned and orthogonal rewards stay monitorable. We present stylized versions of real model outputs from our RL environments.
  • Figure 2: A) Obfuscated reward hacking and non-transparent CoTs. During obfuscated reward hacking the model expresses one thing in the CoT text (reasoning about code), but is actually using the CoT to compute another (its reward hacking strategy). In this case, we say the CoT is "non-transparent". B) $R_{\text{cot}}\xspace$ shapes CoT text, $R_{\text{out}}\xspace$ shapes CoT computation. We can separately consider what each reward term acts on, as $R_{\text{out}}\xspace$ depends on the computation through the final output, while $R_{\text{cot}}\xspace$ acts on the CoT text.
  • Figure 3: Aligned, Orthogonal, and In-conflict rewards. We classify $(R_{\text{cot}}\xspace, R_{\text{out}}\xspace)$ pairs based on the relationship between CoT text which achieves a high $R_{\text{cot}}\xspace$, and CoT computation which achieves a high $R_{\text{out}}\xspace$. In-Conflict: Punishing reward hacking means $R_{\text{cot}}\xspace$ leads to CoT text which does not talk about reward hacking, but CoT computations which lead to a high $R_{\text{out}}\xspace$ do include text about reward hacking---this is incompatible. Orthogonal: The policy can easily express CoT text about reward hacking while speaking in all caps---CoT text with high $R_{\text{cot}}\xspace$ is compatible with expressing $R_{\text{out}}\xspace$ computations. Aligned: Punishing incorrect reasoning allows the model to output fully correct code. By assumption, this achieves a higher $R_{\text{out}}\xspace$ than the reward hacking strategy. This causes the model to transparently use the computations expressed in its CoT. CoT text with a high $R_{\text{cot}}\xspace$ causes computations that achieve a high $R_{\text{out}}\xspace$.
  • Figure 4: A & B: Empirical monitorability scores match framework predictions. We measure the monitorability of policies trained with different $R_{\text{cot}}\xspace$ rewards across our two environments. Bar colors indicate the theoretical classification of each reward pair: In-conflict, Orthogonal, Aligned. Solid bars show token-level RL results; hatched bars show in-context RL results. As predicted by our framework, in-conflict rewards lead to near-zero monitorability, while orthogonal and aligned rewards preserve high monitorability. C & D: In-conflict rewards are harder to learn. We compare the proportion of runs that achieve a score of $>0.8$ on both $R_{\text{cot}}\xspace$ and $R_{\text{out}}\xspace$ across our training runs---we see that it is rarer for in-conflict rewards to be successfully optimized.
  • Figure 5: In-conflict rewards cause models to suppress reasoning while maintaining task performance. Training curves (top) and representative completions (bottom) for two in-conflict reward pairs. Left: The length penalty compresses coin-tracking reasoning into terse arrow notation that a monitor cannot parse. Right: The CoT monitor causes the model to produce plausible reasoning that doesn't mention a backdoor, while the output code still contains a backdoor. For full completions and training curves see \ref{['app:completions', 'app:all-training-runs']}.
  • ...and 6 more figures

Theorems & Definitions (31)

  • Remark 1: Assumption 3 is optional, as it is unclear how to model $s^{\text{NT}}$
  • Definition B.1: Orthogonal at $I$
  • Definition B.2: In-conflict at $(I, \tau)$
  • Definition B.3: Aligned at $(I, \tau)$
  • Theorem B.4: Orthogonal rewards preserve transparency
  • proof
  • Remark 2
  • Theorem B.5: In-conflict rewards degrade transparency
  • proof
  • Remark 3: The mechanism of obfuscation
  • ...and 21 more