The Geometry of Self-Verification in a Task-Specific Reasoning Model

Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg

TL;DR

This work examines how a task-specific reasoning model, trained on the CountDown task, verifies its own outputs. It combines top-down analyses (GLU vectors, LogitLens, and probing) with a bottom-up analysis of previous-token attention heads. The authors identify GLU_Out vectors (GLU_Valid and GLU_Invalid) that encode verification-related signals, along with a small set of previous-token heads whose removal disables verification; the two analyses converge on a subspace-based interpretation of verification. Disabling as few as three attention heads suppresses self-verification, and antipodal GLU vectors also take part in the verification dynamics, which points to a broader verification circuit. Similar components appear in the base model and in a larger DeepSeek-R1 model, suggesting that these subspaces transfer across models and may inform monitoring or interpretation of hidden-state computations in reasoning models.

Abstract

How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences. With this setup, we do top-down and bottom-up analyses to reverse-engineer how the model verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as "success" or "incorrect". Bottom-up, we find that "previous-token heads" are mainly responsible for self-verification in our setup. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads that can disable self-verification, pointing to a necessary component of a potentially larger verification circuit. Finally, we verify that similar verification components exist in our base model and a general reasoning DeepSeek-R1 model.
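
To make the top-down readout concrete, here is a minimal sketch of a LogitLens-style projection: a vector in the model's hidden space is multiplied by the unembedding matrix, and the highest-scoring tokens are read off. Everything in it is an illustrative assumption rather than the authors' code. The four-token vocabulary, the 3-dimensional `W_U` matrix, and the `glu_valid` / `glu_invalid` vectors are hypothetical toy values, and the final layer norm that a LogitLens readout typically applies is omitted for brevity.

```python
# Toy vocabulary and unembedding matrix (hypothetical values, not from the paper).
VOCAB = ["this", "not", "success", "incorrect"]

# One row per vocabulary token, one column per hidden dimension (d_model = 3).
W_U = [
    [1.0, 0.0, 0.0],    # "this"
    [-1.0, 0.0, 0.0],   # "not"
    [0.5, 1.0, 0.0],    # "success"
    [-0.5, -1.0, 0.0],  # "incorrect"
]


def unembed(vec):
    """Project a d_model vector onto the vocabulary: logits = W_U @ vec."""
    return [sum(w * v for w, v in zip(row, vec)) for row in W_U]


def top_tokens(vec, k=2):
    """Return the k tokens with the highest logits for this vector."""
    logits = unembed(vec)
    ranked = sorted(range(len(VOCAB)), key=lambda i: logits[i], reverse=True)
    return [VOCAB[i] for i in ranked[:k]]


# Hypothetical GLU output vector that promotes verification-success tokens.
glu_valid = [0.2, 1.0, 0.0]
# Its antipode (negated vector) promotes the opposite tokens.
glu_invalid = [-x for x in glu_valid]

print("GLU_Valid   ->", top_tokens(glu_valid))
print("GLU_Invalid ->", top_tokens(glu_invalid))
```

In this toy setup the two vectors point in opposite directions, so the token ranking under one is the exact reverse of the ranking under the other. That is the sense in which antipodal directions can carry "valid" versus "invalid" signals in the same subspace.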

Paper Structure

This paper contains 36 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Averaged LogitLens from 300 samples. We see tokens related to verification ("success", "不合") in the last few layers. (a), (b) show the top tokens when (in)correct solutions are reached. (c), (d) show results from intervening on either $\text{GLU}$ weights or attention heads, given a correct solution. For (c), while the model is less certain (P("this") drops from 0.98 to 0.54), we still see tokens such as "success" showing up. For (d), we no longer see any tokens related to "success", and the model's final next-token predictions closely resemble those when the model has not found a solution (b).
  • Figure 2: Intervention Results: Disabling as few as 3 attention heads disables self-verification, causing the model to generate tokens indefinitely. $\text{A}_\text{Prev}$ refers to 33 previous-token heads. $\text{A}_\text{Prev}$ Baseline refers to the average of 5 runs, each run randomly sampling 33 attention heads. $\text{A}_\text{Verif}$ refers to a subset of 3 previous-token heads. $\text{A}_\text{Verif}$ Baseline refers to the average of 5 runs, each run randomly sampling 3 attention heads.
  • Figure 3: $\text{GLU}_\text{Valid}$ activations before and after turning off 3 $\text{A}_\text{Verif}$ attention heads. Adjacent pairs of blue and orange bars indicate the same $\text{GLU}_\text{Valid}$ vector. Turning off our identified attention heads leads to a significant drop in their activations.
  • Figure 4: Intervention Results for the base model and $\texttt{R1}_\texttt{14B}$. In the base model, $\text{A}_\text{Prev}$ can similarly disable self-verification, while $\text{A}_\text{Verif}$ only plays a partial role in verification, hinting at the effects of RL on their weights. In $\texttt{R1}_\texttt{14B}$, interventions mostly lead to partial success, in which the model first marks a solution as incorrect but then self-corrects, hinting at a larger verification circuit. Also interestingly, the smaller subset of $\text{A}_\text{Verif}$ is more effective at self-verification than $\text{A}_\text{Prev}$.
  • Figure 5: Averaged LogitLens from 300 samples (same as Figure 1 but showing more layers). We see tokens related to verification ("success", "incorrect") in the last few layers. (A), (B) show the top tokens when a correct / incorrect solution is reached. (C), (D) show results from intervening on either $\text{GLU}$ weights or attention heads, given a correct solution. For (C), while the model is less certain (P("this") versus P("not") becomes 0.51 vs. 0.49 in the last layer), we still see tokens such as "success" showing up. For (D), we no longer see any tokens related to "success", and the model is certain that it has not found a solution.
  • ...and 1 more figure
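
The interventions described in Figures 2 and 3 can be pictured with a small toy sketch: attention heads each write a vector into the residual stream, a GLU neuron reads from that stream, and switching heads off changes how strongly the neuron fires. The sketch assumes that "disabling" a head means zeroing its contribution and that the GLU uses a SiLU gate; neither detail is stated in the captions above. The head names (`L1H0` and so on), the 2-dimensional vectors, and the weights `w_gate` and `w_up` are hypothetical values chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def silu(x):
    """SiLU gate nonlinearity: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def residual_from_heads(head_outputs, disabled=()):
    """Sum per-head outputs into one residual vector, skipping disabled heads."""
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for name, out in head_outputs.items():
        if name in disabled:
            continue  # ablation (assumed): this head contributes nothing
        total = [t + o for t, o in zip(total, out)]
    return total


def glu_activation(resid, w_gate, w_up):
    """Scalar activation of one GLU neuron: silu(w_gate . x) * (w_up . x)."""
    return silu(dot(w_gate, resid)) * dot(w_up, resid)


# Hypothetical per-head outputs at the position where verification happens.
head_outputs = {
    "L1H0": [1.0, 0.0],
    "L1H1": [0.5, 0.5],
    "L2H3": [0.0, 1.0],
}
# Hypothetical read-in weights of a single "GLU_Valid"-style neuron.
w_gate = [1.0, 1.0]
w_up = [1.0, 0.0]

before = glu_activation(residual_from_heads(head_outputs), w_gate, w_up)
after = glu_activation(
    residual_from_heads(head_outputs, disabled={"L1H0", "L1H1"}), w_gate, w_up
)
print(f"activation before ablation: {before:.3f}")
print(f"activation after ablation:  {after:.3f}")
```

In this toy example the neuron's activation falls once the two heads that write along its input direction are removed, which mirrors the qualitative pattern reported in Figure 3. The numbers themselves carry no empirical meaning.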