The Geometry of Self-Verification in a Task-Specific Reasoning Model

Andrew Lee; Lihao Sun; Chris Wendler; Fernanda Viégas; Martin Wattenberg

TL;DR

This work studies how a task-specific reasoning model verifies its own outputs by dissecting hidden-state mechanisms on the CountDown task, using both top-down (GLU vectors, LogitLens, and probing) and bottom-up (previous-token attention heads) analyses. The authors identify GLU_Out vectors (GLU_Valid and GLU_Invalid) that encode verification-related signals, and a small set of previous-token heads whose ablation disables verification; the two analyses converge on a subspace-based interpretation of verification. Disabling as few as three attention heads can suppress self-verification, and antipodal GLU vectors also participate in the verification dynamics, which points to a larger verification circuit. Similar components are found in the base model and in a general reasoning DeepSeek-R1 model, which may inform monitoring or interpretation of hidden-state computations in reasoning models.

Abstract

How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences. With this setup, we do top-down and bottom-up analyses to reverse-engineer how the model verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as "success" or "incorrect". Bottom-up, we find that "previous-token heads" are mainly responsible for self-verification in our setup. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads that can disable self-verification, pointing to a necessary component of a potentially larger verification circuit. Finally, we verify that similar verification components exist in our base model and a general reasoning DeepSeek-R1 model.
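
As a rough illustration of the top-down step, the sketch below shows what it means for a weight vector to "encode" a token under a logit-lens style reading: the vector is projected onto the unembedding matrix and the highest-scoring tokens are read off. Everything here is an assumption made for illustration and does not come from the paper: the four-token vocabulary, the 3-dimensional matrix `W_U`, the vectors `glu_valid` and `glu_invalid`, and the helper `logit_lens` are hypothetical toy values, not the model's weights or the authors' code.

```python
# Toy logit-lens sketch. All names and numbers are illustrative assumptions,
# not values taken from the paper or from any real model.
VOCAB = ["success", "incorrect", "the", "42"]

# Hypothetical unembedding matrix: one row per vocabulary token (d_model = 3).
W_U = [
    [1.0, 0.0, 0.2],    # "success"
    [-1.0, 0.1, 0.0],   # "incorrect"
    [0.0, 1.0, 0.0],    # "the"
    [0.0, 0.0, 1.0],    # "42"
]


def logit_lens(vector, unembed=W_U, vocab=VOCAB, k=2):
    """Return the k (token, score) pairs with the largest dot product
    between `vector` and each token's unembedding row."""
    scores = [sum(w * v for w, v in zip(row, vector)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [(vocab[i], scores[i]) for i in ranked[:k]]


# Two hypothetical GLU output vectors pointing in roughly opposite directions.
glu_valid = [0.9, 0.0, 0.1]
glu_invalid = [-0.9, 0.0, 0.1]

top_valid = logit_lens(glu_valid)
top_invalid = logit_lens(glu_invalid)
print(top_valid[0][0], top_invalid[0][0])
```

In this toy setup the two vectors are nearly antipodal, so the top token of one is the lowest-scoring token of the other. That is only meant to make the geometric picture in the abstract concrete and says nothing about the magnitudes observed in the actual model.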
