Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Austin Meek; Eitan Sprejer; Iván Arcuschin; Austin J. Brockmeier; Steven Basart

TL;DR

The paper tackles the challenge of making chain-of-thought (CoT) reasoning monitorable by introducing two axes, Faithfulness and Verbosity, that quantify how transparently a model externalizes its internal reasoning and the factors it uses to solve a task. It presents an LLM-driven pipeline to extract task-relevant causal factors and evaluates CoT monitorability across BBH, GPQA, and MMLU using baseline and cued prompts, reporting dataset-specific results and providing reproducible code via the Inspect library. Key findings show that reasoning models tend to be more faithful and verbose than instruction-tuned ones, but monitorability varies across model families and datasets, with longer CoT traces offering diminishing returns due to inverse scaling. The work offers a practical, two-dimensional monitorability score that treats CoT as a model's external working memory, facilitates safety-oriented monitoring, and lays groundwork for future improvements such as richer factor extraction and multi-turn settings.

Abstract

Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external "working memory", a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
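
To make the two-axis idea concrete, here is a minimal Python sketch of how a per-sample score of this kind could be computed. It is an illustration under stated assumptions, not the paper's released implementation: the abstract does not say how faithfulness and verbosity are combined, so the unweighted mean used here is an assumption, as are the `CotSample` record layout, the binary treatment of faithfulness as "the CoT acknowledges the cue", and all function names.

```python
from dataclasses import dataclass


@dataclass
class CotSample:
    """One judged chain-of-thought trace (hypothetical record layout)."""
    factors_required: set[str]   # factors needed to solve the task
    factors_mentioned: set[str]  # factors the CoT actually verbalizes
    cue_acknowledged: bool       # did the CoT mention the inserted cue?


def verbosity(sample: CotSample) -> float:
    """Fraction of required factors that appear in the CoT."""
    if not sample.factors_required:
        return 1.0  # nothing to verbalize, so nothing is missing
    covered = sample.factors_required & sample.factors_mentioned
    return len(covered) / len(sample.factors_required)


def faithfulness(sample: CotSample) -> float:
    """Binary proxy (assumption): 1.0 if the cue is acknowledged, else 0.0."""
    return 1.0 if sample.cue_acknowledged else 0.0


def monitorability(sample: CotSample) -> float:
    """Assumed combination rule: unweighted mean of the two axes."""
    return 0.5 * (faithfulness(sample) + verbosity(sample))


# A trace that acknowledges the cue but verbalizes only half of the factors.
example = CotSample(
    factors_required={"a", "b", "c", "d"},
    factors_mentioned={"a", "b", "x"},
    cue_acknowledged=True,
)
print(verbosity(example), faithfulness(example), monitorability(example))
```

The example trace is fully faithful under this proxy yet covers only two of the four required factors, which mirrors the abstract's point that a model can appear faithful while remaining hard to monitor.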
