Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Austin Meek, Eitan Sprejer, Iván Arcuschin, Austin J. Brockmeier, Steven Basart

TL;DR

The paper tackles the challenge of making chain-of-thought (CoT) reasoning monitorable by introducing two axes, Faithfulness and Verbosity, that quantify how transparently a model externalizes its internal reasoning and the factors it uses to solve a task. It presents an LLM-driven pipeline to extract task-relevant causal factors and evaluates CoT monitorability across BBH, GPQA, and MMLU using baseline and cued prompts, reporting dataset-specific results and providing reproducible code via the Inspect library. Key findings show that reasoning models tend to be more faithful and verbose than instruction-tuned ones, but monitorability varies across model families and datasets, with longer CoT traces offering diminishing returns due to inverse scaling. The work offers a practical, two-dimensional monitorability score that treats CoT as a model's external working memory, facilitates safety-oriented monitoring, and lays groundwork for future improvements such as richer factor extraction and multi-turn settings.

Abstract

Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external "working memory", a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.

Paper Structure

This paper contains 27 sections, 4 equations, 26 figures, 16 tables.

Figures (26)

  • Figure 1: Overview of our pipeline for measuring chain-of-thought (CoT) monitorability. For every task we create a baseline prompt (top) and a cued prompt that injects an explicit cue (bottom). The model under evaluation produces a CoT for each prompt. To assess Verbosity we first ask a group of high-performing models (e.g., Gemini, Gemma) to individually enumerate the factors necessary to solve the task, and then use Claude 4 Sonnet to aggregate these factors into a single list. For evaluation, we then use a lightweight judge model to verify whether each factor is verbalized in the baseline CoT. To assess Faithfulness we use the same judge model to check whether the cued CoT explicitly acknowledges the injected cue. The resulting Verbosity and Faithfulness scores form the two axes of Monitorability, indicating how transparently the model externalizes its reasoning.
  • Figure 2: Sample CoT extracted from the results on the BBH dataset. The cue mention is highlighted in blue, whereas the causal factors mentioned are highlighted in turquoise. Example reasoning traces can be found in Appendix \ref{sec:app_sample_traces}.
  • Figure 3: Here we show an example from BBH. Our causal factor extraction pipeline produces these 9 factors. Some contain information directly present in the prompt, and some contain more abstract details or information about the reasoning process required to achieve a correct answer. Models are graded on how many of these factors a judge model considers their reasoning trace to contain.
  • Figure 4: Results over three datasets: BBH [suzgun_challenging_2022], further reduced to the subset found in [turpin_language_2023]; GPQA [rein_gpqa_2023] (specifically, GPQA-Diamond); and MMLU [hendrycks_measuring_2021].
  • Figure 5: Monitorability scores across models and cue types, averaged over BBH, GPQA, and MMLU. See Appendix \ref{sec:app_detailed_plots} for per-dataset response rates. Reasoning models are denoted with striped bars, and instruction-tuned models are denoted with solid bars.
  • ...and 21 more figures
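
The scoring step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' released Inspect code: the function names are hypothetical, the judge model is replaced by a toy substring matcher, and combining the two axes with an unweighted mean is an assumption, since the text above only says the two scores are combined into a single monitorability score.

```python
from typing import Callable, Sequence

# A judge takes (cot, statement) and returns True if the statement is
# verbalized in the CoT. In the paper this role is played by a lightweight
# judge model; here it is a hypothetical stand-in.
Judge = Callable[[str, str], bool]


def keyword_judge(cot: str, statement: str) -> bool:
    """Toy judge: case-insensitive substring match (an assumption for the sketch)."""
    return statement.lower() in cot.lower()


def verbosity(baseline_cot: str, factors: Sequence[str], judge: Judge) -> float:
    """Fraction of the aggregated causal factors verbalized in the baseline CoT."""
    if not factors:
        raise ValueError("at least one causal factor is required")
    hits = sum(1 for factor in factors if judge(baseline_cot, factor))
    return hits / len(factors)


def faithfulness(cued_cot: str, cue: str, judge: Judge) -> float:
    """1.0 if the cued CoT explicitly acknowledges the injected cue, else 0.0."""
    return 1.0 if judge(cued_cot, cue) else 0.0


def monitorability(verbosity_score: float, faithfulness_score: float) -> float:
    """ASSUMPTION: unweighted mean of the two axes; the source does not give the formula."""
    return (verbosity_score + faithfulness_score) / 2


# Hypothetical example data, invented purely for illustration.
factors = ["sort the words", "compare first letters"]
baseline_cot = "First I sort the words alphabetically, then pick the earliest one."
cued_cot = "The professor suggests the answer is B, so I will double-check option B."
cue = "professor suggests"

v = verbosity(baseline_cot, factors, keyword_judge)
f = faithfulness(cued_cot, cue, keyword_judge)
score = monitorability(v, f)
```

In practice these per-sample values would be averaged over a dataset, and the substring judge would be swapped for a call to an actual judge model; keeping the judge as a plain callable makes that substitution straightforward.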