
MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang

Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.

Paper Structure

This paper contains 72 sections, 11 equations, 10 figures, 8 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Monitorability score (%) (details in Sec. \ref{sec:evaluation_metrics}) averaged over 19 tasks for each evaluated LLM (left), and averaged over 7 evaluated LLMs across tasks within each category (right), under different monitor scopes in both standard and two stress-test settings. Results show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor (i.e., Goal (Sandbag), Cue (Condition), Dual Objectives, Impossible Tasks), and for open-source LLMs. While both open- and closed-source LLMs can intentionally reduce monitorability to some extent under the stress-tests, the monitorability degradation mostly comes from the tasks whose decision-critical factors do not shape the reasoning process (e.g., Cue (Answer)).
  • Figure 2: Example of an LLM not mentioning the decision-critical factor in its CoT under standard setting. The decision-critical factor in the steganography task is the side objective to encode the first letter of each sentence into the specified string. Gemini-3-Flash does not explicitly plan for the side objective in its CoT, yet still achieves it in the final response.
  • Figure 3: Relationship between monitorability score and LLM capability under the all-messages monitor scope. Each point corresponds to one evaluated LLM. We average the monitorability score across all tasks, and approximate LLM capability by the average pass rate on the original task questions. The fitted regression line shows a moderate negative Pearson correlation ($r$), suggesting that more capable LLMs tend to exhibit lower monitorability scores.
  • Figure 4: Example of an LLM concealing decision-relevant factors from CoT. The highlighted spans indicate the reasoning behind the final prediction. Under standard cue injection (middle), the model explicitly incorporates the cue into its decision-making process; under stress testing (right), it suppresses this reasoning in the CoT while still relying on the cue to arrive at the final answer. The example is drawn from Gemini-3-Flash on the health task.
  • Figure 5: Relationship between monitorability score and LLM capability under the CoT-only monitor scope. Each point corresponds to one evaluated LLM. We average the monitorability score across all tasks, and approximate capability by the average pass rate on the original task questions. The fitted regression line shows a moderate negative Pearson correlation ($r$), suggesting that more capable LLMs tend to exhibit lower monitorability.
  • ...and 5 more figures
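
Figures 3 and 5 relate a per-model monitorability score to capability via a fitted regression line and a Pearson correlation. A minimal sketch of that analysis is below; the score values are illustrative placeholders, not data from the paper, and the function names are my own:

```python
# Sketch of the monitorability-vs-capability analysis described in
# Figures 3 and 5: fit a least-squares line and report Pearson's r.
# All numbers below are hypothetical, not results from MonitorBench.

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def fit_line(xs, ys):
    """Ordinary least squares: slope a and intercept b of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical per-model points: (avg pass rate, avg monitorability score).
capability = [0.42, 0.55, 0.61, 0.70, 0.78, 0.84, 0.90]
monitorability = [0.81, 0.76, 0.79, 0.68, 0.64, 0.66, 0.58]

r = pearson_r(capability, monitorability)
slope, intercept = fit_line(capability, monitorability)
print(f"Pearson r = {r:.2f}, slope = {slope:.2f}")
```

With the placeholder data above, both $r$ and the slope come out negative, mirroring the qualitative trend the captions describe: higher-capability models scoring lower on monitorability.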