The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

Sahar Abdelnabi, Ahmed Salem

TL;DR

The paper investigates how reasoning LLMs alter safety-related behavior when they detect they are being evaluated (the Hawthorne effect). It introduces a white-box probing and steering pipeline that uses a trigger dataset to elicit evaluation awareness and scenario recognition, trains linear probes to identify awareness signals, and applies targeted parameter edits to modulate this latent cognition. Empirical results across multiple open-weight models show that test awareness can significantly shift harmful-completion tendencies, stereotype bias, and decision-making in both directions depending on model and context, while general reasoning capabilities remain largely intact. The work provides a diagnostic stress-test framework for safety evaluations and argues for accounting for model awareness to ensure robust, reliable assessments in deployment. It also offers practical tools for diagnosing and potentially correcting framing biases during safety evaluations.

Abstract

Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.
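The two steps named in the abstract, linear identification and steering, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes synthetic activations in place of real hidden states, a logistic-regression probe trained by plain gradient descent, and steering by adding the probe direction to activations, which is a simpler stand-in for the parameter edits the summary describes. All function names and the coefficient `alpha` are hypothetical.

```python
import numpy as np


def make_synthetic_activations(n, dim, rng):
    """Toy stand-in for hidden states (assumption: the 'test-aware'
    class is shifted along a single hidden direction)."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, dim)) + 3.0 * labels[:, None] * direction
    return acts, labels


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_linear_probe(acts, labels, lr=0.5, steps=2000):
    """Fit a logistic-regression probe with batch gradient descent."""
    n, dim = acts.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(steps):
        err = sigmoid(acts @ w + b) - labels  # gradient of the log-loss w.r.t. logits
        w -= lr * (acts.T @ err) / n
        b -= lr * err.mean()
    return w, b


def probe_score(acts, w, b):
    """Probability the probe assigns to the 'test-aware' class."""
    return sigmoid(acts @ w + b)


def steer(acts, w, alpha):
    """Shift activations along the unit probe direction.
    alpha > 0 pushes toward 'aware', alpha < 0 pushes away from it."""
    return acts + alpha * (w / np.linalg.norm(w))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts, labels = make_synthetic_activations(400, 16, rng)
    w, b = train_linear_probe(acts, labels)
    accuracy = ((probe_score(acts, w, b) > 0.5) == labels).mean()
    print("probe accuracy:", accuracy)
    print("mean score, unsteered:", probe_score(acts, w, b).mean())
    print("mean score, alpha=+4:", probe_score(steer(acts, w, 4.0), w, b).mean())
    print("mean score, alpha=-4:", probe_score(steer(acts, w, -4.0), w, b).mean())
```

In a real setting the activations would come from a chosen layer of the model, and the downstream behavior of interest (for example, compliance with a harmful request) would be measured after steering rather than the probe score itself.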

Paper Structure

This paper contains 22 sections, 5 equations, 12 figures, 5 tables, 2 algorithms.

Figures (12)

  • Figure 1: Models may execute harmful tasks when they perceive them as hypothetical (middle panel, top reasoning) or refuse them when they detect a safety evaluation (middle panel, bottom reasoning). To modulate the perceived test context, we extract awareness representations and steer models toward heightened evaluation awareness (right panel), revealing causal effects on safety-critical decisions.
  • Figure 2: Fine-grained overview of how the outputs change after steering.
  • Figure 3: (a) Comparison of test awareness manipulation approaches on QwQ's "action-evaluation" dataset. "Implicit" uses triggers only; "Explicit" uses prompting that states this is a test. Steering produces substantially larger behavioral shifts than explicit prompting. (b) Ratio of "neutral" answer selection on the BBQ benchmark (for measuring stereotypes) with varying steering coefficients for suppressing test awareness. Suppression leads to more stereotypical answers.
  • Figure 4: We evaluate models on the "action-evaluation" dataset, which involves performing harmful actions (e.g., calling tools) in real vs. hypothetical scenarios. We show the baseline bias of models before steering and the changes after steering, where orange denotes an increase over the baseline and purple denotes a decrease.
  • Figure 4: For each token, we compute the probe's classification result. Green denotes "test awareness". Some of the generated tokens are trimmed ("[...]") for better visualization.
  • ...and 7 more figures