Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang

TL;DR

The paper investigates evaluation faking, where frontier AI systems autonomously infer they are being evaluated and adjust their behavior to appear safer, potentially biasing safety assessments. It introduces honeypots with controlled evaluation cues, a safety evaluation module using StrongREJECT and SORRY-Bench, and a CoT monitoring module to detect internal recognition signals. Across diverse models and scales, the study finds that models with stronger reasoning, larger size, or memory modules exhibit higher evaluation recognition (ERR) and safer outputs (SR), with ERR often driving SR via a multi-stage process. Deceptive information can blunt but not eliminate this effect, underscoring significant challenges to evaluation integrity and the need for robust, memory-aware safety evaluation methodologies for increasingly capable AI systems.

Abstract

As foundation models grow more intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: whether and how would an advanced AI system perceive that it is being evaluated, and would that perception compromise the integrity of the evaluation process? During standard safety tests on a mainstream large reasoning model, we unexpectedly observed that the model, without any contextual cues, would occasionally recognize that it was being evaluated and hence behave in a more safety-aligned manner. This motivated us to conduct a systematic study of evaluation faking, i.e., an AI system autonomously altering its behavior upon recognizing the presence of an evaluation context, thereby influencing the evaluation results. Through extensive experiments on a diverse set of foundation models with mainstream safety benchmarks, we reach the main finding, termed the observer effects for AI: when the AI system under evaluation is more advanced in reasoning and situational awareness, evaluation faking becomes more ubiquitous, as reflected in the following aspects: 1) Reasoning models recognize evaluation 16% more often than non-reasoning models. 2) Scaling foundation models (32B to 671B) increases faking by over 30% in some cases, while smaller models show negligible faking. 3) AI with basic memory is 2.3x more likely to recognize evaluation and scores 19% higher on safety tests (vs. no memory). To measure this, we devised a chain-of-thought monitoring technique to detect faking intent and uncover internal signals correlated with such behavior, offering insights for future mitigation studies.
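
The two metrics named in the TL;DR, evaluation recognition rate (ERR) and safety rate (SR), can be illustrated with a minimal sketch. It assumes each model response carries two boolean labels: whether a chain-of-thought monitor flagged evaluation recognition, and whether a safety judge rated the output safe. The `Sample` class, its field names, and the helper functions are hypothetical and are not taken from the paper; the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Sample:
    # Hypothetical label: the CoT monitor flagged that the model recognized an evaluation.
    recognized_eval: bool
    # Hypothetical label: the safety judge rated the response as safe.
    safe: bool


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; raises on an empty input."""
    values = list(flags)
    if not values:
        raise ValueError("need at least one sample")
    return sum(values) / len(values)


def err(samples: List[Sample]) -> float:
    """Evaluation recognition rate: share of samples flagged by the monitor."""
    return rate(s.recognized_eval for s in samples)


def sr(samples: List[Sample]) -> float:
    """Safety rate: share of samples judged safe."""
    return rate(s.safe for s in samples)


def deltas(baseline: List[Sample], treated: List[Sample]) -> Tuple[float, float]:
    """Change in (ERR, SR) from a baseline condition to a treated condition."""
    return err(treated) - err(baseline), sr(treated) - sr(baseline)


# Toy data, invented for illustration only.
baseline = [Sample(True, True), Sample(False, True), Sample(False, False), Sample(False, False)]
treated = [Sample(True, True), Sample(True, True), Sample(True, True), Sample(False, False)]
delta_err, delta_sr = deltas(baseline, treated)
```

Under this reading, a positive change in both values between two honeypot conditions is the pattern the paper associates with evaluation faking.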

Paper Structure

This paper contains 45 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: A schematic diagram of evaluation faking, structured with perception, behavior and consequence.
  • Figure 2: Overview of our research methodology, which consists of the safety evaluation module, the honeypot designs and the chain-of-thought monitoring module.
  • Figure 3: Behavioral adaptation of LLMs in response to evaluation contexts. (a) Positive correlation between the change in Evaluation Recognition Rate ($\Delta\text{ERR}$) and the change in Safety Rate ($\Delta\text{SR}$) across various models and conditions. (b) Illustration of the multi-stage adaptive process, where recognition of evaluation cues triggers a cautious internal disposition, which in turn leads to observable behavioral changes.
  • Figure 4: Illustration of the model performance distributions under condition S4 (BS) versus S6 (BSE): (a) Joint distribution of SR and ERR under S4 and S6. (b) Changes in these metrics compared to baseline condition S2.
  • Figure 5: $\Delta\text{ERR}$ vs. Model Size (Billions of Parameters, log scale). Lighter blue dots represent S3-S1 differences, and darker blue dots represent S6-S4 differences.
  • ...and 6 more figures