Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ryan Lagasse, Vasu Sharma, Ashwinee Panda

TL;DR

The paper addresses the risk that AI safety evaluations are compromised by evaluation awareness, where models detect when they are being tested. It demonstrates a power-law scaling of evaluation awareness with model size across 15 instruction-tuned models (0.27B–70B) from four families using linear probes on steering-vector activations and paired test/deployment prompts, evaluated via AUROC on the Situational Awareness Dataset. The key contributions include establishing cross-family scaling of awareness, showing that larger models exhibit stronger deceptive signals, and providing a framework for forecasting this behavior in future models to inform scale-aware evaluation strategies. The findings have practical impact for designing robust evaluation protocols, improving transparency, and guiding governance as model systems continue to scale.

Abstract

Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behavior known as evaluation awareness. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single 70B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across 15 models scaling from 0.27B to 70B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. The implementation is available at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
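As a rough illustration of what a power-law relationship between model size and probe performance looks like, the sketch below fits y ≈ a · N^b by ordinary least squares in log-log space. The function name fit_power_law, the choice of log-log regression, and every number in it are assumptions made for illustration; the values are synthetic and are not the paper's data or its fitting procedure.

```python
import math


def fit_power_law(sizes, values):
    """Fit values ≈ a * sizes**b by least squares on (log size, log value)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log space = log(a)
    return a, b


# Synthetic, noise-free points generated from a known power law.
# These are NOT measurements from the paper.
sizes_b = [0.27, 1.0, 4.0, 12.0, 27.0, 70.0]   # model size, billions of params
true_a, true_b = 0.05, 0.4                      # hypothetical coefficients
scores = [true_a * s ** true_b for s in sizes_b]

a, b = fit_power_law(sizes_b, scores)

# Extrapolate to a hypothetical larger model. A distance-from-chance metric
# is bounded above by 0.5, so a pure power law cannot hold indefinitely.
forecast = min(a * 140.0 ** b, 0.5)
print(f"a={a:.4f}, b={b:.4f}, forecast at 140B={forecast:.4f}")
```

With real measurements the points would be noisy, so it would be worth reporting the fit quality and a confidence interval on the exponent before relying on any extrapolation.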

Paper Structure

This paper contains 18 sections, 8 figures, 1 table.

Figures (8)

  • Figure 3: AUROC absolute distance from 0.5 for GPT-OSS-20B tested under three chat template configurations. Each point corresponds to the peak AUROC observed when the model was run with a specific template. The x-axis indicates model size in billions of parameters (fixed at 20B for GPT-OSS), and the y-axis shows the AUROC distance from chance (0.5). Results are reported for Hugging Face’s default chat template, a mock Harmony template, and a Jinja2 template, each represented by separate markers.
  • Figure 4: Scatter plot with a smoothed trend line that shows AUROC absolute distance from 0.5 as a function of model size (billions of parameters, linear scale). Each point shows the best-performing probe for a given model, with shapes and colors indicating various model families.
  • Figure 5: Scatter plot with a smoothed trend line that shows AUROC absolute distance from 0.5 as a function of model size (billions of parameters, linear scale). Each point shows the best-performing probe for a given model, with shapes and colors indicating various model families.
  • Figure : (a) AUROC score vs. relative layer depth results of the Phi-3 model family.
  • Figure : (a) AUROC score vs. relative layer depth results of the Phi-3 model family.
  • ...and 3 more figures
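The figure captions above report "AUROC absolute distance from 0.5". A minimal sketch of that metric follows, assuming a probe that outputs one scalar score per prompt and binary labels (1 for test prompts, 0 for deployment prompts). The helper names auroc and distance_from_chance, the label convention, and the toy scores are hypothetical and are not taken from the paper's code.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative.

    Equivalent to the normalized Mann-Whitney U statistic; ties count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def distance_from_chance(scores, labels):
    """Absolute distance of AUROC from 0.5 (0 = chance, 0.5 = perfect)."""
    return abs(auroc(scores, labels) - 0.5)


# Toy probe outputs (illustrative only, not results from the paper).
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]

d = distance_from_chance(probe_scores, labels)

# Flipping the labels mirrors the AUROC around 0.5, so the distance is
# unchanged: a probe direction with the opposite sign is equally informative.
flipped = [1 - y for y in labels]
d_flipped = distance_from_chance(probe_scores, flipped)
print(d, d_flipped)
```

Taking the absolute distance makes the metric insensitive to the sign of the probe direction, which is presumably why the captions use it rather than raw AUROC.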