Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure

Viliana Devbunova

Abstract

Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled $2\times2$ dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts, regardless of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.

Paper Structure (41 sections, 2 figures, 6 tables)

This paper contains 41 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Overview of our approach. (a) Standard probes train on benchmark vs. chat prompts, where format and context are confounded. (b) Our $2\times2$ design crosses format and context independently, enabling isolation of each factor.
  • Figure 2: Length distributions (in characters) across the four datasets. Casual-Deploy is histogram-matched to Bench-Eval. Bench-Deploy is slightly longer due to formatting overhead, while Casual-Eval (1st turn) is naturally shorter.
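
The Figure 2 caption says Casual-Deploy is histogram-matched to Bench-Eval in character length. The exact procedure is not given in this excerpt, so the following is only a minimal sketch of one common way to do such matching: bin both sets by length and subsample the candidate pool so that its per-bin counts follow the target. The function name `histogram_match`, the `bin_width` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def histogram_match(target_lengths, candidates, bin_width=50, seed=0):
    """Subsample `candidates` so their character-length histogram follows
    `target_lengths` (hypothetical helper, not the paper's code).

    Bins that hold too few candidates are filled as far as possible, so the
    result can be shorter than the target set.
    """
    rng = random.Random(seed)
    # Count how many target prompts fall into each length bin.
    target_counts = Counter(length // bin_width for length in target_lengths)
    # Group candidate prompts into the same bins by character length.
    pool = defaultdict(list)
    for text in candidates:
        pool[len(text) // bin_width].append(text)
    matched = []
    for bin_id, count in sorted(target_counts.items()):
        available = pool.get(bin_id, [])
        k = min(count, len(available))
        # Draw without replacement from the candidates in this bin.
        matched.extend(rng.sample(available, k))
    return matched


# Toy data (assumed): lengths of benchmark prompts and a pool of casual prompts.
bench_eval_lengths = [120, 130, 480, 510, 520]
casual_pool = ["a" * n for n in (100, 110, 125, 140, 300, 490, 505, 515, 530, 900)]
subset = histogram_match(bench_eval_lengths, casual_pool, bin_width=100)
matched_bins = Counter(len(text) // 100 for text in subset)
```

In this toy run, `matched_bins` has the same per-bin counts as the target lengths, because every target bin has enough candidates. Matching length in this way removes only one surface cue; it does not by itself control for format or genre.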