Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure

Viliana Devbunova

Abstract

Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect evaluation context or surface structure. We test whether these signals persist under partial control of prompt format, using a controlled 2x2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts, regardless of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, which limits the evidential strength of existing results.

Paper Structure (41 sections, 2 figures, 6 tables)

Introduction
Related Work
Setup and Methods
The 2x2 Controlled Design
Dataset Definitions.
Experimental Setup
Training Configurations
Observed Outcome
The Format Trap.
Paired Training Success.
Reason for Failure
Diagnosis.
Why this is not a trivial confound.
Boundary conditions.
Discussion
...and 26 more sections

Figures (2)

Figure 1: Overview of our approach. (a) Standard probes train on benchmark vs. chat prompts, where format and context are confounded. (b) Our $2\times2$ design crosses format and context independently, enabling isolation of each factor.
Figure 2: Length distributions (in characters) across the four datasets. Casual-Deploy is histogram-matched to Bench-Eval. Bench-Deploy is slightly longer due to formatting overhead, while Casual-Eval (1st turn) is naturally shorter.
