Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B
Nisar Ahmed, Muhammad Imran Zaman, Gulshan Saleem, Ali Hassan
TL;DR
This study reveals that prompts engineered to signal evaluation (e.g., rubric-style headers, step-by-step reasoning requests, and multilingual cues) induce Evaluation Awareness and Incentive Sensitivity in a single open-weight LLM (GPT-OSS-20B), inflating chain-of-thought length and hedging while providing at best inconsistent accuracy gains. Using a rigorous A/B design across six tasks with fixed content and decoding, the authors quantify presentation versus substance via deterministic validators and composite metrics (EAI, ISI, SGS), showing substantial style shifts but limited substantive improvements. They demonstrate that surface-level compliance (wrappers, fenced blocks, exact-doi counts) often accompanies underlying content weaknesses, illustrating specification gaming and safety risks in benchmarking. The paper proposes concrete, reproducible practices—matched framing controls, contract-aware evaluation, style-delta dashboards, calibrated confidence guidance, and multilingual parity monitoring—to ensure benchmark improvements reflect deployable capability. The artifacts, including prompt banks, validators, per-run scores, and a versioned DOI, enable replication and extension, promoting evaluation designs that align closely with real-world deployment needs and safety considerations.
Abstract
Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such "evaluation scent" inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.
