Resurrecting saturated LLM benchmarks with adversarial encoding
Igor Ivanov, Dmitrii Volkov
TL;DR
The paper tackles rapid benchmark saturation in large language models by introducing two adversarial encodings: paired questions and expanded answer options. It systematically evaluates these encodings on WMDP-bio, GPQA, and MMLU-Pro, quantifying absolute and relative performance drops and testing mitigation via alternative formats and fine-tuning. Key findings show that paired questions and more options reliably depress performance for capable models, effectively unsaturating benchmarks and giving older tests a second life (exemplified by Re-MMLU and tinyRe-MMLU). The work highlights practical implications for benchmark design and proposes future directions, including additional encodings and an assessment of how resilient reasoning models are to adversarial encodings.
Abstract
Recent work showed that small changes to benchmark questions can degrade LLMs' reasoning and recall. We explore two such changes, pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models these changes predictably reduce performance, effectively raising the benchmark's performance ceiling and unsaturating it. We suggest this approach can resurrect old benchmarks.
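
To make the two encodings concrete, the following is a minimal Python sketch of how a multiple-choice item might be paired with another item or given extra answer options. It is an illustration only, not the authors' implementation: the `MCQ` class, the helper names, the prompt wording, the "two letters" answer format, and the idea of drawing extra options from a supplied `distractors` list are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class MCQ:
    """A multiple-choice question; `answer` is an index into `options`."""
    question: str
    options: list[str]
    answer: int


def letter(i: int) -> str:
    """Map option index 0, 1, 2, ... to 'A', 'B', 'C', ..."""
    return chr(ord("A") + i)


def render(q: MCQ) -> str:
    """Format a question followed by its lettered options."""
    lines = [q.question]
    lines += [f"{letter(i)}. {opt}" for i, opt in enumerate(q.options)]
    return "\n".join(lines)


def pair_questions(q1: MCQ, q2: MCQ) -> tuple[str, str]:
    """Encode two questions as one prompt (hypothetical wording).

    Returns the prompt and the gold answer string; under this assumed
    scoring, a response counts as correct only if both letters match.
    """
    prompt = (
        "Answer both questions. Reply with two letters, e.g. 'A, C'.\n\n"
        f"Question 1: {render(q1)}\n\nQuestion 2: {render(q2)}"
    )
    gold = f"{letter(q1.answer)}, {letter(q2.answer)}"
    return prompt, gold


def add_options(q: MCQ, distractors: list[str], rng: random.Random) -> MCQ:
    """Return a copy of `q` with extra wrong options mixed in and shuffled.

    Assumes the original options are unique and that every distractor
    is actually incorrect for this question.
    """
    correct = q.options[q.answer]
    options = q.options + [d for d in distractors if d not in q.options]
    rng.shuffle(options)
    return MCQ(q.question, options, options.index(correct))
```

For example, pairing a question whose answer is option B with one whose answer is option C yields the gold string `B, C`, and calling `add_options` with two new distractors on a four-option item yields a six-option item whose `answer` index still points at the original correct option.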
