Resurrecting saturated LLM benchmarks with adversarial encoding

Igor Ivanov, Dmitrii Volkov

TL;DR

The paper tackles rapid benchmark saturation in large language models by introducing adversarial encodings through paired questions and expanded answer options. It systematically evaluates these encodings on WMDP-bio, GPQA, and MMLU-Pro, quantifying absolute and relative performance drops and testing mitigation via alternative formats and fine-tuning. Key findings show that paired questions and more options reliably depress performance for capable models, effectively unsaturating benchmarks and enabling a second life for older tests (exemplified by Re-MMLU and tinyRe-MMLU). The work highlights practical implications for benchmark design and proposes future directions, including additional encodings and assessment of reasoning-model resilience to adversarial encodings.
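
As a rough illustration of the two quantities mentioned above, the sketch below computes an absolute and a relative performance drop from two accuracy values. The definition of the relative drop as the absolute drop divided by the original accuracy is an assumption made for illustration, not a formula taken from the paper, and the function name `performance_drops` is hypothetical.

```python
def performance_drops(original_acc: float, encoded_acc: float) -> tuple[float, float]:
    """Return (absolute_drop, relative_drop) between two accuracies in [0, 1].

    Assumed definitions (not confirmed by the paper):
      absolute drop = original accuracy - accuracy on the encoded benchmark
      relative drop = absolute drop / original accuracy
    """
    if not 0.0 < original_acc <= 1.0:
        raise ValueError("original_acc must be in (0, 1]")
    if not 0.0 <= encoded_acc <= 1.0:
        raise ValueError("encoded_acc must be in [0, 1]")
    absolute = original_acc - encoded_acc
    relative = absolute / original_acc
    return absolute, relative


if __name__ == "__main__":
    # Made-up accuracies, used only to show the call.
    absolute, relative = performance_drops(0.8, 0.6)
    print(f"absolute drop: {absolute:.2f}, relative drop: {relative:.2%}")
```

A relative measure of this kind makes drops comparable across models whose starting accuracies differ.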

Abstract

Recent work showed that small changes in benchmark questions can reduce LLMs' reasoning and recall. We explore two such changes, pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models, these changes predictably reduce performance, essentially raising the performance ceiling of a benchmark and unsaturating it again. We suggest this approach can resurrect old benchmarks.

Paper Structure

This paper contains 33 sections, 9 figures.

Figures (9)

  • Figure 1: Example of converting two single questions into a paired multiple-choice question. See the sample paired question for a concrete example from our benchmark.
  • Figure 2: Relative performance drop on paired questions for WMDP-bio, GPQA and MMLU-Pro.
  • Figure 3: Alternative method for pairing questions. The resulting benchmark has the square of the original number of answer options. See the sample alternative paired question for a concrete example.
  • Figure 4: Relative performance drops on two versions of paired-question WMDP-bio, calculated per individual question.
  • Figure 5: Relative performance drops on paired-question benchmarks: original vs. WMDP-chem-fine-tuned GPT-4o and GPT-4o mini.
  • ...and 4 more figures
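
The Figure 3 caption says the alternative pairing squares the number of answer options. One way that could come about is by taking every combination of the two questions' options, sketched below. This is a minimal illustration under that assumption, not the authors' implementation: the dictionary layout, the `pair_questions` name, and the way combined options are worded are all hypothetical.

```python
from itertools import product


def pair_questions(q1: dict, q2: dict) -> dict:
    """Combine two multiple-choice questions into one paired question.

    Each input is assumed to look like:
      {"question": str, "options": list[str], "answer": int}
    where "answer" is the index of the correct option.
    Every (option of q1, option of q2) combination becomes one new option,
    so two 4-option questions yield 16 options.
    """
    index_pairs = list(product(range(len(q1["options"])), range(len(q2["options"]))))
    options = [
        f'Q1: {q1["options"][i]}; Q2: {q2["options"][j]}' for i, j in index_pairs
    ]
    # The correct combined option is the one pairing both correct answers.
    answer = index_pairs.index((q1["answer"], q2["answer"]))
    question = f'Question 1: {q1["question"]}\nQuestion 2: {q2["question"]}'
    return {"question": question, "options": options, "answer": answer}


if __name__ == "__main__":
    # Toy questions, invented purely for demonstration.
    q1 = {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": 1}
    q2 = {"question": "3 * 3 = ?", "options": ["6", "8", "9", "12"], "answer": 2}
    paired = pair_questions(q1, q2)
    print(len(paired["options"]), paired["options"][paired["answer"]])
```

In this sketch the options are left in product order; a real benchmark would presumably shuffle them so that the correct index carries no positional signal.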