Training on the Test Task Confounds Evaluation and Emergence

Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt

TL;DR

This work shows that training on the test task, that is, using knowledge of the evaluation tasks at training time, can substantially inflate benchmark scores, confounding cross-model comparisons and claims of emergent capabilities. The authors propose a simple adjustment: fine-tune every model on the same, sufficient amount of task-relevant data before evaluation. After this adjustment, the performance gaps between newer and older models largely vanish, putting models of different ages on a common footing. The same confound appears when other benchmarks are reformulated as MMLU-style questions, and it too disappears after the adjustment. The results cover 56 models and two major benchmarks (MMLU and GSM8K). The findings imply that benchmark progress has been overstated and that emergence shifts to smaller scales once test-task training is accounted for, which offers a practical path toward fairer, more predictive scaling laws. The work calls for evaluation practice to treat training on the test task as a core determinant of benchmark performance and emergence.

Abstract

We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of practices that utilize knowledge about evaluation tasks at training time. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for the effect of training on the test task on benchmark evaluations: put simply, fine-tune each model under comparison on the same task-relevant data prior to evaluation. We then show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models, with broad implications for benchmarking and the study of emergent capabilities.

Paper Structure (42 sections, 5 equations, 21 figures, 3 tables)


Figures (21)

  • Figure 1: MMLU and GSM8K scores of 56 base models, with model sizes ranging from 70M to 70B parameters. Solid lines correspond to the regression fit of $A = \alpha\max(0, \log C - c_e) + \theta N + r$, where $A$ is accuracy, $C$ is pretraining compute, $N$ indicates whether the model was trained after November 2023, and $r$ is random chance accuracy. The coefficient $\theta$ denotes the average improvement of models trained after November 2023 when controlling for pretraining compute. Bold indicates statistical significance with $p$-value $<0.05$. (Top) We hypothesize that training on the test task confounds benchmark evaluations, resulting in newer base models substantially outperforming older ones. (Bottom) We propose to adjust for differences in test task training by fine-tuning all models on the same, sufficient amount of task-specific data before evaluation. After fine-tuning on the test task, differences in benchmark performance between older and newer models vanish.
  • Figure 2: Models trained before November 2023 tend to benefit much more from fine-tuning on task data.
  • Figure 3: Models trained before November 2023 (●) without fine-tuning and (●) after fine-tuning on the test task. Their difference in benchmark performance $\widehat{\theta}$ resembles that between newer and older models. After adjusting by training on the test task, their difference vanishes. Bold indicates significance with $p<0.05$.
  • Figure 4: Reformulating ARC and HellaSwag as MMLU-style questions gives rise to large differences $\widehat{\theta}$ between models trained (●) before November 2023 and (●) after November 2023. After adjusting by fine-tuning on the test task, differences in performance vanish. Bold indicates significance with $p<0.05$.
  • Figure 5: When evaluating MMLU using "cloze" prompts, models trained (●) after November 2023 no longer outperform those trained (●) before November 2023 (middle). When using Brier score as the evaluation metric, we still observe sharp improvements in performance between $10^{22}$ and $10^{23}$ FLOPs (right).
  • ...and 16 more figures
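
The regression named in the Figure 1 caption, $A = \alpha\max(0, \log C - c_e) + \theta N + r$, can be illustrated with a small fitting routine. This is a minimal sketch and not the authors' fitting code. The grid search over $c_e$, the base-10 logarithm, the function name `fit_hinge_regression`, and the synthetic data are all assumptions introduced here for illustration. The chance level $r$ is treated as known (for example 0.25 for four-way multiple choice), so only $\alpha$, $\theta$, and $c_e$ are estimated.

```python
import numpy as np


def fit_hinge_regression(log_c, is_new, acc, chance, n_grid=200):
    """Fit A = alpha * max(0, log C - c_e) + theta * N + r, with r fixed to `chance`.

    Hypothetical helper: for each candidate c_e on a grid, alpha and theta are
    obtained by ordinary least squares; the c_e with the lowest squared error wins.
    """
    log_c = np.asarray(log_c, dtype=float)
    is_new = np.asarray(is_new, dtype=float)
    y = np.asarray(acc, dtype=float) - chance  # subtract the known chance level r

    best = None
    # endpoint=False keeps at least one model above the candidate threshold c_e
    for c_e in np.linspace(log_c.min(), log_c.max(), n_grid, endpoint=False):
        hinge = np.maximum(0.0, log_c - c_e)
        X = np.column_stack([hinge, is_new])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((X @ coef - y) ** 2))
        if best is None or sse < best["sse"]:
            best = {"alpha": float(coef[0]), "theta": float(coef[1]),
                    "c_e": float(c_e), "sse": sse}
    return best


def make_synthetic(n=56, alpha=0.12, theta=0.08, c_e=21.0, chance=0.25, seed=0):
    """Generate synthetic (not real) scores that follow the regression form."""
    rng = np.random.default_rng(seed)
    log_c = rng.uniform(19.0, 24.0, size=n)   # assumed log10 of pretraining FLOPs
    is_new = rng.integers(0, 2, size=n)       # 1 = trained after the cutoff date
    noise = rng.normal(0.0, 0.01, size=n)
    acc = alpha * np.maximum(0.0, log_c - c_e) + theta * is_new + chance + noise
    return log_c, is_new, acc


log_c, is_new, acc = make_synthetic()
fit = fit_hinge_regression(log_c, is_new, acc, chance=0.25)
print({k: round(v, 3) for k, v in fit.items()})
```

In this reading, an estimate of $\theta$ close to zero after all models are fine-tuned on the same task data corresponds to the vanishing gap between newer and older models described in the caption. Assessing significance, as the bold markers in the figures do, would need standard errors, which this sketch does not compute.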