Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt
TL;DR
This work shows that training models on test-task (TT) information can substantially inflate benchmark scores, confounding cross-model comparisons and claims of emergent capabilities. The authors propose a simple adjustment: fine-tune every model on the same sufficient, task-relevant dataset before evaluation. With this adjustment, performance gaps attributed to newer models largely evaporate, putting models of different ages on a common footing. They show that reformulating benchmarks and controlling for TT exposure also reproduces these confounds, and they support the findings with experiments across 56 models and two major benchmarks (MMLU and GSM8K). The findings imply that benchmark progress has been overstated and that emergence can be shifted to smaller scales with TT-aware evaluation, offering a practical path toward fairer, more predictive scaling laws. The work calls for a reorientation of evaluation paradigms to account for TT training as a core determinant of benchmark performance and emergence.
Abstract
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices such as training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of practices that utilize knowledge about evaluation tasks at training time. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for the effect of training on the test task on benchmark evaluations. Put simply, we fine-tune each model under comparison on the same task-relevant data prior to evaluation. We then show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models, with broad implications for benchmarking and the study of emergent capabilities.
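The adjustment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `ToyModel`, `toy_finetune`, `toy_evaluate`, and `adjusted_scores` are hypothetical names, and the scoring rule and all numbers are toy assumptions chosen only to show how fine-tuning every model on the same task-relevant data before evaluation can remove a gap that stems purely from differing test-task exposure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical stand-in for a language model (not from the paper)."""
    ability: float        # general capability, independent of task exposure
    task_exposure: float  # 0.0 = never trained on the test task, 1.0 = saturated


def toy_finetune(model: ToyModel, task_data: List[str]) -> ToyModel:
    """Toy assumption: fine-tuning on any non-empty task data saturates exposure."""
    return ToyModel(model.ability, 1.0) if task_data else model


def toy_evaluate(model: ToyModel) -> float:
    """Toy assumption: score is ability plus a bonus proportional to exposure."""
    return round(model.ability + 0.2 * model.task_exposure, 3)


def adjusted_scores(
    models: Dict[str, ToyModel],
    finetune: Callable[[ToyModel, List[str]], ToyModel],
    evaluate: Callable[[ToyModel], float],
    task_data: List[str],
) -> Dict[str, float]:
    """Fine-tune every model on the SAME task-relevant data, then evaluate."""
    return {name: evaluate(finetune(m, task_data)) for name, m in models.items()}


# Two toy models with equal ability but different test-task exposure.
models = {"older": ToyModel(0.5, 0.0), "newer": ToyModel(0.5, 1.0)}

# Direct evaluation: the gap reflects exposure only, not ability.
raw = {name: toy_evaluate(m) for name, m in models.items()}

# Adjusted evaluation: both models receive the same task-relevant fine-tuning first.
adjusted = adjusted_scores(models, toy_finetune, toy_evaluate, ["task example"])

print("raw:", raw)
print("adjusted:", adjusted)
```

In this toy setting the two models differ only in exposure, so the raw gap closes entirely after adjustment. With real models, any gap that remains after the shared fine-tuning step is what the adjusted comparison would attribute to factors other than test-task training.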
