Train-before-Test Harmonizes Language Model Rankings

Guanhua Zhang; Ricardo Dominguez-Olmedo; Moritz Hardt

TL;DR

The paper addresses inconsistencies in language-model benchmarking by proposing train-before-test, where every model undergoes identical, task-specific fine-tuning before evaluation to measure model potential. Across 24 benchmarks and 61 models, this approach yields high cross-benchmark agreement (Kendall's $\tau$ rising from 0.52 to 0.76), restores alignment between perplexity and downstream performance, and reveals a predominantly rank-one structure in the model-score matrix driven by a single latent factor related to pre-training compute. The findings suggest that potential, not just out-of-the-box performance, better captures a model's transferability and development prospects, with significant implications for benchmarking practices and model selection. Adopting train-before-test as a standard alongside direct evaluation could improve reliability, interpretability, and practical utility in the model evaluation ecosystem.
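
To make the agreement measure concrete, here is a minimal sketch of Kendall's $\tau$ between the rankings that two benchmarks induce over the same models. The scores are made-up numbers for illustration, not results from the paper, and the helper `kendall_tau` is a hypothetical name; it computes the tie-free variant ($\tau$-a), which may differ from the exact variant the authors use.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both benchmarks order models i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of four models on two benchmarks (illustrative only).
bench_a = [0.61, 0.55, 0.72, 0.48]
bench_b = [0.40, 0.42, 0.58, 0.31]

tau_ab = kendall_tau(bench_a, bench_b)
print(f"Kendall's tau between the two rankings: {tau_ab:.2f}")
```

A value of 1 means the two benchmarks rank every pair of models identically, 0 means no systematic agreement, and -1 means fully reversed rankings. Averaging such pairwise values over all benchmark pairs gives one plausible reading of a cross-benchmark agreement score.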

Abstract

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
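
One way to read the "essentially rank one" claim is through the singular values of the model-score matrix: if a single latent factor dominates, the first singular value carries almost all of the squared singular-value mass. The sketch below illustrates this on synthetic data. The function name `top_singular_share`, the variable names, and the numbers are assumptions for illustration, and the paper may normalize or center the matrix differently before its analysis.

```python
import numpy as np


def top_singular_share(scores):
    """Fraction of squared singular-value mass captured by the first component."""
    singular_values = np.linalg.svd(np.asarray(scores, dtype=float), compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))


# Synthetic rank-one matrix: each entry is (model factor) x (benchmark factor).
# Rows are models, columns are benchmarks; the values are illustrative only.
model_factor = np.array([1.0, 2.0, 3.0])
benchmark_factor = np.array([0.5, 1.0, 1.5, 2.0])
rank_one_scores = np.outer(model_factor, benchmark_factor)

share = top_singular_share(rank_one_scores)
print(f"Share explained by the first singular component: {share:.3f}")
```

For an exactly rank-one matrix the share is 1 up to floating-point error; for a matrix with no dominant factor, such as an identity matrix, it drops to one over the number of rows. A share close to 1 on real scores would be consistent with a single dominant latent factor.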
