Train-before-Test Harmonizes Language Model Rankings

Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt

TL;DR

The paper addresses inconsistent language-model benchmark rankings by proposing train-before-test: every model receives identical, benchmark-specific fine-tuning before evaluation, so that the comparison measures model potential rather than out-of-the-box performance. Across 24 benchmarks and 61 models, this approach raises average cross-benchmark ranking agreement (Kendall's $\tau$) from 0.52 to 0.76, restores alignment between perplexity and downstream performance, and reveals a predominantly rank-one structure in the model-score matrix driven by a single latent factor related to pre-training compute. The findings suggest that potential, not just out-of-the-box performance, better captures a model's transferability and development prospects, which matters for both benchmark design and model selection. Adopting train-before-test as a standard alongside direct evaluation could make model evaluation more reliable and easier to interpret.

Abstract

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
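
To make the rank-one claim concrete, here is a minimal sketch of one way to measure how close a model-score matrix is to rank one: the share of the squared Frobenius norm captured by the top singular value. This excerpt does not state the authors' exact procedure (for example, whether scores are centered or normalized first), so the function name `top_singular_share`, the power-iteration approach, and the toy matrices are all assumptions made for illustration.

```python
import math


def top_singular_share(matrix, iterations=200):
    """Share of the squared Frobenius norm captured by the top singular value.

    `matrix` is a list of rows (models) by columns (benchmarks).
    Hypothetical helper: assumes non-negative scores, so the all-ones
    start vector is not orthogonal to the top right singular vector.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    total = sum(value * value for row in matrix for value in row)
    if total == 0:
        raise ValueError("matrix must contain at least one non-zero score")

    def apply(vec):
        # Compute A @ vec.
        return [sum(row[j] * vec[j] for j in range(n_cols)) for row in matrix]

    v = [1.0] * n_cols
    for _ in range(iterations):
        u = apply(v)
        # Compute A^T @ u, i.e. one power-iteration step on A^T A.
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("start vector is orthogonal to the row space")
        v = [x / norm for x in w]

    # With v at unit length, ||A v||^2 approximates the top singular value squared.
    sigma_sq = sum(x * x for x in apply(v))
    return sigma_sq / total


# Made-up scores: each row is a multiple of (0.5, 1.0), so the matrix is exactly rank one.
rank_one_scores = [[0.5, 1.0], [1.0, 2.0], [1.5, 3.0]]
# Made-up scores with two equally strong, independent factors.
two_factor_scores = [[1.0, 0.0], [0.0, 1.0]]

rank_one_share = top_singular_share(rank_one_scores)
two_factor_share = top_singular_share(two_factor_scores)
```

A share close to 1 indicates that a single latent factor accounts for almost all of the matrix, which is the sense in which the abstract calls the train-before-test score matrix "essentially rank one". A share near 0.5 for a two-column matrix, as in the second toy case, indicates two equally important factors.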

Paper Structure

This paper contains 27 sections, 15 figures, 4 tables, 3 algorithms.

Figures (15)

  • Figure 1: Rankings of 61 language models on two question-answering benchmarks: Natural Questions Open and ARC Challenge. Left: Direct evaluation leads to inconsistent rankings. Although both benchmarks test for question-answering ability, the resulting model rankings show substantial disagreement. Right: Train-before-test aligns model rankings. Note: For each of the two plots, we greedily align model rankings as much as possible without violating confidence intervals, thus revealing only those ranking changes that are statistically significant. See Appendix \ref{app:banner} for more details.
  • Figure 2: Mean ranking agreement between each benchmark and all others. We calculate Kendall's $\tau$ between each benchmark and every other benchmark, and then average the results. Compared to direct evaluation, train-before-test consistently improves ranking agreement, often by a large margin. A detailed comparison of Kendall's $\tau$ values for every benchmark pair is provided in Appendix \ref{app:cross_task_agree}. Overall, the average Kendall's $\tau$ is 0.52 for direct evaluation and 0.76 for train-before-test.
  • Figure 3: Cross-category ranking agreement for direct evaluation (left) and train-before-test (right). We categorize benchmarks into language understanding (LU), commonsense reasoning (CR), question answering (QA), physics/biology/chemistry (PBC), math (Math), and medicine (Med); see Table \ref{tab:benchmark_categories}. Kendall's $\tau$ is averaged across all pairs of benchmarks that belong to two given categories. The diagonal entries represent intra-category agreement and the others represent inter-category agreement. Train-before-test improves both intra- and inter-category ranking agreement in all instances.
  • Figure 4: Ranking agreement between perplexity rankings and downstream benchmark rankings under direct evaluation (top) and train-before-test (bottom). Perplexity rankings are consistent with each other under both evaluation schemes, with an average Kendall's $\tau$ of 0.76 and 0.78, respectively. However, for direct evaluation, agreement between perplexity rankings and downstream rankings is low, with an average Kendall's $\tau$ of just 0.48. Fortunately, train-before-test results in much higher agreement between perplexity and downstream evaluations, increasing average Kendall's $\tau$ to 0.74.
  • Figure 5: Ranking agreement between perplexity rankings before fine-tuning (direct evaluation) and downstream benchmark rankings after fine-tuning (train-before-test) for base models (top) and instruction-tuned models (bottom). Unlike Figure \ref{fig:pp}, where both rankings in each comparison use the same evaluation scheme, here we test whether pre-fine-tuning perplexity can predict post-fine-tuning downstream performance. Base models show strong correlation (average Kendall's $\tau$ = 0.78), suggesting perplexity is a good predictor of model potential. Instruction-tuned models show much weaker correlation (average Kendall's $\tau$ = 0.51).
  • ...and 10 more figures
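
The mean ranking agreement described in the Figure 2 caption (Kendall's $\tau$ between each benchmark and every other benchmark, then averaged) can be sketched as follows. This is a minimal illustration, assuming Kendall's tau-a with no tie correction; the captions do not say which variant the authors use. The helper names `kendall_tau` and `mean_agreement`, the benchmark names, and the scores are all hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (ties count as zero)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both benchmarks order models i and j the same way
        elif sign < 0:
            discordant += 1  # the two benchmarks disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_agreement(scores):
    """Average tau between each benchmark and all other benchmarks.

    `scores` maps a benchmark name to a list of model scores, with the
    models listed in the same order for every benchmark.
    """
    result = {}
    for name in scores:
        taus = [kendall_tau(scores[name], scores[other])
                for other in scores if other != name]
        result[name] = sum(taus) / len(taus)
    return result


# Made-up scores for four hypothetical models on three hypothetical benchmarks.
toy_scores = {
    "bench_a": [0.10, 0.20, 0.30, 0.40],
    "bench_b": [0.15, 0.25, 0.35, 0.45],
    "bench_c": [0.40, 0.30, 0.20, 0.10],
}
agreement = mean_agreement(toy_scores)
```

In this toy data, `bench_a` and `bench_b` rank the models identically while `bench_c` reverses the order, so `bench_c` has the lowest mean agreement. Running the same computation once on direct-evaluation scores and once on train-before-test scores would give the two sets of values that Figure 2 compares.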