Leaderboard Incentives: Model Rankings under Strategic Post-Training

Yatong Chen, Guanhua Zhang, Moritz Hardt

TL;DR

The paper proves that, under mild conditions, the recently proposed tune-before-test evaluation protocol induces a benchmark with a unique Nash equilibrium that ranks models by latent quality. Benchmarks therefore need not set bad incentives, even if current evaluations do.

Abstract

Influential benchmarks incentivize competing model developers to strategically allocate post-training resources toward improvements on the leaderboard, a phenomenon dubbed benchmaxxing or training on the test task. In this work, we initiate a principled study of the incentive structure that benchmarks induce. We model benchmarking as a Stackelberg game between a benchmark designer who chooses an evaluation protocol and multiple model developers who compete simultaneously in a subgame given by the designer's choice. Each competitor has a model of unknown latent quality and can inflate its observed score by allocating resources to benchmark-specific improvements. First, we prove that current benchmarks induce games for which no Nash equilibrium between model developers exists. This result suggests one explanation for why current practice leads to misaligned incentives, prompting model developers to strategize in opaque ways. However, we prove that under mild conditions, a recently proposed evaluation protocol, called tune-before-test, induces a benchmark with a unique Nash equilibrium that ranks models by latent quality. This positive result demonstrates that benchmarks need not set bad incentives, even if current evaluations do.
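
The contrast between the two results can be illustrated with a small brute-force search for pure Nash equilibria in a toy follower game. This is only a sketch under loud assumptions, not the paper's model: the logarithmic score `theta + b * log(1 + effort + tbt)`, the winner-takes-all prize, the linear cost, and the integer effort grid are all hypothetical choices made for illustration.

```python
import itertools
import math


def score(theta, effort, tbt, b=0.1):
    # Assumed post-effort score: latent quality plus a log gain in total tuning.
    return theta + b * math.log1p(effort + tbt)


def utilities(efforts, thetas, tbt, prize=1.0, cost=0.2):
    # Assumed payoff: the top score wins the prize (ties split it); effort has linear cost.
    scores = [score(t, e, tbt) for t, e in zip(thetas, efforts)]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    return [
        (prize / len(winners) if i in winners else 0.0) - cost * e
        for i, e in enumerate(efforts)
    ]


def pure_nash_equilibria(thetas, tbt, grid):
    # Enumerate every effort profile and keep those with no profitable unilateral deviation.
    equilibria = []
    for profile in itertools.product(grid, repeat=len(thetas)):
        base = utilities(profile, thetas, tbt)
        stable = True
        for i in range(len(thetas)):
            for deviation in grid:
                alt = profile[:i] + (deviation,) + profile[i + 1:]
                if utilities(alt, thetas, tbt)[i] > base[i] + 1e-12:
                    stable = False
                    break
            if not stable:
                break
        if stable:
            equilibria.append(profile)
    return equilibria


if __name__ == "__main__":
    thetas = [1.0, 0.9]          # hypothetical latent qualities
    grid = range(13)             # hypothetical integer effort levels 0..12
    print(pure_nash_equilibria(thetas, tbt=0, grid=grid))
    print(pure_nash_equilibria(thetas, tbt=3, grid=grid))
```

In this toy instance, the search finds no pure equilibrium without tune-before-test (`tbt=0`), because the weaker developer can cheaply overtake and the stronger one can cheaply respond. With `tbt=3`, overtaking costs more than the prize, and the only equilibrium is zero effort from both developers, which ranks them by latent quality.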

Paper Structure

This paper contains 46 sections, 9 theorems, 63 equations, 3 figures, and 1 table.

Key Result

Proposition 4.3

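Under Assumptions \ref{['ass:cost']} and \ref{['ass:post-effort-score']}, fix any tune-before-test adjustment level $\Delta^{\textit{tbt}}\geq 0$. If the follower game admits a pure Nash equilibrium $\textbf{e}^*$, then for any $i,j$, $$\theta_i > \theta_j \implies v(\theta_i, e_i^* + \Delta^{\textit{tbt}}) \geq v(\theta_j, e_j^* + \Delta^{\textit{tbt}}).$$ In particular, post-effort scores at equilibrium preserve the latent capability ordering up to ties.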
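
The order-preservation property can be checked numerically for any candidate effort profile. The snippet below is a minimal sketch: the score function `post_effort_score` is an assumed logarithmic form with a hypothetical slope `b`, not the paper's definition, and the effort profiles passed in are illustrative values rather than computed equilibria.

```python
import math
from itertools import combinations


def post_effort_score(theta, effort, tbt, b=0.1):
    # Assumed form: latent capability plus a log gain in total tuning (effort + TbT).
    return theta + b * math.log1p(effort + tbt)


def preserves_capability_order(thetas, efforts, tbt):
    # True if no strictly more capable model ends up with a strictly lower score.
    scores = [post_effort_score(t, e, tbt) for t, e in zip(thetas, efforts)]
    for i, j in combinations(range(len(thetas)), 2):
        if thetas[i] > thetas[j] and scores[i] < scores[j]:
            return False
        if thetas[j] > thetas[i] and scores[j] < scores[i]:
            return False
    return True
```

For instance, `preserves_capability_order([1.0, 0.9], [0, 0], tbt=3)` holds, whereas a profile in which the weaker model spends two units of effort at `tbt=0` inverts the order under these assumed parameters.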

Figures (3)

  • Figure 1: Left: Continued post-training trajectories of Qwen2.5 models of different sizes on Winogrande. Here, we use model size as a proxy for the model’s latent capability $\theta$. The $x$-axis denotes the number of post-training steps (each step corresponds to 8 data points), reflecting post-training effort $e$. The $y$-axis denotes accuracy on the validation set, i.e., $v(\theta, e)$. For each model, we fit a curve following Equation \ref{['eq:log-score']}. The empirical results align with the assumptions of monotonicity in capability, diminishing returns and saturation in effort, and non-decreasing effort gaps in Assumption \ref{['ass:post-effort-score']}. See Appendix \ref{['app:empirical']} for additional details and results for the other eight benchmarks. Right: For each tune-before-test level $\Delta^{\textit{tbt}}$ (the number of benchmark-specific finetuning steps, $x$-axis), we calculate the minimal number of additional steps required ($y$-axis) to change the ranking of at least one model, i.e., $\min_{r\in \{2,\ldots,n\}} {e}^{\text{req}}_r(\Delta^{\textit{tbt}})$, based on the fitted curves on the left. With $\Delta^{\textit{tbt}}=3{,}000$, at least 384,668 training steps are needed to change the ranking of one model.
  • Figure 2: Illustration of the notation and functions used in the proof of Proposition \ref{['prop:order-preserve']}. The post-effort score function $v(\theta, e)$ satisfies the conditions in Assumption \ref{['ass:post-effort-score']}. Here, $e_i^*$ and $e_j^*$ denote the equilibrium efforts of model developers $i$ and $j$, respectively. The counterfactual efforts $\tilde{e}_i$ and $\tilde{e}_j$ are the efforts each model developer would need to match the other developer's post-effort score at TbT level $\Delta^{\textit{tbt}}$, i.e., $v(\theta_i, \tilde{e}_i + \Delta^{\textit{tbt}}) = v(\theta_j, e_j^* + \Delta^{\textit{tbt}})$ and $v(\theta_j, \tilde{e}_j + \Delta^{\textit{tbt}}) = v(\theta_i, e_i^* + \Delta^{\textit{tbt}})$.
  • Figure 3: Continued post-training trajectories of Qwen models of different sizes on nine benchmarks. Here, we use model size as a proxy for the model’s latent capability $\theta$. The $x$-axis denotes the number of post-training steps, reflecting post-training effort $e$. The $y$-axis denotes accuracy on the validation set, i.e., $v(\theta, e)$. For each model, we fit a curve following Equation \ref{['eq:log-score']}.
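
The quantity plotted in the right panel of Figure 1 can be sketched as follows. The code makes several assumptions that are not taken from the paper: the fitted curve is replaced by a hypothetical form `theta + b * log(1 + steps)`, the values of `theta` and `b` are made up, and $e^{\text{req}}_r$ is interpreted as the extra effort the rank-$r$ model needs to match the model ranked directly above it when that model adds no effort. The minimal effort is found by bisection, so the same routine would work for any score curve that is increasing in effort.

```python
import math


def score(theta, steps, b=0.05):
    # Hypothetical stand-in for a fitted curve: log-shaped gain in tuning steps.
    return theta + b * math.log1p(steps)


def overtake_effort(theta_low, theta_high, tbt, b=0.05, tol=1e-9):
    # Smallest extra effort for the weaker model to reach the stronger model's
    # score when both have already received `tbt` tuning steps.
    target = score(theta_high, tbt, b)
    lo, hi = 0.0, 1.0
    while score(theta_low, tbt + hi, b) < target:
        hi *= 2.0  # grow the bracket until the target is reached
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if score(theta_low, tbt + mid, b) < target:
            lo = mid
        else:
            hi = mid
    return hi


def min_rank_change_effort(thetas, tbt, b=0.05):
    # Minimum over adjacent pairs in the capability ranking.
    ranked = sorted(thetas, reverse=True)
    return min(
        overtake_effort(low, high, tbt, b)
        for high, low in zip(ranked, ranked[1:])
    )


if __name__ == "__main__":
    thetas = [0.70, 0.65, 0.55]  # hypothetical latent capabilities
    for tbt in (0, 100, 3000):
        print(tbt, round(min_rank_change_effort(thetas, tbt), 1))
```

Under this assumed curve the required effort has the closed form $(1+\Delta^{\textit{tbt}})\,(e^{(\theta_{\text{high}}-\theta_{\text{low}})/b}-1)$, so it grows linearly in the tune-before-test level. This matches the qualitative trend described in the caption, though not its specific numbers.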

Theorems & Definitions (28)

  • Definition 2.1: Post-Effort Score
  • Example 2.2
  • Definition 3.1: Stackelberg Ranking Game
  • Definition 3.2: Model Developer's Utility
  • Definition 3.3: Follower Game's Nash Equilibrium
  • Definition 3.4
  • Proposition 4.3
  • Proof sketch
  • Definition 4.4: Just-Overtake Effort at TbT Level $\Delta^{\textit{tbt}}$
  • Proposition 4.5: Zero-effort equilibrium condition
  • ...and 18 more