SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown

TL;DR

SKATE introduces a scalable, automated framework for evaluating evolving LLMs in which models act as both task-setters and solvers in a peer-challenge game built on verifiable tasks. It uses Code-Output Prediction (COP) as a concrete verifiable substrate and applies a TrueSkill ranking to quantify cross-model differences, revealing that weaker models can reliably differentiate between stronger ones and exposing self-preferencing behavior. The approach is data-free, using question clustering to maintain diversity and augmentation strategies to probe information use, which enables open-ended, scalable assessment as models progress. Overall, SKATE surfaces fine-grained capability gaps and provides a pathway for ongoing, automated oversight of LLMs as they advance.
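
To make the idea of a verifiable COP task concrete, here is a minimal sketch of how a predicted output could be checked by simply running the code. It is not the authors' implementation: the function names `run_snippet` and `is_correct`, the timeout, and the whitespace-insensitive comparison are all assumptions made for illustration, and the snippet is assumed to be Python.

```python
import subprocess
import sys
from typing import Optional


def run_snippet(code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run a Python snippet in a fresh interpreter and return its stdout.

    Returns None if the snippet crashes or times out, so that a broken
    question can be discarded instead of scored. (Hypothetical helper;
    model-written code should be run in a proper sandbox in practice.)
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout


def is_correct(code: str, predicted_output: str) -> bool:
    """Check a solver's prediction against the ground truth from execution.

    Assumption: leading and trailing whitespace is ignored when comparing.
    """
    actual = run_snippet(code)
    if actual is None:
        return False
    return actual.strip() == predicted_output.strip()


if __name__ == "__main__":
    question = "print(sum(range(5)))"
    print(is_correct(question, "10"))
    print(is_correct(question, "15"))
```

Because the ground truth comes from executing the code, no LLM judge or human grader is needed to score an answer, which is the property the TL;DR refers to as verifiability.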

Abstract

Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.
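
As a rough illustration of how a TrueSkill-style ranking can be driven by verified answers, the sketch below implements the standard two-player, no-draw TrueSkill update using only the standard library. Several things here are assumptions rather than details taken from the paper: the default parameters (mu = 25, sigma = 25/3, beta = 25/6) are the usual TrueSkill defaults, the dynamics term is omitted, outcomes are treated as binary, and the pairing rule in `apply_question` (each correct model "beats" each incorrect one on that question) is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

MU0 = 25.0          # conventional TrueSkill prior mean (assumed)
SIGMA0 = MU0 / 3    # conventional prior standard deviation (assumed)
BETA = SIGMA0 / 2   # conventional performance noise (assumed)


@dataclass
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner: Rating, loser: Rating) -> Tuple[Rating, Rating]:
    """Two-player TrueSkill update with no draws and no dynamics term."""
    c2 = 2 * BETA**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # size of the mean shift
    w = v * (v + t)         # fraction by which the variance shrinks
    new_winner = Rating(
        winner.mu + winner.sigma**2 / c * v,
        winner.sigma * math.sqrt(1 - winner.sigma**2 / c2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma**2 / c * v,
        loser.sigma * math.sqrt(1 - loser.sigma**2 / c2 * w),
    )
    return new_winner, new_loser


def apply_question(ratings: Dict[str, Rating], correct: Dict[str, bool]) -> None:
    """Hypothetical pairing rule: on one question, every model that answered
    correctly beats every model that answered incorrectly."""
    winners = [m for m, ok in correct.items() if ok]
    losers = [m for m, ok in correct.items() if not ok]
    for w_name in winners:
        for l_name in losers:
            ratings[w_name], ratings[l_name] = update(ratings[w_name], ratings[l_name])


if __name__ == "__main__":
    ratings = {name: Rating() for name in ("model_a", "model_b", "model_c")}
    apply_question(ratings, {"model_a": True, "model_b": False, "model_c": False})
    for name, r in ratings.items():
        print(name, round(r.mu, 2), round(r.sigma, 2))
```

Each update moves the two means apart and shrinks both uncertainties, so rankings that start out uncertain can settle as more questions are answered.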

Paper Structure

This paper contains 44 sections, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: On the left: A Game of SKATE. A set of LLMs take turns to set questions for one another. Players are incentivized by their prompts to write questions which they can answer, but which their competitors cannot. In this way, the complexity of the generated questions scales with the capabilities of the setters themselves. On the right: the TrueSkill rank of each of six frontier models, based on their question answering ability, is initially uncertain and game outcomes are surprising. Eventually, a stable ranking emerges.
  • Figure 2: Cumulative average p(correct) values per model. Lines are different lengths depending on how many valid, unique COP questions each model was able to create in the 50 rounds.
  • Figure 3: Difference in average p(correct) scores between answering model and all other players. Positive values imply a model scores higher on average than its competitors. Highlighted cells are the maxima in each row. In (b) we observe close to maximal entries along the diagonal: with the filter in place, models perform best on their own questions. Note that there are "no valid questions" for Haiku 3.5 after applying the filter: it fails to write any questions which it can answer sufficiently well.
  • Figure 4: In panel (a), four "weaker" agents play a Game of SKATE. In panel (b), we use questions from these four models to rank two new "stronger" models (Sonnet 3.5 and Sonnet 4). In panel (c), Sonnet 3.5 joins the Game and sets its own questions, which all six models answer. In panel (d), Sonnet 4 also joins the Game and sets questions of its own.
  • Figure 5: MCQA choice and ordering effects. Both the option set and its ordering have a large influence on the measured LLM "correctness". To account for this, we propose Algorithm 1.
  • ...and 6 more figures
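
The quantities named in the captions of Figures 2 and 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the input layout (one list of p(correct) values per answering model, all on the same set of questions), and the choice to average the other players' means are hypothetical and may differ from how the figures were actually produced.

```python
from statistics import mean
from typing import Dict, List


def cumulative_average(values: List[float]) -> List[float]:
    """Running mean of p(correct) over questions, as plotted in Figure 2."""
    averages: List[float] = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages


def advantage_over_others(p_correct: Dict[str, List[float]], answerer: str) -> float:
    """Answerer's mean p(correct) minus the mean over all other players.

    Positive values mean the answerer scores higher on average than its
    competitors on these questions (the quantity described for Figure 3).
    """
    own = mean(p_correct[answerer])
    others = [mean(scores) for model, scores in p_correct.items() if model != answerer]
    return own - mean(others)


if __name__ == "__main__":
    # Made-up scores for three hypothetical players on two questions.
    scores = {"A": [1.0, 0.5], "B": [0.5, 0.5], "C": [0.0, 0.5]}
    print(cumulative_average(scores["A"]))
    print(advantage_over_others(scores, "A"))
```

Computing the second function once per question-setter and per answerer would give a grid like the one in Figure 3, where a large diagonal entry indicates a model doing best on its own questions.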