Graders should cheat: privileged information enables expert-level automated evaluations

Jin Peng Zhou, Sébastien M. R. Arnold, Nan Ding, Kilian Q. Weinberger, Nan Hua, Fei Sha

TL;DR

Privileged information (PI) is proposed to augment LM graders for evaluating frontier tasks where LMs lag behind humans. The approach leverages ground-truth solutions, rating guidelines, prior ratings, search results, and multimodal annotations, and further derives hints from PI to adjust problem difficulty. Across RewardBench, Vibe-Eval, and MathOdyssey, PI-augmented graders achieve state-of-the-art or near-expert performance, outperforming human graders in some cases. The paper also analyzes biases and shows that PI can reduce verbosity and formatting biases while enabling tiered evaluations for deeper model comparison. This yields a scalable path toward reliable automated evaluation of advanced AI systems.

Abstract

Auto-evaluating language models (LMs), i.e., using a grader LM to evaluate the candidate LM, is an appealing way to accelerate the evaluation process and reduce the cost associated with it. But this presents a paradox: how can we trust the grader LM, which is presumably weaker than the candidate LM, to assess problems that are beyond the frontier of the capabilities of either model or both? For instance, today's LMs struggle on graduate-level physics and Olympiad-level math, making them unreliable graders in these domains. We show that providing privileged information -- such as ground-truth solutions or problem-specific guidelines -- improves automated evaluations on such frontier problems. This approach offers two key advantages. First, it expands the range of problems where LM graders apply. Specifically, weaker models can now rate the predictions of stronger models. Second, privileged information can be used to devise easier variations of challenging problems, which improves the separability of different LMs on tasks where their performance is generally low. With this approach, general-purpose LM graders match state-of-the-art performance on RewardBench, surpassing almost all the specially-tuned models. LM graders also outperform individual human raters on Vibe-Eval, and approach human expert graders on Olympiad-level math problems.
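To make the idea in the abstract concrete, here is a minimal sketch of how a pairwise grader prompt might be assembled with and without privileged information. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_grader_prompt` and `parse_verdict`, the prompt wording, and the fallback to a tie on unrecognized output are all hypothetical.

```python
from typing import Optional


def build_grader_prompt(
    question: str,
    response_a: str,
    response_b: str,
    privileged_info: Optional[str] = None,
) -> str:
    """Assemble a pairwise grading prompt. All wording here is illustrative."""
    parts = [
        "You are grading two responses to the same prompt.",
        f"Prompt:\n{question}",
    ]
    if privileged_info:
        # Privileged information (e.g. a ground-truth solution or a rating
        # guideline) is shown to the grader only, never to the candidates.
        parts.append(
            f"Reference material (visible to the grader only):\n{privileged_info}"
        )
    parts += [
        f"Response A:\n{response_a}",
        f"Response B:\n{response_b}",
        "Answer with exactly one of: A, B, tie.",
    ]
    return "\n\n".join(parts)


def parse_verdict(grader_output: str) -> str:
    """Map raw grader text to 'A', 'B', or 'tie'.

    Assumption: anything unrecognized is treated as a tie.
    """
    text = grader_output.strip().lower()
    if text in ("a", "b"):
        return text.upper()
    return "tie"
```

The only difference between the baseline grader and the augmented one in this sketch is the optional reference block, which keeps the two conditions directly comparable.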

Paper Structure

This paper contains 15 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: On Vibe-Eval, graders with privileged information outperform individual human graders. Spearman correlation is measured against the average vote of 5 human graders. Left: Both Gemini 1.5 Flash and Pro can outperform individual human graders, and they both perform best when given different sources of privileged information. Middle: Individual humans also benefit from privileged information, albeit not as much as automatic graders. Right: Gemini 1.5 Pro benefits from privileged information especially on the Hard split of Vibe-Eval, indicating privileged information is especially useful for frontier benchmarks.
  • Figure 2: Automatic graders augmented with privileged information. The blue boxes represent the typical LM grader pipeline, where two models $A$ and $B$ respond to a prompt. The grader is tasked with deciding whether response $A$ or $B$ is better, or whether it is a tie. We propose to equip the grader with prompt-specific privileged information to ease the evaluation task, here a short derivation with the ground-truth solution. See \ref{sec:method} for a more detailed description.
  • Figure 3: Hints improve separation on frontier problems. On MATH-Adv and GPQA, giving no hints leaves the problems too difficult, while giving all hints makes them too easy. In both cases we need 1 or 2 hints to reliably separate candidate models. Thus hints synthesized from PI effectively interpolate the difficulty of frontier problems, which helps separate weaker models from stronger ones.
  • Figure 4: Tiered difficulty analysis. Hints synthesized from privileged information enable a "tiered" analysis, where we can compare models on the same problems at different difficulty levels. Our analysis sheds light on a previously unknown result: comparatively, Gemini models shine on easier problems whereas GPT-4o is competitive on more difficult problems.
  • Figure 5: Automatic graders benefit significantly from privileged information when evaluating Olympiad-level math problems. On the Olympiad subset of MathOdyssey, the Spearman correlation between LM and expert human graders improves by as much as $0.37$ points with privileged information. Overall, the best LM grader reaches up to $0.71$ Spearman correlation, approaching the quality of human experts. Lightweight models (like Gemma 2 27B and Gemini Flash) especially benefit from privileged information and decisively outperform all other LM graders that do not use it.
  • ...and 6 more figures
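
Several captions above report Spearman correlation between a grader and a human reference, for example the average vote of 5 human graders in Figure 1. The sketch below shows one way such a number could be computed using only the Python standard library. The helper names (`average_ranks`, `pearson`, `spearman`, `grader_agreement`) and the toy data in the usage note are assumptions for illustration, not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries equal to the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def grader_agreement(grader_scores, human_votes):
    """Correlate grader scores with the mean human vote for each example.

    human_votes holds one list of per-rater scores for each example.
    """
    reference = [mean(votes) for votes in human_votes]
    return spearman(grader_scores, reference)
```

As a usage example with made-up scores, `grader_agreement([1, 2, 3, 4], [[1, 1, 2, 1, 1], [2, 3, 2, 2, 2], [3, 3, 4, 3, 3], [5, 4, 5, 5, 4]])` compares four grader scores against the mean of five hypothetical human votes per example. Because Spearman correlation depends only on ranks, a grader that orders responses the same way as the human average scores highly even if its absolute scale differs.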