Mediocrity is the key for LLM as a Judge Anchor Selection

Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend

Abstract

The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. In particular, common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, their comparisons are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to that of judge model selection. We conclude with actionable recommendations. First, we conduct a power analysis and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.

Paper Structure (25 sections, 4 equations, 22 figures, 10 tables)

Figures (22)

  • Figure 1: Kendall’s $\tau$ correlation ($\tau_{p, \mathcal{A}}$) plotted against anchor position. The y-axis shows the correlation between the anchor-based ranking and the quadratic ranking $\pi_{quad}$, while the x-axis represents the anchor's position (rank) in $\pi_{quad}$. The plot reveals an inverted U-shaped relationship: rankings anchored on top- and bottom-ranked models correlate poorly with the gold standard, making these models suboptimal anchors. The judge is Deepseek-v3.
  • Figure 2: Histograms of the frequency of samples (y-axis) grouped by the number of models that outperformed the anchor (x-axis). A value of 0 on the x-axis indicates samples where the anchor was superior to all other models, while higher values indicate samples where the anchor was frequently outperformed. The o3 panel shows a positive skew: most of the data points are clustered on the left, in accordance with o3 being a strong model that usually beats its opponents. Conversely, the low-performing Llama 4 Maverick Instruct shows a negative skew. Gemma 3 27B-Instruct yields a more evenly spread, “flatter” distribution.
  • Figure 3: Kendall’s $\tau$ correlation ($\tau_{p, \mathcal{A}}$) plotted against anchor informativeness. The y-axis shows the correlation between the anchor-based ranking and the quadratic ranking $\pi_{quad}$, while the x-axis represents the anchor's informativeness $I(p, \mathcal{A})$. The plot exhibits a positive correlation between anchor quality and anchor informativeness. The judge is Deepseek-v3.
  • Figure 4: Mean $\tau_{p, \mathcal{A}}$ with respect to human ranking, averaged over random sample selections, as a function of sample size. As the number of samples grows, the variance of the quadratic evaluation correlation decreases. Simultaneously, the mean anchor-based correlation improves, eventually converging with the quadratic correlation at approximately $600$ samples. This does not hold for every individual anchor choice; see the o3 correlation. This demonstrates that anchor-based ranking is more affected by the dataset size than the quadratic ranking. The judge is Deepseek-v3.
  • Figure 5: Decision tree for good pairwise evaluation.
  • ...and 17 more figures
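
The quantities in the captions of Figures 1 and 2 can be illustrated with a small toy sketch. It is not the paper's implementation: the model names, the per-sample scores in `SCORES`, and all helper functions are hypothetical, and a judge verdict is approximated by comparing two numbers. The sketch ranks models by win rate against a single anchor, ranks them by win rate against all other models (a stand-in for the quadratic ranking $\pi_{quad}$), and compares the two with a simple Kendall's $\tau$ (the tau-a variant, where tied pairs contribute zero). It also counts, per sample, how many models outperformed the anchor, as in the Figure 2 histograms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-sample scores standing in for judge verdicts (assumption):
# on a given sample, model x "beats" model y if its score is strictly higher.
SCORES = {
    "strong": [9, 9, 8, 9],
    "mid_hi": [7, 5, 7, 6],
    "mid_lo": [5, 4, 4, 7],
    "weak":   [1, 2, 3, 1],
}
N_SAMPLES = 4


def win_rate(model, opponents):
    """Fraction of (opponent, sample) comparisons that `model` wins."""
    wins = sum(
        SCORES[model][i] > SCORES[opp][i]
        for opp in opponents
        for i in range(N_SAMPLES)
    )
    return wins / (len(opponents) * N_SAMPLES)


def anchor_scores(anchor):
    """Score every non-anchor model by its win rate against the anchor only."""
    return {m: win_rate(m, [anchor]) for m in SCORES if m != anchor}


def quadratic_scores():
    """Score every model by its win rate against all other models."""
    return {m: win_rate(m, [o for o in SCORES if o != m]) for m in SCORES}


def kendall_tau(x, y):
    """Kendall's tau-a over the models present in both score dicts."""
    keys = sorted(set(x) & set(y))
    total, n_pairs = 0, 0
    for a, b in combinations(keys, 2):
        prod = (x[a] - x[b]) * (y[a] - y[b])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
        n_pairs += 1
    return total / n_pairs


def beaten_by_counts(anchor):
    """Histogram: number of models that outperformed the anchor, per sample."""
    return Counter(
        sum(SCORES[m][i] > SCORES[anchor][i] for m in SCORES if m != anchor)
        for i in range(N_SAMPLES)
    )


if __name__ == "__main__":
    quad = quadratic_scores()
    for anchor in SCORES:
        tau = kendall_tau(anchor_scores(anchor), quad)
        print(anchor, tau, dict(beaten_by_counts(anchor)))
```

In this toy data the extreme anchors win or lose every comparison, so every other model receives the same win rate against them and the resulting $\tau$ is 0, while both middle anchors reproduce the quadratic order of the remaining models ($\tau = 1$). This only mirrors the qualitative pattern described in the captions; real judge verdicts are noisier and would not produce such clean values.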