From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth

TL;DR

The paper investigates how large language model (LLM) judges perform on mathematical reasoning tasks, using multiple large and small models across three datasets with verifiable solutions. It shows that judge performance correlates with the task performance of the candidate models, indicating that judges tend to favor answers from higher-quality models, and that a substantial portion of judgments can be predicted from simple linguistic features such as part-of-speech (POS) n-grams. The authors analyze both population-level and sample-level dynamics, finding that judges reliably identify the higher-quality model but do not reliably improve task performance; the practical guidance favors using judges as answer generators with majority voting. Overall, the work highlights systematic biases in LLM judges and calls for careful application, while outlining avenues to better understand and harness judge behavior in verifiable domains.

Abstract

To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags and find that we can correctly predict 70%-75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on-average better model but largely fail if we use them to improve task performance.

Paper Structure (55 sections, 4 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: In our problem setup, two LLMs ($A$ and $B$) provide candidate answers for a math problem, and a judge LLM has to decide which one is correct. All three use chain-of-thought (CoT) reasoning (Wei et al., 2022).
  • Figure 2: Class confusion matrices per model. We observe that it is difficult for judges to detect that both answers are incorrect.
  • Figure 3: Judgment Performance $S^J_{A,B}$ of LLM judges on model pairs, averaged across datasets.
  • Figure 4: Judges' accuracy vs. performance gap between two candidate models $A$ and $B$. Each point represents a subset where $A$ is correct, and $B$ is incorrect. The color reflects the size of these subsets.
  • Figure 5: Percentage of model pairs $(A, B)$ where a judge picks the better model $A$ (meaning $S_A > S_B$) by selecting more answers from $A$ than from $B$.
  • ...and 4 more figures
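
The quantity described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' code: the input layout (a tuple of the two task accuracies plus a list of per-sample judge picks), the function names, and the choice to skip pairs with equal task accuracy are all assumptions made for illustration.

```python
from collections import Counter


def judge_prefers_better(s_a, s_b, choices):
    """Check whether the judge picked the higher-scoring model more often.

    s_a, s_b: task accuracies of candidates A and B (hypothetical inputs).
    choices:  per-sample judge picks, each either "A" or "B".
    Returns None when both models have the same task accuracy
    (an assumption: such pairs have no "better" model to detect).
    """
    if s_a == s_b:
        return None
    counts = Counter(choices)
    better, worse = ("A", "B") if s_a > s_b else ("B", "A")
    return counts[better] > counts[worse]


def pct_pairs_better_picked(pairs):
    """Percentage of model pairs for which the judge favors the better model.

    pairs: iterable of (s_a, s_b, choices) tuples, one per model pair.
    """
    results = [judge_prefers_better(s_a, s_b, ch) for s_a, s_b, ch in pairs]
    decided = [r for r in results if r is not None]
    if not decided:
        return 0.0
    return 100.0 * sum(decided) / len(decided)


if __name__ == "__main__":
    # Made-up toy data, not results from the paper.
    toy_pairs = [
        (0.8, 0.6, ["A", "A", "B"]),  # A is better and is picked more often
        (0.5, 0.7, ["A", "A", "B"]),  # B is better but A is picked more often
        (0.6, 0.6, ["A"]),            # tie in task accuracy, skipped
    ]
    print(pct_pairs_better_picked(toy_pairs))
```

Note that this is a population-level check: it only asks which model receives more picks overall, so a judge can score well here while still misjudging many individual samples.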