Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu

TL;DR

This work tackles the verification bottleneck in research-level mathematics by introducing Consequence-Based Utility (CBU), an oracle-free evaluator that scores candidate solutions by their downstream usefulness on a neighborhood of verifiable problems, formalized as $U(C)=\frac{1}{|\mathcal{N}(Q)|}\sum_{Q^*\in\mathcal{N}(Q)} \mathbb{E}_{\tilde{C}\sim M_\theta(\cdot|Q,C,Q^*)}[v(Q^*,\tilde{C})]$. It builds ExpertMath, a dataset of 192 expert-written problems and 425 LLM-generated problems, to benchmark validation methods and demonstrate that CBU consistently surpasses LLM judges and reward-model baselines across multiple backbones, with notable gains on hard problems (e.g., Acc@1 and AUC improvements for GPT-OSS-120B from 67.21 to 76.27 and 71.42 to 79.63, respectively). The approach emphasizes downstream transfer via neighborhood questions and in-context learnability as a correctness signal, providing a practical guide for constructing neighborhoods and budgeting rollouts. The work further discusses limitations, such as neighborhood construction requirements, and outlines extensions to other STEM domains and automated neighborhood generation for broader applicability.
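The utility $U(C)$ above can be read as a two-level average: an outer mean over neighborhood questions $Q^*$ and an inner expectation over solver attempts $\tilde{C}$ conditioned on $(Q, C, Q^*)$. The following is a minimal Monte Carlo sketch of that reading, not the authors' implementation. The names `consequence_based_utility`, `solver`, `verifier`, and `n_rollouts`, as well as the toy solver and verifier, are hypothetical and introduced only for illustration.

```python
from typing import Callable, Sequence


def consequence_based_utility(
    question: str,
    candidate: str,
    neighborhood: Sequence[str],
    solver: Callable[[str, str, str], str],
    verifier: Callable[[str, str], float],
    n_rollouts: int = 4,
) -> float:
    """Estimate U(C): mean verified score over neighbors and solver rollouts."""
    if not neighborhood:
        raise ValueError("neighborhood must contain at least one question")
    if n_rollouts < 1:
        raise ValueError("n_rollouts must be at least 1")

    total = 0.0
    for q_star in neighborhood:
        # Inner expectation: average v(Q*, C~) over sampled attempts C~,
        # each produced with (Q, C) supplied as an in-context exemplar.
        scores = [
            verifier(q_star, solver(question, candidate, q_star))
            for _ in range(n_rollouts)
        ]
        total += sum(scores) / n_rollouts
    # Outer average over the neighborhood N(Q).
    return total / len(neighborhood)


# Toy stand-ins (assumptions): a real solver would be an LLM call and a real
# verifier would check the attempt against a known answer for Q*.
def toy_solver(question: str, candidate: str, q_star: str) -> str:
    return "42" if "correct method" in candidate else "0"


def toy_verifier(q_star: str, attempt: str) -> float:
    return 1.0 if attempt == "42" else 0.0


u_good = consequence_based_utility(
    "Q", "uses the correct method", ["Q1", "Q2"], toy_solver, toy_verifier
)
u_bad = consequence_based_utility(
    "Q", "plausible but flawed", ["Q1", "Q2"], toy_solver, toy_verifier
)
```

In this toy setup the candidate that carries the useful method receives the higher utility, which is the signal the approach relies on. With a stochastic solver, the rollout count trades compute against the variance of the estimate.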

Abstract

Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM judges, it also exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances that the underlying solver often fails to solve.
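To make the ranking metrics concrete, here is a small sketch of how Acc@1 and AUC could be computed for one question's candidate pool. It assumes Acc@1 means "the top-scored candidate is correct" and AUC means the probability that a correct candidate outscores a wrong one, with ties counted as one half. The paper's exact definitions and aggregation across questions may differ, and the function names and example numbers are hypothetical.

```python
from typing import Sequence


def acc_at_1(scores: Sequence[float], labels: Sequence[bool]) -> bool:
    """True if the highest-scored candidate is labeled correct."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return bool(labels[best])


def pairwise_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Fraction of (correct, wrong) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong candidate")
    # A tie between a correct and a wrong candidate counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative utilities for four candidates (made-up numbers).
example_scores = [0.9, 0.6, 0.7, 0.2]
example_labels = [True, True, False, False]

top1_correct = acc_at_1(example_scores, example_labels)
auc = pairwise_auc(example_scores, example_labels)  # 3 of 4 pairs ordered correctly
```

Averaging `top1_correct` and `auc` over all target questions would give dataset-level figures under these assumed definitions.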

Paper Structure (36 sections, 24 equations, 8 figures, 6 tables)


Figures (8)

  • Figure 1: Consequence-Based Utility for solution validation. We use GPT-OSS-120B as the solver $M_\theta$ and score each candidate solution by its induced accuracy on neighborhood questions $Q^*$; $U(C^1) > U(C^2)$ suggests $C^1$ is more likely correct.
  • Figure 2: Example of a target question, candidate solutions, and neighborhood questions from ExpertMath. (A) A target research-level problem on the asymptotic Hecke algebra $J$ of the Coxeter group of type $D_8$. (B) A fixed candidate pool $C^{1:3}$ illustrating three typical solution types appearing in our dataset: an expert-written correct solution $C^1$; an LLM-generated solution that is mathematically correct $C^2$; and a plausible but incorrect LLM-generated solution $C^3$ that makes a subtle conceptual error by conflating the number of left Kazhdan–Lusztig cells with the number of irreducible representations. (C) Two neighborhood questions $Q^*$ derived from $Q$ by modifying the Coxeter type or the associated invariant.
  • Figure 2: Validator performance on ranking LLM solutions. Consequence-Based Utility shows the highest performance across all metrics. Best models are highlighted in bold, second best is underlined.
  • Figure 3: Mean score gap (correct - wrong) versus question difficulty for LLM-Judge and Consequence-Based Utility.
  • Figure 4: Illustrative excerpts from incorrect solutions of each error category. Each row shows a representative quoted snippet (top) and a brief explanation of why it is incorrect or insufficient (bottom). We use four non-exclusive labels: incorrect reasoning, unjustified compression, unjustified interpretation, and external references.
  • ...and 3 more figures