Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu
TL;DR
This work tackles the verification bottleneck in research-level mathematics by introducing Consequence-Based Utility (CBU), an oracle-free evaluator that scores candidate solutions by their downstream usefulness on a neighborhood of verifiable problems, formalized as $U(C)=\frac{1}{|\mathcal{N}(Q)|}\sum_{Q^*\in\mathcal{N}(Q)} \mathbb{E}_{\tilde{C}\sim M_\theta(\cdot|Q,C,Q^*)}[v(Q^*,\tilde{C})]$. It builds ExpertMath, a dataset of 192 expert-written problems and 425 LLM-generated problems, to benchmark validation methods and demonstrate that CBU consistently surpasses LLM judges and reward-model baselines across multiple backbones, with notable gains on hard problems (e.g., Acc@1 and AUC improvements for GPT-OSS-120B from 67.21 to 76.27 and 71.42 to 79.63, respectively). The approach emphasizes downstream transfer via neighborhood questions and in-context learnability as a correctness signal, providing a practical guide for constructing neighborhoods and budgeting rollouts. The work further discusses limitations, such as neighborhood construction requirements, and outlines extensions to other STEM domains and automated neighborhood generation for broader applicability.
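The utility formula above can be read as a two-level average, sketched here in Python as an illustration rather than the authors' implementation. The sketch takes $\mathcal{N}(Q)$ to be a list of neighborhood questions, $M_\theta$ to be a solver that samples an attempt $\tilde{C}$ given $(Q, C, Q^*)$, and $v$ to be a 0/1 verifier, and it replaces the expectation with a fixed number of rollouts. The names `solve`, `verify`, `rollouts`, and the toy arithmetic stand-ins are all hypothetical.

```python
from typing import Callable, Sequence


def consequence_based_utility(
    question: str,
    candidate: str,
    neighborhood: Sequence[str],
    solve: Callable[[str, str, str], str],
    verify: Callable[[str, str], bool],
    rollouts: int = 4,
) -> float:
    """Monte Carlo estimate of U(C): verified success rate averaged over N(Q)."""
    if not neighborhood:
        raise ValueError("neighborhood must contain at least one question")
    if rollouts < 1:
        raise ValueError("rollouts must be at least 1")

    total = 0.0
    for q_star in neighborhood:
        # Inner expectation: sample attempts conditioned on (Q, C, Q*)
        # and average the verifier's 0/1 verdicts.
        hits = sum(
            1 for _ in range(rollouts)
            if verify(q_star, solve(question, candidate, q_star))
        )
        total += hits / rollouts
    # Outer average over the neighborhood.
    return total / len(neighborhood)


# Toy stand-ins (assumptions for illustration only): the "solver" copies the
# method named in the candidate, and the "verifier" checks the known answer.
def toy_solve(question: str, candidate: str, q_star: str) -> str:
    a, b = (int(x) for x in q_star.split("+"))
    return str(a + b) if "add" in candidate else str(a * b)


def toy_verify(q_star: str, attempt: str) -> bool:
    a, b = (int(x) for x in q_star.split("+"))
    return attempt == str(a + b)


neighbors = ["2+3", "10+5"]
good_score = consequence_based_utility(
    "1+1", "add the two numbers", neighbors, toy_solve, toy_verify
)
bad_score = consequence_based_utility(
    "1+1", "multiply the two numbers", neighbors, toy_solve, toy_verify
)
```

In this toy setting the candidate that carries the right method transfers to every neighbor and the wrong one transfers to none, which is the separation the score is meant to expose. With a real stochastic solver, the `rollouts` budget trades cost against the variance of the estimate.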
Abstract
Recent progress in reasoning models suggests that generating plausible attempts at research-level mathematics may be within reach, but verification remains a bottleneck that consumes scarce expert time. We hypothesize that a meaningful solution contains enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than incorrect solutions do. Building on this idea, we propose \textbf{Consequence-Based Utility}, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar for solving related yet verifiable questions. We evaluate the approach on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. For GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Compared to LLM judges, it also exhibits a larger solver-evaluator gap, maintaining stronger correct-wrong separation even on instances that the underlying solver often fails to solve.
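For readers who want the two ranking metrics made concrete, the following Python sketch uses their standard definitions. This section does not spell out the paper's exact definitions, so treat these as assumptions: Acc@1 is taken as whether the top-scored candidate for a problem is correct, and AUC as the fraction of (correct, wrong) candidate pairs in which the correct one scores higher, with ties counted as half. The function names and example scores are hypothetical.

```python
from typing import Sequence


def acc_at_1(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """1.0 if the highest-scored candidate is correct, else 0.0.

    Ties are broken in favor of the earliest candidate.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return 1.0 if labels[best] else 0.0


def pairwise_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Fraction of (correct, wrong) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong candidate")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical evaluator scores for four candidates of one problem.
example_scores = [0.9, 0.4, 0.6, 0.2]
example_labels = [True, False, False, True]
example_acc = acc_at_1(example_scores, example_labels)      # 1.0
example_auc = pairwise_auc(example_scores, example_labels)  # 0.5
```

The example shows why the two numbers are reported together: the evaluator puts a correct candidate on top, yet ranks the other correct candidate below both wrong ones, so Acc@1 is perfect while AUC sits at chance.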
