Heimdall: test-time scaling on the generative verification

Wenlei Shi, Xing Jin

TL;DR

Heimdall introduces a long chain-of-thought verifier trained with reinforcement learning to reliably judge solution correctness in competitive math problems. It achieves strong test-time scaling through longer reasoning and repeated verifications, and further enhances problem solving via Pessimistic Verification, which selects the most likely correct solution with quantified uncertainty across solver outputs. The approach generalizes to math proofs and supports an automatic knowledge-discovery prototype using NuminaMath data to detect flawed problems and solutions. Together, these components offer a practical framework for integrating robust verification into AI-driven reasoning and knowledge discovery, with significant implications for reliability and scalability of AI systems.

Abstract

An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought (CoT) reasoning has demonstrated the great potential of LLMs for solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, a long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. In human evaluation, Heimdall generalizes well, successfully detecting most issues in challenging math proofs, a problem type not included in training. Furthermore, we propose Pessimistic Verification to extend Heimdall to scaling up problem solving. It calls Heimdall to judge the solutions from a solver model and, following the pessimistic principle, selects the most likely correct solution with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget and to 83.3% with a larger compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system in which one component poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.
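
The selection step described in the abstract can be sketched roughly as follows. The sketch groups solver outputs by final answer, averages the verifier's 0/1 judgments within each group, and subtracts a penalty that shrinks as a group collects more verifications, so a lightly verified answer is trusted less. The abstract does not state the scoring formula, so the penalty form, the `alpha` weight, and the function name `pessimistic_select` are assumptions for illustration, not the paper's definition.

```python
import math
from collections import defaultdict


def pessimistic_select(answers, verdicts, alpha=0.1):
    """Pick the answer with the best pessimistic (penalized) verifier score.

    answers:  final answer of each sampled solution (any hashable value).
    verdicts: verdicts[i] is a list of 0/1 verifier judgments for solution i.
    alpha:    hypothetical penalty weight (an assumption, not from the paper).
    """
    if len(answers) != len(verdicts):
        raise ValueError("answers and verdicts must have the same length")

    # Pool the verifier judgments of all solutions that share a final answer.
    grouped = defaultdict(list)
    for answer, judged in zip(answers, verdicts):
        if judged:
            grouped[answer].extend(judged)
    if not grouped:
        raise ValueError("no verifier judgments were provided")

    total = sum(len(scores) for scores in grouped.values())

    def penalized_score(scores):
        mean = sum(scores) / len(scores)
        # Assumed penalty: larger when few judgments support this answer.
        penalty = alpha * math.log(total + 1) / (len(scores) + 1)
        return mean - penalty

    return max(grouped, key=lambda a: penalized_score(grouped[a]))
```

With `alpha=0` this reduces to picking the answer with the highest mean verifier score; larger values of `alpha` increasingly favor answers backed by many verifications, which is one way to read the "least uncertainty" criterion.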

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Scaling of Heimdall. Left: the verification accuracy scales with the response length during RL training. With more reasoning tokens, Heimdall judges the solutions on AIME2024 more accurately. Middle: the verification accuracy scales with repeated sampling and Majority Voting. By sampling multiple verification trajectories and voting, the accuracy can be further improved. Right: with Heimdall scoring the solutions on AIME2025, the problem solving accuracy scales with the number of solutions. We verify $16$ times on each solution and select the most likely correct one with Pessimistic Verification ($\times 16$). When paired with various solver models, Heimdall gives significant improvements over pure solver-based Majority Voting (MV).
  • Figure 2: Accuracy and response length during RL training. PPO w/o data filtering is the RL training with all problems in the dataset. Left: the accuracy on AIME2024 over the training steps. Right: the response length on the training dataset over the training steps.
  • Figure 3: The inference-time scaling of verification ability on problem solutions in AIME2024 and AIME2025. Top-left: the accuracy of Heimdall when we sample multiple verification responses and make the judgment by majority voting. Top-right: the decreasing false-negative rate (FNR) and false-positive rate (FPR) as we scale up verification responses with majority voting. Bottom-left: we calculate the average score of verification responses and plot the AUC against the number of responses. Bottom-right: we collect the verification failure cases on every math problem and plot the number of verification failures against the difficulty of the problem, which reveals that verification difficulty does not necessarily correlate with the difficulty of the original problem.
  • Figure 4: The inference-time scaling of problem solving with Heimdall. The two figures show the accuracy on AIME datasets as the number of solutions scales up. Left: the problem solving accuracy on AIME2025 dataset scales with the number of solutions. The colored shaded area represents the area covered by the accuracy curves of a selection algorithm as the number of verifications increases from 1 to 64. Right: the contour map of the accuracy of Pessimistic Verification as the number of solutions (x-axis) and the number of verifications (y-axis) increase. The red curve indicates the optimal configurations within various overall compute budgets.
  • Figure 5: The distribution of verification scores on the problems of a synthetic dataset. The x-axis is the sum of scores across $8$ verifications and the y-axis is the number of problems corresponding to each sum.
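
The majority-voting aggregation and the error rates referred to in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. It assumes each verification response is reduced to a 0/1 verdict (1 meaning "solution judged correct") and that ground-truth labels are available. The function names and the tie-breaking rule (a tie counts as "incorrect") are assumptions made here, not details taken from the paper.

```python
def majority_vote(verdicts):
    """Return 1 if strictly more than half of the 0/1 verdicts are 1, else 0.

    Ties resolve to 0 ("incorrect"); this rule is an assumption.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 1 if 2 * sum(verdicts) > len(verdicts) else 0


def error_rates(predictions, labels):
    """Return (FPR, FNR) for 0/1 predictions against 0/1 ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    false_pos = sum(1 for p, y in pairs if p == 1 and y == 0)
    false_neg = sum(1 for p, y in pairs if p == 0 and y == 1)
    negatives = sum(1 for y in labels if y == 0)
    positives = len(labels) - negatives
    # A rate is reported as 0.0 when its class is absent from the labels.
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Toy data: four solutions, three sampled verdicts each (made-up values).
sampled = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0]]
truth = [1, 0, 0, 1]
voted = [majority_vote(v) for v in sampled]
fpr, fnr = error_rates(voted, truth)
```

Re-running the same computation with more sampled verdicts per solution is how one would trace curves like those in Figure 3 (top row) for a given verifier.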