Improving Value-based Process Verifier via Low-Cost Variance Reduction
Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang
TL;DR
This work tackles the high-variance problem in the Monte Carlo estimates used to train value-based process verifiers for LLM reasoning. It introduces Compound Monte Carlo Sampling (ComMCS), a linear combination of current-step and future-step estimators that preserves unbiasedness while reducing variance without extra LLM calls. The authors provide a theoretical analysis showing unbiasedness and variance reduction under mild conditions, and implement practical approximations via one-step value distribution modeling using a Gaussian-capped categorical approach. Empirical results on MATH-500 and GSM8K show consistent improvements over the regression-based and non-variance-reduced baselines, including gains of up to 2.8 points in Best-of-$N$ evaluations. Overall, ComMCS offers a cost-efficient, principled path to more reliable value-based verification in complex reasoning tasks, with potential applicability to other domains such as code generation.
Abstract
Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning, particularly in complex domains such as mathematics, remains a significant challenge. Value-based process verifiers, which estimate the probability that a partial reasoning chain leads to a correct solution, are a promising approach for improving reasoning. Nevertheless, their effectiveness is often hindered by estimation error in their training annotations, a consequence of the limited number of Monte Carlo (MC) samples that are feasible given the high cost of LLM inference. In this paper, we identify that the estimation error primarily arises from high variance rather than bias, and that the MC estimator is a Minimum Variance Unbiased Estimator (MVUE). To address the problem, we propose the Compound Monte Carlo Sampling (ComMCS) method, which constructs an unbiased estimator by linearly combining the MC estimators from the current and subsequent steps. Theoretically, we show that our method leads to a predictable reduction in variance while maintaining an unbiased estimate, without additional LLM inference cost. We also perform empirical experiments on the MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method. Notably, in the Best-of-32 sampling experiment on MATH-500, ComMCS outperforms the regression-based optimization method by 2.8 points and the non-variance-reduced baseline by 2.2 points.
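
The variance argument can be illustrated with a toy simulation. The sketch below is not the ComMCS estimator itself: it assumes two independent MC estimators that target the same success probability `p`, each built from `k` Bernoulli rollouts, and combines them with a fixed weight `alpha`. The names `mc_estimate`, `compare`, `p`, `k`, and `alpha` are hypothetical and chosen only for this illustration. It shows the general fact the method relies on: a convex combination of unbiased estimators stays unbiased and can have lower variance than either estimator alone.

```python
import random
import statistics


def mc_estimate(p, k, rng):
    """Fraction of k simulated rollouts that succeed (each succeeds with prob. p)."""
    return sum(rng.random() < p for _ in range(k)) / k


def compare(p=0.4, k=8, alpha=0.5, trials=20000, seed=0):
    """Compare a plain MC estimator with a linear combination of two of them.

    Assumption for this toy: both estimators are independent and unbiased
    for the same value p. The paper's setting (current and subsequent
    reasoning steps) is more structured than this.
    """
    rng = random.Random(seed)
    plain, combined = [], []
    for _ in range(trials):
        a = mc_estimate(p, k, rng)  # stand-in for the current-step estimator
        b = mc_estimate(p, k, rng)  # stand-in for a subsequent-step estimator
        plain.append(a)
        combined.append(alpha * a + (1 - alpha) * b)
    return (
        statistics.mean(plain),
        statistics.variance(plain),
        statistics.mean(combined),
        statistics.variance(combined),
    )


if __name__ == "__main__":
    mean_plain, var_plain, mean_comb, var_comb = compare()
    print(f"plain:    mean={mean_plain:.4f}  var={var_plain:.4f}")
    print(f"combined: mean={mean_comb:.4f}  var={var_comb:.4f}")
```

Under these assumptions the plain estimator has variance close to p(1-p)/k, and the equally weighted combination roughly halves it while both sample means stay near `p`. In the setting described above, the subsequent-step estimates are already available from annotating later steps, which is why the combination is described as requiring no additional LLM inference.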
