Improving Value-based Process Verifier via Low-Cost Variance Reduction

Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang

TL;DR

This work tackles the high variance of the Monte Carlo estimates used to train value-based process verifiers for LLM reasoning. It introduces Compound Monte Carlo Sampling (ComMCS), a linear combination of current-step and next-step estimators that preserves unbiasedness while reducing variance without extra LLM calls. The authors provide a theoretical analysis showing unbiasedness and variance reduction under mild conditions, and implement practical approximations via one-step value distribution modeling using a Gaussian-capped categorical approach. Empirical results on MATH-500 and GSM8K show consistent improvements over regression-based and non-variance-reduced baselines, including gains of up to 2.8 points in Best-of-$N$ evaluations. Overall, ComMCS offers a cost-efficient, principled path to more reliable value-based verification in complex reasoning tasks, with potential applicability to other domains such as code generation.

Abstract

Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning capabilities, particularly in complex domains like mathematics, remain a significant challenge. Value-based process verifiers, which estimate the probability of a partial reasoning chain leading to a correct solution, are a promising approach for improving reasoning. Nevertheless, their effectiveness is often hindered by estimation error in their training annotations, a consequence of the limited number of Monte Carlo (MC) samples feasible due to the high cost of LLM inference. In this paper, we identify that the estimation error primarily arises from high variance rather than bias, and that the MC estimator is a Minimum Variance Unbiased Estimator (MVUE). To address the problem, we propose the Compound Monte Carlo Sampling (ComMCS) method, which constructs an unbiased estimator by linearly combining the MC estimators from the current and subsequent steps. Theoretically, we show that our method leads to a predictable reduction in variance while maintaining an unbiased estimation, without additional LLM inference cost. We also perform empirical experiments on the MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method. Notably, in the Best-of-32 sampling experiment on MATH-500, ComMCS outperforms the regression-based optimization method by 2.8 points and the non-variance-reduced baseline by 2.2 points.
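
To see why a linear combination can stay unbiased while changing the variance, the following generic identity may help. It is an illustration under stated assumptions, not the paper's exact construction: $X$ and $Y$ stand for any two estimators that share the expectation $\mu$, and $\alpha$ is a hypothetical mixing weight introduced here.

```latex
\begin{align}
Z_\alpha &= (1-\alpha)\,X + \alpha\,Y, \\
\mathbb{E}[Z_\alpha] &= (1-\alpha)\,\mu + \alpha\,\mu = \mu, \\
\operatorname{Var}[Z_\alpha] &= (1-\alpha)^2 \operatorname{Var}[X] + \alpha^2 \operatorname{Var}[Y]
  + 2\alpha(1-\alpha)\operatorname{Cov}(X, Y).
\end{align}
```

Whenever some $\alpha$ makes the last line smaller than $\operatorname{Var}[X]$, the combined estimator improves on $X$ using only samples that were already drawn. How ComMCS chooses its coefficients, and under which conditions the reduction holds, is the subject of the paper's own theorems and is not reproduced here.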

Paper Structure

This paper contains 39 sections, 3 theorems, 30 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 4.1

(Equivalence of MC Value Estimation for Binary Returns and Binomial Distribution) Suppose the minimal support set of the return distribution is $\{0,1\}$. Let $V^\pi(s)$ be the true state value for given policy $\pi$, i.e., the expected return starting from state $s$ and policy $\pi$. Suppose we est… Then, the MC estimation process is probabilistically equivalent to sampling from a binomial distrib…
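
The binomial view in the theorem title can be checked numerically. The sketch below is not from the paper: it assumes each rollout returns an independent binary reward that equals 1 with probability equal to the true state value, and all function names are hypothetical. Under that assumption the number of successful rollouts follows a binomial distribution, so the MC estimate has mean $V^\pi(s)$ and variance $V^\pi(s)(1 - V^\pi(s))/k$ for $k$ rollouts.

```python
import random
from math import comb


def rollout_successes(true_value, num_rollouts, rng):
    """Count rollouts with reward 1; each succeeds with probability true_value (assumption)."""
    return sum(1 for _ in range(num_rollouts) if rng.random() < true_value)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def estimator_variance(true_value, num_rollouts):
    """Variance of the MC value estimate (successes / num_rollouts) under the binomial model."""
    return true_value * (1 - true_value) / num_rollouts


def empirical_success_distribution(true_value, num_rollouts, num_repeats, seed=0):
    """Relative frequency of each success count over repeated MC estimations."""
    rng = random.Random(seed)
    counts = [0] * (num_rollouts + 1)
    for _ in range(num_repeats):
        counts[rollout_successes(true_value, num_rollouts, rng)] += 1
    return [c / num_repeats for c in counts]


if __name__ == "__main__":
    value, rollouts = 0.3, 8
    freqs = empirical_success_distribution(value, rollouts, num_repeats=20000)
    for k, freq in enumerate(freqs):
        # Compare simulated frequency with the binomial probability for each estimate k / rollouts.
        print(f"estimate={k / rollouts:.3f} simulated={freq:.4f} binomial={binomial_pmf(k, rollouts, value):.4f}")
    for n in (8, 10, 16):
        print(f"rollouts={n} variance={estimator_variance(value, n):.5f}")
```

Under this model the variance depends only on the true value and the rollout count: at a true value of 0.5 it is 0.03125 with 8 rollouts, 0.025 with 10, and 0.015625 with 16, which is why lowering variance by sampling alone requires proportionally more LLM calls.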

Figures (8)

  • Figure 1: Illustration of the estimation variance across different ground-truth values. We compare the variances for trial counts in $\{8, 10, 16\}$ with the variance after applying ComMCS. The variance after using our method (8-trials, ComMCS) is approximately the variance obtained with 25% more samples (10-trials).
  • Figure 2: Illustration of our proposed ComMCS compared with baseline optimization methods, as discussed in §\ref{sec:method}. Any reasoning trajectory can be divided into several reasoning steps (top left). The value of each reasoning step is estimated by Monte Carlo sampling, i.e., the average of the outcome rewards of the sampled reasoning trajectories (top right, §\ref{sec:method_sec1}). Baseline optimization methods use MSE loss or BCE loss. These methods are based on regression and return distribution modeling respectively, and are trained on the estimated value of the current state, i.e., $\hat{V}^\pi(s_n)$. Our method, which aims to reduce the variance of MC estimation, is based on variance comparison (§\ref{sec:method_sec2}) and one-step value distribution modeling (§\ref{sec:practical_approximation}), and is trained on the estimated values of the current state and the next state, i.e., $\hat{V}^\pi(s_n)$ and $\hat{V}^\pi(s_{n+1})$.
  • Figure 3: Illustration of the MC estimation of the state value and the MDP condition in the mathematical reasoning scenario. Each state $s_t$ is the concatenation of the previous state $s_{t-1}$ and the previous action $a_{t-1}$. Each action $a_t$ is an atomic reasoning step. The first action $a_1$ is underlined. The first state is the question $q$, as defined in §\ref{sec:mdp}. We use brackets "[]" and a semicolon ";" to denote the concatenation of $s_{t-1}$ and $a_{t-1}$. The state value is calculated at the end position of each state, i.e., the "< request >" token position.
  • Figure 4: Visualization of the estimated one-step value distribution for the problem "Simplify $\frac{1}{5}\cdot \frac{8}{7}\div \frac{12}{20}$.".
  • Figure 5: Visualization of the estimated one-step value distribution for the problem "If $n \equiv 2 \pmod{7}$, then find the remainder when $(n + 2)(n + 4)(n + 6)$ is divided by 7.". The steps after the sixth step are omitted to save space. Their values concentrate at 0.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Definition 4.4
  • proof
  • proof
  • proof