Improving Value-based Process Verifier via Structural Prior Injection

Zetian Sun, Dongfang Li, Baotian Hu, Jun Yu, Min Zhang

TL;DR

The paper tackles noise in Monte Carlo value estimation for LLM reasoning by introducing a structural prior that recasts the scalar state value as the expectation of a predefined categorical distribution, enabling distribution-focused optimization. It defines a Statistics-based Distance to guide posterior distribution selection and explores two optimization paths: mean-squared error over the bin expectations and Histogram Loss with a ground-truth-like posterior, including Binomial$(k,p)$ modeling. Across Best-of-N and beam-search tasks on the MATH dataset, the approach yields consistent 1–2 point gains and shows that the choice of prior distribution substantially impacts performance. The work highlights the potential of distribution-aware priors in value-based process verifiers and outlines directions for differentiable distance metrics and broader priors to further improve reasoning reliability.

Abstract

In Large Language Model (LLM) reasoning, state values are often estimated via Monte Carlo sampling. Though Monte Carlo estimation is an elegant method with less inductive bias, noise and errors are inevitably introduced by the limited number of samples. To handle this problem, we inject a structural prior into the value representation and transform the scalar value into the expectation of a pre-defined categorical distribution, representing the noise and errors from a distribution perspective. Specifically, by treating the result of Monte Carlo sampling as a single sample from the prior ground-truth Binomial distribution, we quantify the sampling error as the mismatch between the posterior estimated distribution and the ground-truth distribution, which is then optimized via distribution selection optimization. We test the performance of value-based process verifiers on the Best-of-N task and the Beam search task. Compared with the scalar value representation, we show that reasonable structural prior injection, induced by different objective functions or optimization methods, can improve the performance of value-based process verifiers by about 1$\sim$2 points at little-to-no cost. We also show that under different structural priors, the verifiers' performance varies greatly despite having the same optimal solution, indicating the importance of reasonable structural prior injection.
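The idea of representing a scalar value as the expectation of a categorical distribution can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of `k + 1` bins centered at `i / k`, and the use of the observed success rate as the Binomial parameter are all assumptions made for illustration. It contrasts the two optimization paths mentioned in the TL;DR, mean-squared error on the bin expectation and a histogram (cross-entropy) loss against a Binomial-shaped target.

```python
import math

def binomial_pmf(k, p):
    """Probability of i successes out of k rollouts, for i = 0..k."""
    return [math.comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k + 1)]

def bin_centers(k):
    """Assumed bin layout: one bin per possible success rate i / k."""
    return [i / k for i in range(k + 1)]

def expectation(probs, centers):
    """Scalar value recovered as the expectation of the categorical distribution."""
    return sum(q * c for q, c in zip(probs, centers))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse_on_expectation(logits, centers, target_value):
    """Path 1: squared error between the predicted expectation and the scalar target."""
    predicted = expectation(softmax(logits), centers)
    return (predicted - target_value) ** 2

def histogram_loss(logits, target_probs):
    """Path 2: cross-entropy between a target distribution and the prediction."""
    probs = softmax(logits)
    return -sum(t * math.log(q) for t, q in zip(target_probs, probs) if t > 0)

# Hypothetical example: 6 of k = 8 Monte Carlo rollouts reach a correct answer.
k, successes = 8, 6
p_hat = successes / k                # noisy Monte Carlo estimate of the state value
centers = bin_centers(k)
target = binomial_pmf(k, p_hat)      # assumption: Binomial(k, p_hat) used as the target

uniform_logits = [0.0] * (k + 1)     # an untrained verifier head, for illustration
loss_mse = mse_on_expectation(uniform_logits, centers, p_hat)
loss_hist = histogram_loss(uniform_logits, target)
```

Under this bin layout the expectation of the Binomial target equals `p_hat`, so both losses share the same optimal scalar value while shaping the predicted distribution differently, which is the property the abstract refers to when it notes that different priors have the same optimal solution.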

Paper Structure

This paper contains 25 sections, 20 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Ablation study on posterior distribution selection. We compare different posterior distributions, measured by the statistics-based distance, against their Best-of-N performance on MATH500.
  • Figure 2: Ablation study on categorical distribution selection. We compare categorical distributions that differ in the Dirac delta function definition and in the number of categories, and report Best-of-N performance on MATH500.
  • Figure 3: The entropy of the estimated state values on the verifier's training dataset.

Theorems & Definitions (1)

  • Definition 3.1