Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking

Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, Andrej Risteski, Dylan J. Foster

TL;DR

This paper investigates how to reliably guide language-model decoding at test time using imperfect process verifiers. It introduces Value-Guided Sampling with Stochastic Backtracking (VGB), a backtracking-based decoding algorithm that extends the Sinclair–Jerrum random walk to general tilt functions and approximate value estimates, thereby mitigating error amplification in long-horizon generations. The authors provide rigorous uniform-error and average-case guarantees for VGB, showing it achieves fast mixing and good coverage despite imperfect verifiers, and they show how to implement it efficiently even with large action spaces. Empirically, VGB demonstrates improved distributional fidelity and coherence across synthetic and real-language tasks, including constrained generation and code/test-case generation, at the cost of additional computation. Overall, the work bridges classical approximate-sampling theory with modern test-time alignment, offering provable guarantees and practical insights for robust long-horizon reasoning with language models.

Abstract

Test-time algorithms that combine the generative power of language models with process verifiers that assess the quality of partial generations offer a promising lever for eliciting new reasoning capabilities, but the algorithmic design space and computational scaling properties of such approaches are still opaque, and their benefits are far from apparent when one accounts for the cost of learning a high-quality verifier. Our starting point is the observation that seemingly benign errors in a learned verifier can lead to catastrophic failures for standard decoding techniques due to error amplification during the course of generation. We then ask: can this be improved with more sophisticated decoding strategies? We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors. VGB interprets autoregressive generation as a random walk on a tree of partial generations, with transition probabilities guided by the process verifier and base model; crucially, backtracking occurs probabilistically. This process generalizes the seminal Sinclair-Jerrum random walk (Sinclair & Jerrum, 1989) from the literature on approximate counting and sampling in theoretical computer science, and a conceptual contribution of our work is to highlight parallels with this literature. Empirically, we demonstrate on both synthetic and real language modeling tasks that VGB outperforms baselines on a variety of metrics.
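
The random-walk view in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation. It assumes, in the style of a Sinclair-Jerrum walk, that from a prefix the walk stays put with probability 1/2, moves to the parent with weight equal to the verifier value of the current prefix, and moves to each child with weight equal to the base-model probability of the next token times the verifier value of the child. The names `pi_ref`, `value`, and `vgb_style_walk`, the two-letter alphabet, and the toy verifier are all hypothetical.

```python
import random


def pi_ref(prefix, token):
    """Toy base model (hypothetical): uniform over a two-letter alphabet."""
    return 0.5


def value(prefix):
    """Toy process verifier (hypothetical): 1.0 if no 'b' so far, else 0.0."""
    return 0.0 if "b" in prefix else 1.0


def vgb_style_walk(pi_ref, value, alphabet, horizon, num_steps, rng):
    """Lazy random walk on the tree of prefixes with stochastic backtracking.

    Assumed transition weights (not taken from the source text):
      parent: value(current prefix)
      child:  pi_ref(token | current prefix) * value(child prefix)
    """
    node = ()  # root of the tree: the empty generation
    for _ in range(num_steps):
        if rng.random() < 0.5:
            continue  # lazy self-loop
        moves, weights = [], []
        if node:  # backtracking is one of the possible moves
            moves.append(node[:-1])
            weights.append(value(node))
        if len(node) < horizon:  # extend by one token
            for token in alphabet:
                child = node + (token,)
                moves.append(child)
                weights.append(pi_ref(node, token) * value(child))
        if sum(weights) > 0:
            node = rng.choices(moves, weights=weights, k=1)[0]
    return node


sample = vgb_style_walk(pi_ref, value, "ab", horizon=3, num_steps=200,
                        rng=random.Random(0))
print(sample, "leaf" if len(sample) == 3 else "not a leaf")
```

With this toy verifier, prefixes containing `b` have zero weight, so the walk only ever visits all-`a` prefixes. With a noisy learned verifier, the backtracking move is what would let the walk leave a prefix that was entered because of an overestimated value.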

Paper Structure

This paper contains 113 sections, 31 theorems, 194 equations, 21 figures, 3 tables, 8 algorithms.

Key Result

Theorem 1

For any prompt $x\in\mathcal{X}$ and $\delta>0$, under assump:unif-mgf, let $\widehat{\pi}$ be the output distribution of VGB with step count $T := \widetilde{O}\left( H^2 \cdot (1+\varepsilon_{V})^4 \cdot \log ( \delta^{-1}) \right)$, and let $\mathcal{E}_{\mathsf{leaf}}$ be the event that $y \sim \widehat

Figures (21)

  • Figure 1: Left/middle: Accuracy (x) and diversity (y) of VGB vs. Block Best-of-$N$ and Block Rejection Sampling on Dyck grammar task (\ref{sec:lm-experiments}) with pre-trained base model and trained value function. Each circle is a baseline with specific hyperparameters (block length $\in\{1,2,4,8,16\}$ and number of candidates $\in\{2,4,8,16,32\}$); darker red indicates larger block length. Right: Example snippet of the generation tree on which VGB walks, with transition probabilities from "$\mathfrak{a}$" (self-loop not shown).
  • Figure 2: Illustration of execution of VGB at each step $t$.
  • Figure 3: Estimated KL-divergence of VGB and ActionLevelRS to $\pi^{\star}$ in ABC task (\ref{sec:synthetic-experiments}) for varied horizon length $H$ and number of value function training samples $N$. We repeat the experiment $10$ times for each $(H,N)$ and report the mean and standard error. See \ref{sec:ABC_task} for details.
  • Figure 4: Comparison of VGB against ActionLevelRS for letter avoidance task (\ref{sec:congen-experiment}), with varied backtracking weight $\alpha$ for VGB. Left: Win rate of VGB against ActionLevelRS under pairwise comparison of responses by GPT-4o-mini (judging for coherence). Right: Average horizon-normalized log-probabilities evaluated by Qwen-2.5-1.5B. See \ref{sec:constrained_text_generation_task} for details.
  • Figure 5: Average step counts for ABC task (\ref{sec:ABC_task}). Each experiment is repeated $10$ times and we plot the mean and standard error.
  • ...and 16 more figures

Theorems & Definitions (59)

  • Remark 1: Sampling vs reward maximization
  • Example 1: Failure of ActionLevelRS under perturbation
  • Example 2: Failure of ActionLevelRS under delay
  • Remark 2: Implicit value functions
  • Remark 3: Efficient implementation for large action spaces
  • Theorem 1: Main guarantee for VGB
  • Remark 4: Access to outcome-level tilt $\tau$
  • Theorem 2
  • Proposition 1: Best-of-$N$ guarantee under approximate coverage
  • Example 3: Greedy decoding and beam search are suboptimal with $V^{\star}_{\texttt{tilt}}$
  • ...and 49 more