Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning

Fei Yu, Yingru Li, Benyou Wang

TL;DR

This work investigates scaling flaws in verifier-guided search for mathematical reasoning with LLMs, showing that verifiers such as OVM and PRM misrank or prune valid partial paths as the search scales, and that this problem intensifies on harder and out-of-distribution tasks. Through systematic experiments on GSM8K and MATH using Mistral 7B and DeepSeekMath 7B, the authors identify selection-stage verifier failures as the main bottleneck and reveal that increasing the candidate pool or parallel paths yields diminishing returns relative to repeated sampling. They analyze generation versus selection and demonstrate that many failures occur during selection, especially when valid paths are sparse. To mitigate these issues, they explore reducing reliance on verifiers via stochastic selection and a one-time Monte Carlo rollout, finding meaningful improvements and highlighting the limits of verifier-guided approaches. Overall, the paper emphasizes fundamental limitations of verifier-driven search in scalable reasoning and suggests directions toward uncertainty-aware, verifier-agnostic strategies for practical deployment.

Abstract

Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement. Verifier-guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths. However, we identify a critical limitation: scaling flaws, prevalent across different models (Mistral 7B and DeepSeekMath 7B), benchmarks (GSM8K and MATH), and verifiers (outcome value models and process reward models). As sample size increases, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling. Our analysis attributes this to verifier failures, where imperfect verifiers misrank candidates and erroneously prune all valid paths. These issues are further exacerbated in challenging and out-of-distribution problems, restricting search effectiveness. To mitigate verifier failures, we explore reducing reliance on verifiers and conduct preliminary investigations using two simple methods. Our findings reveal fundamental limitations in verifier-guided search and suggest future directions.
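The failure mode named in the abstract, where an imperfect verifier misranks candidates and prunes every valid path, can be illustrated with a toy simulation. The sketch below is not the authors' method or experimental setup. It assumes a hypothetical scoring model in which each candidate's verifier score is its validity (1 or 0) plus Gaussian noise, and it uses plain top-k selection. All function names and parameters are illustrative assumptions.

```python
import random


def select_top_k(scores, k):
    """Return indices of the k highest-scoring candidates (deterministic selection)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def all_valid_pruned(valid, kept):
    """True if at least one valid candidate existed but none survived selection."""
    return any(valid) and not any(valid[i] for i in kept)


def simulate_failure_rate(n_candidates, k, p_valid, noise, trials=2000, seed=0):
    """Estimate how often top-k selection discards every valid candidate.

    Assumed toy model (not from the paper): each candidate is valid with
    probability p_valid, and the verifier score is the validity indicator
    plus Gaussian noise with standard deviation `noise`.
    Trials with no valid candidate are skipped, since selection cannot fail there.
    """
    rng = random.Random(seed)
    failures = 0
    applicable = 0
    for _ in range(trials):
        valid = [rng.random() < p_valid for _ in range(n_candidates)]
        if not any(valid):
            continue
        applicable += 1
        scores = [(1.0 if v else 0.0) + rng.gauss(0.0, noise) for v in valid]
        kept = select_top_k(scores, k)
        if all_valid_pruned(valid, kept):
            failures += 1
    return failures / applicable if applicable else 0.0


if __name__ == "__main__":
    # Compare a dense and a sparse regime of valid candidates under the same noise.
    for p_valid in (0.5, 0.05):
        rate = simulate_failure_rate(n_candidates=20, k=2, p_valid=p_valid, noise=1.0)
        print(f"p_valid={p_valid}: estimated all-valid-pruned rate = {rate:.3f}")
```

With a noiseless verifier the failure rate in this toy model is zero by construction. Varying `noise` and `p_valid` gives a rough feel for how selection errors interact with the sparsity of valid paths, though the numbers say nothing about the paper's actual results.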

Paper Structure

This paper contains 51 sections, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Scaling Flaws in OVM-guided search and PRM-guided search on GSM8K and MATH (scaling of sample sizes). While verifier-guided search outperforms repeated sampling initially, its performance increases at a slower rate, ultimately underperforming repeated sampling.
  • Figure 2: Scaling Flaws in OVM-guided search and PRM-guided search on MATH and OOD-L5 (scaling of generated candidate size).
  • Figure 3: Scaling failures of verifier selection at the first selection stage across various beam sizes on MATH and OOD-L5.
  • Figure 4: Distribution of OVM failures across groups based on valid path sparsity on MATH and OOD-L5 (DeepSeekMath 7B).

Theorems & Definitions (1)

  • Definition