Value-Guided Search for Efficient Chain-of-Thought Reasoning

Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun

TL;DR

The paper tackles the high compute cost of long-context chain-of-thought reasoning in large language models. It introduces Value-Guided Search (VGS), which uses a token-level value model, trained without explicit step annotations, to guide block-wise search at test time. A 1.5B-parameter value model is trained on 2.5 million reasoning traces and applied to DeepSeek models with a block size of 4096 and DVTS (Diverse Verifier Tree Search). The result is better test-time compute scaling and lower inference FLOPs than strong baselines, along with a higher performance ceiling. The authors release the OpenR1-VM dataset, the value model, and code, enabling reproducibility and extension to other verifiable domains. They note that the value model may suffer from distribution drift as generator policies evolve, and suggest retraining as a practical remedy.

Abstract

In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model and codebase are open-sourced.
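The three aggregation rules named in the abstract (majority voting, best-of-n, and weighted majority voting) can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the toy answers, and the scores are all hypothetical, and the scores stand in for whatever a value model would assign to each complete response.

```python
from collections import defaultdict

def majority_vote(answers):
    """Pick the most frequent final answer; no scorer is needed."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)

def best_of_n(answers, scores):
    """Pick the final answer of the single highest-scoring response."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]

def weighted_majority_vote(answers, scores):
    """Sum the scores of responses sharing an answer; pick the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical final answers from six sampled responses, with made-up scores.
answers = ["12", "12", "12", "7", "7", "3"]
scores = [0.1, 0.1, 0.1, 0.4, 0.4, 0.7]

mv = majority_vote(answers)                    # "12": most frequent answer
bon = best_of_n(answers, scores)               # "3": single best score
wmv = weighted_majority_vote(answers, scores)  # "7": largest summed score
```

On this toy input the three rules disagree, which shows why the choice of final aggregation step matters: weighted majority voting uses both how often an answer appears and how highly its responses are scored.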

Paper Structure

This paper contains 29 sections, 5 equations, 20 figures, 5 tables, 3 algorithms.

Figures (20)

  • Figure 1: Performance and Efficiency of Value Guidance: (Left) Value-guided search improves the overall quality of DeepSeek-R1-Distill responses across combined competition math benchmarks (AIME 24, 25 & HMMT Feb 24, 25). The inference budgets for the 1.5B, 7B and 14B models are $256$, $128$ and $64$ generations, respectively. (Right) Value-guided search also reduces the inference FLOPs required to achieve the same accuracy levels as majority voting, a standard test-time compute (TTC) scaling baseline, showing value-guidance is promising for improving efficiency.
  • Figure 2: Summary of Methods. (Left) Diagrams how we collect multiple roll-ins (grey circles representing tokens) per problem, and branch off multiple roll-outs per roll-in at random points. The class label for each roll-out token is the outcome label at the very end. (Right) Shows the beam search process (beam width $2$ and budget $4$) guided by a value model.
  • Figure 3: Test-Time Compute with DeepSeek-VM-1.5B. (Left) Compares best-of-$N$ (BoN), weighted majority voting (WMV) and VGS with either BoN or WMV for the final aggregation. (Right) Compares VGS to majority voting (MV), a standard baseline that does not require a scorer.
  • Figure 4: TTC Scaling of Various Scorers. Comparison of our 1.5B value model (VM), our 1.5B Bradley-Terry reward model (BT), and two 7B state-of-the-art PRMs for two TTC scaling methods: (Left) WMV or (Right) VGS (with WMV as a final aggregation step).
  • Figure 5: VGS + WMV Performance when Guiding Larger Models. With the same DeepSeek-VM-1.5B providing guidance, search continues to improve with more test-time compute.
  • ...and 15 more figures
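
The beam search pictured in Figure 2 (Right), with beam width $2$ and budget $4$, can be sketched in a few lines. This is a toy illustration under stated assumptions rather than the paper's implementation: `sample_block`, `value_fn` and `is_done` are hypothetical callables standing in for the generator, the value model and an end-of-response check, and the rule of keeping budget / beam width partial responses is an assumption of this sketch.

```python
import random

def value_guided_beam_search(sample_block, value_fn, is_done,
                             budget=4, beam_width=2, max_steps=16):
    """Block-wise beam search: expand each beam, keep the highest-valued prefixes.

    sample_block(prefix) -> next block (hypothetical generator call)
    value_fn(prefix)     -> score of a partial response (hypothetical value model)
    is_done(prefix)      -> True once the response is complete
    """
    num_beams = budget // beam_width      # assumption: budget is split across beams
    beams = [()] * num_beams              # each beam is a tuple of blocks
    finished = []
    for _step in range(max_steps):
        if not beams:
            break
        # Branch every surviving beam into `beam_width` one-block continuations.
        candidates = [beam + (sample_block(beam),)
                      for beam in beams
                      for _child in range(beam_width)]
        # Rank continuations by value and keep as many as there were beams.
        candidates.sort(key=value_fn, reverse=True)
        survivors = candidates[:len(beams)]
        finished.extend(b for b in survivors if is_done(b))
        beams = [b for b in survivors if not is_done(b)]
    return finished

# Toy stand-ins: a "block" is a random digit, a response ends after three
# blocks, and the "value" of a prefix is simply the sum of its digits.
rng = random.Random(0)
responses = value_guided_beam_search(
    sample_block=lambda prefix: rng.randint(0, 9),
    value_fn=sum,
    is_done=lambda prefix: len(prefix) == 3,
)
print(responses)
```

In the setting the paper describes, each block would be a chunk of generated tokens and the value would come from the trained value model; the completed responses would then be combined by a final aggregation step such as weighted majority voting.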