Value-Guided Search for Efficient Chain-of-Thought Reasoning
Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun
TL;DR
The paper addresses the high compute cost of long-context chain-of-thought reasoning in large language models. It introduces Value-Guided Search (VGS), which uses a token-level value model, trained without explicit step annotations, to guide block-wise search at test time. A 1.5B-parameter value model is trained on 2.5 million reasoning traces and applied to DeepSeek models with a block size of 4096 and DVTS. Compared with strong baselines, VGS scales better with test-time compute, uses fewer FLOPs, and raises the performance ceiling. The authors release the OpenR1-VM dataset, the value model, and the code, which supports reproducibility and extension to other verifiable domains. They also note that the value model may face distribution drift as generator policies evolve, and suggest retraining as a practical remedy.
Abstract
In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model, and codebase are open-sourced.
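The block-wise search and the final weighted majority vote described above can be illustrated with a minimal sketch. This is not the authors' released code: it uses a plain beam search rather than DVTS, and the names `value_guided_search`, `weighted_majority_vote`, `generate_block`, `value_fn`, and `is_finished` are hypothetical placeholders. The sketch assumes that a trace is a string, that `generate_block` samples one block of text from the generator, and that `value_fn` returns a scalar score from a value model.

```python
from collections import defaultdict
from typing import Callable, List, Sequence


def value_guided_search(
    prompt: str,
    generate_block: Callable[[str], str],  # hypothetical: samples the next block
    value_fn: Callable[[str], float],      # hypothetical: value-model score of a trace
    is_finished: Callable[[str], bool],    # hypothetical: True once a trace is complete
    beam_width: int = 2,
    branches: int = 2,
    max_blocks: int = 8,
) -> List[str]:
    """Block-wise beam search guided by a value function (illustrative only)."""
    beams = [prompt]
    finished: List[str] = []
    for _ in range(max_blocks):
        if not beams:
            break
        # Expand every live beam with `branches` sampled blocks.
        candidates = [
            beam + generate_block(beam)
            for beam in beams
            for _ in range(branches)
        ]
        # Keep only the candidates that the value function scores highest.
        candidates.sort(key=value_fn, reverse=True)
        kept = candidates[:beam_width]
        finished.extend(c for c in kept if is_finished(c))
        beams = [c for c in kept if not is_finished(c)]
    # Traces still unfinished after `max_blocks` rounds are dropped.
    return finished


def weighted_majority_vote(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the answer with the largest total score across candidates."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)
```

In use, one would extract a final answer from each finished trace, score each trace with the value model, and pass both lists to `weighted_majority_vote`. Scoring whole blocks rather than individual steps is what lets the search avoid any definition of a reasoning "step": the value function only ever sees a partial trace that ends at a block boundary.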
