Iterative Deepening Sampling as Efficient Test-Time Scaling
Weizhe Chen, Sven Koenig, Bistra Dilkina
TL;DR
The paper tackles efficient test-time scaling for a fixed LLM by introducing Iterative Deepening Sampling (ID-Sampling), which injects self-correction triggers as the generation budget grows. By balancing early-triggered refinement against budget expansion, ID-Sampling improves performance on hard reasoning tasks (MATH-500, AIME) under both Best-of-N and voting-based aggregation, while keeping runtime overhead practical. The authors provide theoretical guarantees on token usage, extensive experiments, and ablations showing that results are sensitive to the budget growth factor γ. Limitations include dependence on the underlying model and the practical cost of integrating the method with KV-caches. Suggested future work includes extending the approach to tree-search strategies and using more robust reward models.
Abstract
Recent reasoning models, such as OpenAI's o1 series, have demonstrated exceptional performance on complex reasoning tasks and revealed new test-time scaling laws. Inspired by this, many researchers have studied how to train models for effective self-evaluation and self-correction to further enable this scaling paradigm. However, how to scale test-time compute efficiently for a fixed model is less studied and remains a challenge. In this paper, we address this challenge by improving the quality of self-reflection data generated for complex problem-solving at test time, which can in turn improve the training of next-generation large language models (LLMs). Specifically, we explore how systematically triggering a model's self-correction mechanisms can improve performance on challenging reasoning tasks. To this end, we propose a novel iterative deepening sampling framework designed to enhance self-correction and generate higher-quality samples. Through extensive experiments on the MATH-500 and AIME benchmarks, we demonstrate that our method achieves a higher success rate on difficult tasks, and we provide detailed ablation studies that analyze its effectiveness across diverse settings.
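The summary above does not spell out the sampling loop, so the following Python sketch shows only one plausible reading of it: generation proceeds up to a token budget, a self-correction trigger is appended if the model has not finished, and the budget is then multiplied by γ. The names `budget_schedule`, `id_sample`, `generate`, `EOS`, and the trigger text `"Wait"` are all hypothetical and chosen for illustration; they are not taken from the paper or from any library.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence marker


def budget_schedule(initial_budget: int, gamma: float, max_budget: int) -> List[int]:
    """Geometrically growing budgets B0, gamma*B0, ... strictly below max_budget."""
    if initial_budget <= 0 or gamma <= 1.0:
        raise ValueError("need initial_budget > 0 and gamma > 1")
    budgets: List[int] = []
    b = float(initial_budget)
    while b < max_budget:
        budgets.append(int(b))
        b *= gamma
    return budgets


def id_sample(
    prompt: Sequence[str],
    generate: Callable[[List[str], int], List[str]],
    initial_budget: int,
    gamma: float,
    max_budget: int,
    trigger: Sequence[str] = ("Wait",),
) -> Tuple[List[str], int]:
    """Sample one response, injecting `trigger` each time a budget is exhausted.

    `generate(context, k)` is an assumed interface: it returns at most k new
    tokens and ends with EOS if the model chose to stop.
    Returns the generated tokens and the number of triggers injected.
    """
    output: List[str] = []
    triggers_used = 0
    for budget in budget_schedule(initial_budget, gamma, max_budget):
        remaining = budget - len(output)
        if remaining <= 0:
            continue  # an earlier trigger already consumed this budget step
        new_tokens = generate(list(prompt) + output, remaining)
        output.extend(new_tokens)
        if new_tokens and new_tokens[-1] == EOS:
            return output, triggers_used  # model finished within budget
        if len(output) + len(trigger) >= max_budget:
            break  # no room left for a trigger plus further tokens
        output.extend(trigger)  # prompt the model to re-examine its work
        triggers_used += 1
    remaining = max_budget - len(output)
    if remaining > 0:
        output.extend(generate(list(prompt) + output, remaining))
    return output, triggers_used
```

Because the budgets grow geometrically, the number of injected triggers grows only logarithmically in `max_budget`; a smaller γ triggers self-correction more often, which is consistent with the reported sensitivity to γ. A Best-of-N or voting setup would call `id_sample` N times and aggregate the results.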
