Iterative Deepening Sampling as Efficient Test-Time Scaling

Weizhe Chen, Sven Koenig, Bistra Dilkina

TL;DR

The paper tackles the challenge of efficient test-time scaling for fixed LLMs by introducing Iterative Deepening Sampling (ID-Sampling), which strategically injects self-correction triggers as the generation budget grows. By balancing early-triggered refinement with budget expansion, ID-Sampling improves performance on hard reasoning tasks (Math-500, AIME) across Best-of-N and voting-based aggregation, while maintaining practical runtime overhead. The authors provide theoretical guarantees on token usage, extensive experiments, and ablations that highlight the sensitivity to the budget growth factor γ. Limitations include model-dependence and practical integration costs with KV-caches, suggesting future work on extending the approach to tree-search strategies and more robust reward models.
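The two aggregation schemes named above can be illustrated with a minimal sketch. It is not taken from the paper: the function names `best_of_n` and `majority_vote` and the `score` callable (standing in for a reward model) are hypothetical and used only for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score.

    `score` is a hypothetical stand-in for a reward model.
    """
    return max(answers, key=score)


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    samples = ["12", "15", "12", "7"]  # toy final answers from four samples
    print(majority_vote(samples))
    print(best_of_n(samples, score=float))  # toy scorer, not a real reward model
```

Either function can be applied to a pool of sampled answers regardless of how the samples were generated, which is why a sampling method can be evaluated under both schemes.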

Abstract

Recent reasoning models, such as OpenAI's O1 series, have demonstrated exceptional performance on complex reasoning tasks and revealed new test-time scaling laws. Inspired by this, many people have been studying how to train models to achieve effective self-evaluation and self-correction to further enable the scaling paradigm. However, less studied is how to efficiently scale test-time compute from a fixed model, and this remains a challenge. In this paper, we address this challenge by focusing on enhancing the quality of self-reflection data generation for complex problem-solving at test time, which can also subsequently improve the training of next-generation large language models (LLMs). Specifically, we explore how systematically triggering a model's self-correction mechanisms can improve performance on challenging reasoning tasks. To this end, we propose a novel iterative deepening sampling algorithm framework designed to enhance self-correction and generate higher-quality samples. Through extensive experiments on Math500 and AIME benchmarks, we demonstrate that our method achieves a higher success rate on difficult tasks and provide detailed ablation studies to analyze its effectiveness across diverse settings.

Paper Structure

This paper contains 23 sections, 1 theorem, 8 figures, 3 tables, and 1 algorithm.

Key Result

Theorem 4.1

Suppose the final answer obtained through ID-Sampling needs a budget of $L$ in normal sampling without manual injection. Then the total budget used is no more than $\frac{\gamma L}{\gamma - 1}$.
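A bound of this form is what a geometric series gives. The sketch below checks it numerically under one possible reading, which is an assumption here rather than the paper's proof: round $i$ uses a budget of $B_0 \gamma^i$, every round counts toward the total, and the last round's budget does not exceed $L$. The names `id_budget_schedule` and `theorem_bound` are hypothetical.

```python
def id_budget_schedule(initial_budget: float, gamma: float, needed: float) -> list[float]:
    """Round budgets B_0, gamma*B_0, ... up to the largest one not exceeding `needed`.

    Assumed schedule for illustration; the paper's exact procedure may differ.
    """
    if gamma <= 1:
        raise ValueError("gamma must be greater than 1")
    if not 0 < initial_budget <= needed:
        raise ValueError("initial_budget must be in (0, needed]")
    budgets = [float(initial_budget)]
    while budgets[-1] * gamma <= needed:
        budgets.append(budgets[-1] * gamma)
    return budgets


def theorem_bound(gamma: float, needed: float) -> float:
    """Upper bound gamma * L / (gamma - 1) stated in Theorem 4.1."""
    return gamma * needed / (gamma - 1)


if __name__ == "__main__":
    schedule = id_budget_schedule(initial_budget=256, gamma=2.0, needed=4096)
    total = sum(schedule)
    print(schedule, total, theorem_bound(2.0, 4096))
    assert total <= theorem_bound(2.0, 4096)
```

Under these assumptions the total is a finite geometric sum whose largest term is at most $L$, so it stays below $\frac{\gamma L}{\gamma - 1}$. A larger $\gamma$ tightens the bound toward $L$ but makes the schedule coarser.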

Figures (8)

  • Figure 1: An illustration of our method, Iterative Deepening Sampling (ID-sampling), where $B_0$ is the initial sampling budget.
  • Figure 2: The frequency of thinking-related linguistic markers appearing in Deepseek-R1 on the AIME-2024 dataset.
  • Figure 2: Pass@1 and cons@32 comparison between vanilla sampling and ID-sampling for Deepseek-R1-distill-Qwen-32B on the AIME-24 dataset.
  • Figure 3: Math-500 dataset: Pass rate results for different models. The x-axis is the equivalent N after considering the extra time used by ID-sampling.
  • Figure 4: Qwen3-8B results on AIME-24.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Theorem 4.1