SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation

Yixiang Chen, Tianshi Zheng, Shijue Huang, Zhitao He, Yi R. Fung

TL;DR

Self-Redraft studies whether an LLM can balance exploration and exploitation on its own during test-time scaling for code generation, without execution feedback. It extends Self-Refine by explicitly prompting the model to write a new draft when it judges the current solution to be fundamentally flawed, which yields a hybrid search strategy in an execution-free setting. On LiveCodeBench across six models, Self-Redraft yields modest gains over Self-Refine but does not reach the potential suggested by the pass@8 upper bound, highlighting gaps in the models' self-guided exploration. The study identifies bottlenecks in feedback quality and discriminative judgment, reveals model-specific balancing behavior, and establishes a baseline for future work on improving critique, adaptation, and exploration strategies for robust, real-world code generation.

Abstract

Test-time scaling without interpreter feedback is essential for real-world code generation scenarios where test cases are not readily available. While existing paradigms often rely on either greedy exploitation (i.e., iterative refinement) or stochastic exploration (i.e., relying on sample-based voting or reranking mechanisms), the balance between these two dimensions remains underexplored. To investigate the LLM's intrinsic ability to balance exploitation and exploration, we introduce SELF-REDRAFT, a framework built upon Self-Refine that encourages the model to propose new drafts for solutions that are fundamentally flawed. Our results show that SELF-REDRAFT consistently achieves better performance than Self-Refine when converged under the same maximum number of iterations. Still, we observe that significant room for improvement remains, largely due to two core aspects of current self-redraft capabilities: constrained capacity for generating instructive feedback and fragile discriminative judgment. We also find that balancing strategies vary notably across different LLMs, reflecting distinct, model-specific behaviors. Overall, our study establishes a baseline for intrinsic exploration-exploitation balancing in test-time scaling and identifies feedback and discrimination as key areas with potential for future advances.
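The loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the three-way verdict (`pass`, `refine`, `redraft`), the callable names, and the early-stop rule are assumptions made for illustration, and each callable stands in for an LLM prompt that the paper defines elsewhere. No code is executed at any point, matching the execution-free setting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Feedback:
    critique: str  # natural-language self-critique of the current solution
    action: str    # hypothetical verdict: "pass", "refine", or "redraft"


def self_redraft(
    problem: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Feedback],
    refine: Callable[[str, str, str], str],
    redraft: Callable[[str, str, str], str],
    max_iters: int = 16,
) -> str:
    """Iterate critique -> (refine | redraft) until the model accepts or budget ends."""
    solution = generate(problem)
    for _ in range(max_iters):
        fb = critique(problem, solution)
        if fb.action == "pass":
            break  # assumed convergence rule: the model judges its solution correct
        if fb.action == "redraft":
            # Exploration: discard the approach and write a new draft.
            solution = redraft(problem, solution, fb.critique)
        else:
            # Exploitation: keep the approach and patch it (plain Self-Refine step).
            solution = refine(problem, solution, fb.critique)
    return solution
```

Under this sketch, Self-Refine is the special case in which the critique step never returns `redraft`, so both methods can be compared under the same iteration budget.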

Paper Structure

This paper contains 26 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Our proposed Self-Redraft framework.
  • Figure 2: Detailed benchmark performance of LLMs on LiveCodeBench, evaluated with 16 iterations of Self-Refine and 16 iterations of Self-Redraft.
  • Figure 3: Comparison of Self-Redraft×16 and pass@8 accuracies on LiveCodeBench.
  • Figure 4: Recall on Draft versus absolute improvement of Self-Redraft×16 over Self-Refine×16.
  • Figure 5: Recall on Draft as annotated by three auxiliary models: GPT-5 mini, GLM-4.6 and Grok 4 Fast.
  • ...and 3 more figures
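
Figure 3 uses pass@8 as a reference point. For readers unfamiliar with the metric, the sketch below shows the standard unbiased pass@k estimator commonly used in code-generation evaluation. Whether the paper computes pass@8 with this estimator or directly from exactly eight samples is an assumption here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples drawn per problem, c: how many of them are correct, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots without a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With exactly eight samples per problem, `pass_at_k(8, c, 8)` is 1.0 if any sample is correct and 0.0 otherwise, which is why pass@8 can be read as what an ideal selector over eight independent drafts could reach.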