Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement

Xiaoqing Zhang, Yuhan Liu, Flood Sung, Xiuying Chen, Shuo Shang, Rui Yan

TL;DR

ThinkCoder tackles the latency of test-time code generation by pairing a thorough exploration phase with an optimal refinement phase, guided by a non-LLM CodeVerifier and a dynamic Testing Pool. The framework is augmented with Reinforced Self-Training (ReST), which learns from successful exploration trajectories to fine-tune the LLM offline, reducing online compute. Empirical results show ThinkCoder achieving state-of-the-art or competitive Pass@1 on MBPP and HumanEval with substantially lower computational overhead, and ReST enabling efficient performance for smaller LLMs like LLaMA2-7B. The approach offers a scalable path toward high-quality, cost-efficient code generation, with demonstrated gains across diverse benchmarks and model scales, and points to future work on on-policy training to sustain exploration diversity.

Abstract

Code generation is crucial in software engineering for automating the coding process efficiently. While test-time computation methods show promise, they suffer from high latency due to multiple computation rounds. To overcome this, we introduce ThinkCoder, a framework that combines thorough exploration with optimal refinement. The exploration phase diversifies the solution space by searching for potential solutions, followed by a refinement phase that enhances precision. This approach allows us to select the best solution through careful consideration before taking action, avoiding excessive trial and error. To further minimize test-time computation overhead, we introduce preference-driven optimization with Reinforced Self-Training (ReST), which uses exploration trajectories from ThinkCoder to guide the LLM's evolution. This approach enhances the LLM's exploration efficiency via preference learning, cutting costs while maintaining accuracy. ThinkCoder boosts performance with a single LLM, excelling on benchmarks like HumanEval and MBPP. Compared to SOTA models, it improves Pass@1 by 3.0% over MapCoder with just 6.4% of the computation cost. Against AgentCoder, ThinkCoder achieves a 0.5% higher Pass@1 after 2 rounds, outperforming AgentCoder's 5 rounds. Additionally, ReST with success trajectories enhances efficiency, allowing models like LLaMA2-7B to achieve competitive results using only 20% of the computational resources. These results highlight the framework's effectiveness and scalability.
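The Pass@1 numbers quoted above are easier to interpret with the metric written out. The sketch below uses the unbiased estimator commonly reported for HumanEval-style benchmarks, $\text{pass@}k = 1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them pass the ground-truth tests. Whether this paper computes Pass@1 with this estimator or as a plain fraction of solved problems is an assumption here; the function names are illustrative only.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples with c correct, is correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made only of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_samples, n_correct)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return sum(scores) / len(scores)
```

For $k = 1$ the estimator reduces to $c / n$, so Pass@1 is the expected success rate of a single sampled solution.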

Paper Structure

This paper contains 32 sections, 1 equation, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: The end-to-end process of ThinkCoder involves $k$ thorough exploration steps followed by $n$ optimal refinement cycles. The Exploration Agent generates $k$ codes and tests simultaneously during self-exploration, storing results in a Testing Pool. The optimal refinement phase includes self-verification that selects the optimal solution with the CodeVerifier and aggregate reflection for the instruction update. The optimal refinement will be repeated recursively for $n$ cycles, ultimately leading to the final solution.
  • Figure 2: Trajectory collection with ThinkCoder and its application in ReST training for LLMs. We collect success trajectories offline based on the verification with ground truth tests, ensuring the solution aligns with human preferences. For reflection, we use LLM-generated tests, ensuring LLMs specifically address their own mistakes.
  • Figure 3: (a) The Pass@1 metric of the baseline models under different temperature $t$. (b) The Pass@k metric of the baseline models, where $k$ represents the exploration budget. (c) The variation of the Pass@1 metric for each baseline model under the ThinkCoder framework as the optimal refinement budget $n$ increases.
  • Figure 4: The variation in Pass@1 performance of the exploration code and test cases on the MBPP dataset under ThinkCoder, with CodeQwen1.5-7B-Chat as the base LLM.
  • Figure 5: The relationship between the computational cost and performance of ThinkCoder at different budget control thresholds, where 'Pass@1' indicates the performance on the MBPP dataset, and '$\theta$' refers to the budget control threshold that allows the task to execute the next exploration process. 'Exploration Budget' represents the ratio of total requests during each iteration.
  • ...and 5 more figures
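
The process described in the Figure 1 caption can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the LLM calls are replaced by hypothetical stubs (`make_stub_explorer`, `stub_reflect`), tests are represented as `(args, expected)` pairs, and two behaviors are assumptions made for the sketch: each refinement cycle re-runs $k$ exploration steps with the updated instruction, and the loop stops early once the selected code passes every pooled test.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Test = Tuple[tuple, object]        # (input args, expected output); assumed format
Candidate = Callable[..., object]  # a generated solution, modeled as a callable


@dataclass
class TestingPool:
    """Accumulates every generated code and test across cycles."""
    codes: List[Candidate] = field(default_factory=list)
    tests: List[Test] = field(default_factory=list)


def passes(code: Candidate, test: Test) -> bool:
    args, expected = test
    try:
        return code(*args) == expected
    except Exception:
        return False  # a crashing candidate simply fails the test


def code_verifier(pool: TestingPool) -> Tuple[Optional[Candidate], float]:
    """Non-LLM selection: return the code with the highest pass rate on the pool."""
    best, best_rate = None, -1.0
    for code in pool.codes:
        rate = sum(passes(code, t) for t in pool.tests) / max(len(pool.tests), 1)
        if rate > best_rate:
            best, best_rate = code, rate
    return best, best_rate


def think_coder(explore, reflect, instruction: str, k: int, n: int):
    pool, best = TestingPool(), None
    for _ in range(n):                      # n optimal refinement cycles
        for _ in range(k):                  # k exploration steps
            code, tests = explore(instruction)
            pool.codes.append(code)
            pool.tests.extend(tests)
        best, rate = code_verifier(pool)    # self-verification
        if rate == 1.0:                     # early stop (assumption of this sketch)
            break
        instruction = reflect(instruction, best, pool)  # instruction update
    return best


def make_stub_explorer():
    """Hypothetical stand-in for the Exploration Agent (no LLM involved)."""
    candidates = [lambda x: x, lambda x: -x if x < 0 else x, lambda x: abs(x) + 1]
    tests = [((3,), 3), ((-2,), 2), ((0,), 0)]
    state = {"i": 0}

    def explore(instruction: str):
        i = state["i"]
        state["i"] += 1
        return candidates[i % len(candidates)], [tests[i % len(tests)]]

    return explore


def stub_reflect(instruction: str, best, pool: TestingPool) -> str:
    """Hypothetical reflection step: append a generic revision hint."""
    return instruction + " Revise: some pooled tests still fail."
```

Because selection depends only on executing pooled tests, the quality of the chosen code is bounded by the quality of the generated tests: a weak pool can let an incorrect candidate reach a perfect pass rate.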