Seed-CTS: Unleashing the Power of Tree Search for Superior Performance in Competitive Coding Tasks

Hao Wang, Boyi Liu, Yufeng Zhang, Jie Chen

TL;DR

Seed-CTS introduces a token-level Monte Carlo Tree Search (MCTS) framework integrated with Chain-of-Thought (CoT) prompting to boost competition-level code generation using open-source LLMs. By applying a P-UCB-guided search with TOP$_K$ expansion, hard/partial reward simulations, and backpropagation, the method significantly improves on pass@k baselines on LiveCodeBench-Medium and LiveCodeBench-Hard, with CoT prompting yielding near-proprietary-model performance in several settings (e.g., a pass rate of 0.351 on Hard for Qwen2.5-Coder-32B-Instruct). The approach is model-agnostic, is efficient (few generations per problem), shows the potential to synthesize high-quality SFT data directly from the target model, and achieves competitive results on CodeContest-Test as well. These findings suggest that token-level search combined with structured reasoning can substantially elevate open-source models on challenging code-generation tasks and reduce reliance on large black-box LLMs.
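The selection, reward, and backpropagation steps named above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the P-UCB form shown is one commonly used in planning-guided decoding work, the constants `c_base` and `c` are placeholders, the hard/partial reward definitions (all tests pass versus fraction of tests passed) are assumptions, and every name (`Node`, `p_ucb`, `select_child`, `backpropagate`) is hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                      # LLM probability of the token leading to this node
    visits: int = 0
    value: float = 0.0                # best reward seen below this node (assumption)
    children: dict = field(default_factory=dict)  # token -> Node


def p_ucb(parent, child, c_base=10.0, c=4.0):
    """Score a child with one common P-UCB form; c_base and c are placeholders."""
    beta = math.log((parent.visits + c_base + 1) / c_base) + c
    # Exploration bonus: weighted by the LLM prior, shrinking as the child is visited.
    bonus = (beta * child.prior
             * math.sqrt(math.log(max(parent.visits, 1)))
             / (1 + child.visits))
    return child.value + bonus


def select_child(parent):
    """Return the (token, child) pair with the highest P-UCB score."""
    return max(parent.children.items(), key=lambda kv: p_ucb(parent, kv[1]))


def hard_reward(test_results):
    """Assumed definition: 1.0 only if every test case passes."""
    return 1.0 if test_results and all(test_results) else 0.0


def partial_reward(test_results):
    """Assumed definition: fraction of test cases that pass."""
    return sum(test_results) / len(test_results) if test_results else 0.0


def backpropagate(path, reward):
    """Update visit counts and keep the best reward along the selected path."""
    for node in path:
        node.visits += 1
        node.value = max(node.value, reward)
```

In this sketch a rarely visited token with a lower prior can still be selected over a frequently visited one, because the exploration bonus is divided by the child's visit count.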

Abstract

Competition-level code generation tasks pose significant challenges for current state-of-the-art large language models (LLMs). For example, on the LiveCodeBench-Hard dataset, models such as O1-Mini and O1-Preview achieve pass@1 rates of only 0.366 and 0.143, respectively. While tree search techniques have proven effective in domains like mathematics and general coding, their potential in competition-level code generation remains under-explored. In this work, we propose a novel token-level tree search method specifically designed for code generation. Leveraging Qwen2.5-Coder-32B-Instruct, our approach achieves a pass rate of 0.305 on LiveCodeBench-Hard, surpassing the pass@100 performance of GPT4o-0513 (0.245). Furthermore, by integrating Chain-of-Thought (CoT) prompting, we improve our method's performance to 0.351, approaching O1-Mini's pass@1 rate. To ensure reproducibility, we report the average number of generations required per problem by our tree search method on the test set. Our findings underscore the potential of tree search to significantly enhance performance on competition-level code generation tasks. This opens up new possibilities for the large-scale synthesis of supervised fine-tuning (SFT) data for challenging code problems, advancing competition-level code generation.
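For readers comparing the pass@1 and pass@100 figures quoted above, the sketch below shows the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, for a problem with $n$ sampled generations of which $c$ are correct. Whether the paper computes its baselines with exactly this estimator is not stated in this section, so treat it as an assumption; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generations with c correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem, k: int) -> float:
    """Average pass@k over problems; per_problem is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

A dataset-level rate is then the mean over problems, for example `mean_pass_at_k([(100, 3), (100, 0)], k=10)`.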

Paper Structure

This paper contains 19 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Pass rates of MCTS with DeepSeekCoder-6.7B-Instruct and Qwen2.5-32B-Instruct on LiveCodeBench-Hard: each model surpasses its own pass@100 rate at $\text{max\_rollouts}=8$. Notably, MCTS with Qwen2.5-32B-Instruct and $\text{max\_rollouts}=16$ outperforms the pass@100 of both Qwen2.5-72B-Instruct and GPT4o-0513. In addition, when combined with CoT prompting, MCTS with Qwen2.5-32B-Instruct achieves a pass rate of 0.351, nearing O1-Mini's pass@1 rate of 0.366.
  • Figure 2: Results on LiveCodeBench-Medium: (a) Comparison of the pass rates of MCTS with different max_rollouts against the pass@1, pass@10, and pass@100 rates of Qwen2.5-72B-Instruct-api. (b) Comparison of the pass rates of MCTS against pass@k, with k set to the mean number of generations in the corresponding MCTS run. (c)-(d) Comparison of the pass rates of MCTS under different max_rollouts against the pass@100 rates of state-of-the-art models.
  • Figure 3: Results on LiveCodeBench-Hard: (a) Comparison of the pass rates of MCTS with different max_rollouts against the pass@1, pass@10, and pass@100 rates of Qwen2.5-72B-Instruct-api. (b) Comparison of the pass rates of MCTS against pass@k, with k set to the mean number of generations in the corresponding MCTS run. (c)-(d) Comparison of the pass rates of MCTS under different max_rollouts against the pass@100 rates of state-of-the-art models.
  • Figure 4: When different models are used as the generating model for MCTS, the pass rate of MCTS tends to increase with model capability. The pass@100 rates of DeepSeekCoder-6.7B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-32B-Instruct are 0.080, 0.099, 0.189, and 0.197, respectively. With MCTS, even with max_rollouts set to only 16, the performance of each model on LiveCodeBench-Hard significantly exceeds its own pass@100 rate.
  • Figure 5: To ensure a fair comparison with the pass@100 rates, we also recorded the average number of generations produced by MCTS across different models when max_rollouts is set to 16. For DeepSeekCoder-6.7B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct, the average number of generations is approximately 80, well below 100. For Qwen2.5-32B-Instruct, the average number of generations is even lower.
  • ...and 2 more figures