Seed-CTS: Unleashing the Power of Tree Search for Superior Performance in Competitive Coding Tasks
Hao Wang, Boyi Liu, Yufeng Zhang, Jie Chen
TL;DR
Seed-CTS introduces a token-level Monte Carlo Tree Search (MCTS) framework integrated with Chain-of-Thought (CoT) prompting to boost competition-level code generation using open-source LLMs. By applying a P-UCB-guided search with top-K expansion, hard/partial reward simulations, and backpropagation, the method significantly improves pass@k on LiveCodeBench-Medium and Hard, with CoT prompting yielding near-proprietary-model performance in several settings (e.g., pass@100 of 0.351 on Hard for Qwen2.5-Coder-32B-Instruct). The approach is model-agnostic, is efficient (few generations per problem), shows potential to synthesize high-quality SFT data directly from the target model, and achieves competitive results on CodeContest-Test. These findings suggest that token-level search combined with structured reasoning can substantially elevate open-source models on challenging code-generation tasks and reduce reliance on large black-box LLMs.
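The search phases named above (P-UCB selection, top-K expansion, reward computation, backpropagation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The P-UCB expression follows one common formulation from earlier planning-guided decoding work. The `Node` class, the constants `c_base` and `c`, and the max-value backup are hypothetical choices. The reward definitions (hard = all tests pass, partial = fraction of tests passed) are an assumed reading of "hard/partial reward".

```python
import math
from typing import List, Optional, Tuple


class Node:
    """One token in the search tree (hypothetical structure)."""

    def __init__(self, token: str, prior: float, parent: Optional["Node"] = None):
        self.token = token
        self.prior = prior            # model probability of this token
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value = 0.0              # best reward seen through this node


def p_ucb(parent: Node, child: Node, c_base: float = 10.0, c: float = 4.0) -> float:
    """Assumed P-UCB form: child value plus a prior-weighted exploration bonus."""
    beta = math.log((parent.visits + c_base + 1) / c_base) + c
    n = max(parent.visits, 1)         # guard against log(0) on an unvisited parent
    bonus = beta * child.prior * math.sqrt(math.log(n)) / (1 + child.visits)
    return child.value + bonus


def select(root: Node) -> Node:
    """Descend from the root, taking the highest-scoring child at each level."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: p_ucb(node, ch))
    return node


def expand(node: Node, top_k: List[Tuple[str, float]]) -> None:
    """Attach the K most likely next tokens, given as (token, probability)."""
    for token, prob in top_k:
        node.children.append(Node(token, prob, parent=node))


def reward(passed: int, total: int, mode: str = "hard") -> float:
    """Hard: 1.0 only if every test passes. Partial: fraction of tests passed."""
    if mode == "hard":
        return 1.0 if passed == total else 0.0
    return passed / total


def backpropagate(leaf: Node, r: float) -> None:
    """Update visit counts and keep the maximum reward along the path to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value = max(node.value, r)
        node = node.parent
```

In a full system, the simulation step would have the LLM complete the program from the selected token prefix and execute it against test cases to obtain `passed` and `total`. That part is omitted here because it depends on the model and the execution sandbox.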
Abstract
Competition-level code generation tasks pose significant challenges for current state-of-the-art large language models (LLMs). For example, on the LiveCodeBench-Hard dataset, models such as O1-Mini and O1-Preview achieve pass@1 rates of only 0.366 and 0.143, respectively. While tree search techniques have proven effective in domains like mathematics and general coding, their potential in competition-level code generation remains under-explored. In this work, we propose a novel token-level tree search method specifically designed for code generation. Leveraging Qwen2.5-Coder-32B-Instruct, our approach achieves a pass rate of 0.305 on LiveCodeBench-Hard, surpassing the pass@100 performance of GPT4o-0513 (0.245). Furthermore, by integrating Chain-of-Thought (CoT) prompting, we improve our method's performance to 0.351, approaching O1-Mini's pass@1 rate. To ensure reproducibility, we report the average number of generations required per problem by our tree search method on the test set. Our findings underscore the potential of tree search to significantly enhance performance on competition-level code generation tasks. This opens up new possibilities for large-scale synthesis of supervised fine-tuning (SFT) data for challenging code problems, advancing competition-level code generation.
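The pass@k figures quoted above are commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass all tests. The abstract does not say which estimator was used, so the sketch below is only an assumed illustration of the metric. The function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that pass, k: budget (k <= n).
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all problems.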
