SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation

Bin Xu, Yiguan Lin, Yinghao Li, Yang Gao

TL;DR

This work tackles complex code generation by enabling LLMs to generate and refine their own intermediate reasoning paths using Monte Carlo Tree Search (MCTS). It introduces SRA-MCTS, a plug-and-play data-generation pipeline in which the model self-generates and self-evaluates reasoning paths, then fine-tunes on the resulting (question, thinking, code) triples, creating a positive feedback loop for continuous improvement. Across multiple model scales, SRA-MCTS yields notable gains, especially on complex benchmarks, and remains robust even when traditional CoT methods falter. The approach demonstrates the potential of autonomous reasoning augmentation to reduce supervision needs, and the code and data are public to support research on self-improving code-generation systems.

Abstract

Large language models demonstrate exceptional performance in simple code generation tasks but still face challenges in tackling complex problems. These challenges may stem from insufficient reasoning and problem decomposition capabilities. To address this issue, we propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. This creates a positive feedback loop, enabling continuous improvement. Our method operates entirely through the model itself without requiring additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, the approach ensures analytical accuracy and enhances the success rate in solving complex tasks. Experimental results show that, even without additional supervisory signals, our method achieves performance improvements across different model scales, demonstrating the significant potential of self-improvement in small models. Furthermore, the method remains robust when traditional Chain-of-Thought (CoT) approaches exhibit performance degradation, with notable improvements observed in diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to enhance the ability of language models to address complex problems. Our code and data are public at https://github.com/DIRECT-BIT/SRA-MCTS.
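The abstract reports gains on pass@10. As a reminder of what that metric measures, the following is a minimal sketch of the commonly used unbiased pass@k estimator: given n sampled solutions of which c pass all tests, it estimates the probability that at least one of k randomly drawn samples is correct. The function name `pass_at_k` and the sample counts are illustrative assumptions; this section does not state which evaluation script the authors used.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which are correct.

    Uses the combinatorial form 1 - C(n - c, k) / C(n, k): one minus the
    probability that a random size-k subset contains no correct sample.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 20 samples per problem, 5 of them correct.
    print(pass_at_k(n=20, c=5, k=1))
    print(pass_at_k(n=20, c=5, k=10))
```

Because pass@10 only requires one success among ten draws, it rewards a model whose samples are varied enough that some of them solve the problem, which is why the abstract treats it as a diversity-oriented metric.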

Paper Structure

This paper contains 36 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The overall workflow of our method, with data generation shown at the top and training at the bottom. SRA-MCTS guides the LLM to generate thinking, which is then used by the LLM as a part of the prompt to generate the corresponding code. The question, thinking, and code are organized as training data for supervised fine-tuning.
  • Figure 2: Self-driven reasoning augmentation process of SRA-MCTS. (a) Selection: A leaf node is selected to be expanded in the next phase. (b) Expansion: A single step is generated and assigned to the node. (c) Evaluation & Reflection: The step in the node is scored and an insight is generated as reflection. (d) Backpropagation: Reward scores are propagated back. In the notation "1-1" within a node, the first "1" indicates the first step in the thinking, and the second "1" denotes the first variant of that step. The same logic applies to other nodes. Blue nodes represent the selected nodes, and red nodes represent newly expanded nodes.
  • Figure 3: The progressive scoring method. The state and action of the node are used as inputs, and the four principles are checked sequentially from left to right. If the current principle is satisfied, an integer score in the corresponding interval is output; otherwise, the next principle is evaluated. If none of the principles is satisfied, the model gives the current input the full score of 10.
  • Figure 4: Comparison of results with and without thinking in the training data. "w/o thinking" indicates that the model is trained without thinking in the training set. "C" denotes the Complex split.
  • Figure 5: Comparison results of different thinking variants. The dashed line represents the performance of SRA-MCTS.
  • ...and 1 more figure
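The four phases named in the Figure 2 caption (selection, expansion, evaluation and reflection, backpropagation) can be sketched as a small search loop over reasoning steps. This is an illustrative sketch under stated assumptions, not the authors' implementation: `Node`, `search`, `propose`, and `evaluate` are hypothetical names; the UCT selection rule with constant 1.4, the number of variants expanded per step (`width`), the depth limit, and the final choice of the path with the highest mean reward are all assumptions; and the two LLM calls are replaced by deterministic toy functions so the block runs on its own.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    step: str                          # one natural-language reasoning step
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                 # sum of backpropagated rewards
    reflection: str = ""               # insight produced during evaluation

    def path(self) -> List[str]:
        """Steps from the first reasoning step down to this node (root excluded)."""
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.step)
            node = node.parent
        return steps[::-1]


def uct(node: Node, c: float = 1.4) -> float:
    """Assumed selection score: mean reward plus an exploration bonus."""
    if node.visits == 0:
        return math.inf
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )


def search(
    question: str,
    propose: Callable[[str, List[str], int], str],
    evaluate: Callable[[str, List[str]], Tuple[float, str]],
    iterations: int = 8,
    width: int = 2,
    max_depth: int = 3,
) -> List[str]:
    root = Node(step=question)
    for _ in range(iterations):
        # (a) Selection: descend to a leaf by following the best UCT score.
        leaf = root
        while leaf.children:
            leaf = max(leaf.children, key=uct)
        # (b) Expansion: generate `width` variants of the next step.
        if len(leaf.path()) < max_depth:
            for variant in range(1, width + 1):
                step = propose(question, leaf.path(), variant)
                leaf.children.append(Node(step=step, parent=leaf))
            targets = leaf.children
        else:
            targets = [leaf]           # depth limit reached: re-score the leaf
        for node in targets:
            # (c) Evaluation & Reflection: score the step, keep the insight.
            reward, node.reflection = evaluate(question, node.path())
            # (d) Backpropagation: push the reward up to the root.
            current: Optional[Node] = node
            while current is not None:
                current.visits += 1
                current.value += reward
                current = current.parent
    # Read out the thinking by following the highest mean reward.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.value / n.visits)
    return node.path()


def toy_propose(question: str, steps: List[str], variant: int) -> str:
    """Stand-in for the LLM step generator; labels follow the "step-variant" notation."""
    return f"{len(steps) + 1}-{variant}"


def toy_evaluate(question: str, steps: List[str]) -> Tuple[float, str]:
    """Stand-in for LLM self-evaluation: arbitrarily prefers variant 2 at every step."""
    score = sum(10.0 if s.endswith("-2") else 4.0 for s in steps) / len(steps)
    return score, f"insight after {len(steps)} step(s)"


if __name__ == "__main__":
    print(search("toy question", toy_propose, toy_evaluate))
```

In a real pipeline, `propose` and `evaluate` would both prompt the same model, and the returned thinking would be paired with the question and the generated code to form one training example, as the Figure 1 caption describes.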