Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning

Huchen Jiang; Yangyang Ma; Chaofan Ding; Kexin Luan; Xinhan Di

TL;DR

The paper addresses improving stepwise reasoning in LLMs by integrating intrinsic self-correction with Monte Carlo Tree Search in a two-stage reinforcement learning framework. Stage I trains a self-correcting LLM using self-generated data and an oracle reward to improve correctness. Stage II combines outer-loop and inner-loop reinforcement learning with step-level MCTS to enhance verification along reasoning paths, using four phases: Select, Expand, Enhanced-Self-Verify, and Backup. Empirically, the approach yields consistent accuracy gains on GSM8K and MATH across multiple baselines, demonstrating the practical potential for enhancing arithmetic reasoning in LLMs.
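The four phases named above can be illustrated with a minimal, generic step-level MCTS loop. This is a sketch under stated assumptions, not the authors' implementation: the `propose` and `verify` callables, the UCT constant, and the toy arithmetic task are all hypothetical stand-ins. In the paper's setting, `propose` would correspond to the LLM generating candidate next reasoning steps, and `verify` to the enhanced self-verification supplied by the Stage I self-correcting policy.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Path = Tuple[int, ...]  # a partial reasoning path; steps are ints in this toy


@dataclass
class Node:
    steps: Path
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        # Mean reward backed up through this node (0.0 if unvisited).
        return self.value_sum / self.visits if self.visits else 0.0


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    # Standard UCT score; unvisited children are tried first.
    if child.visits == 0:
        return math.inf
    return child.q + c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)


def select(root: Node) -> Node:
    # Phase 1 (Select): descend by UCT until reaching a node without children.
    node = root
    while node.children:
        parent_visits = node.visits
        node = max(node.children, key=lambda ch: uct(ch, parent_visits))
    return node


def expand(node: Node, propose: Callable[[Path], List[int]], max_depth: int) -> List[Node]:
    # Phase 2 (Expand): add one child per candidate next step, unless terminal.
    if len(node.steps) >= max_depth:
        return []
    node.children = [Node(node.steps + (s,), parent=node) for s in propose(node.steps)]
    return node.children


def backup(node: Optional[Node], reward: float) -> None:
    # Phase 4 (Backup): propagate the reward from the evaluated node to the root.
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent


def search(propose: Callable[[Path], List[int]],
           verify: Callable[[Path], float],
           n_iterations: int = 200,
           max_depth: int = 3) -> Node:
    root = Node(steps=())
    for _ in range(n_iterations):
        leaf = select(root)
        children = expand(leaf, propose, max_depth)
        target = children[0] if children else leaf
        # Phase 3 (Self-Verify): hypothetical stand-in for the model scoring
        # its own partial reasoning path; here it is a plain function.
        reward = verify(target.steps)
        backup(target, reward)
    return root


def best_path(root: Node) -> Path:
    # Read out a path by following the most-visited child at each level.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.steps


# Toy task (assumption for illustration only): pick three steps from {1, 2, 3}
# whose sum is as close as possible to TARGET.
TARGET = 9


def toy_propose(steps: Path) -> List[int]:
    return [1, 2, 3]


def toy_verify(steps: Path) -> float:
    return max(0.0, 1.0 - abs(TARGET - sum(steps)) / TARGET)


if __name__ == "__main__":
    tree = search(toy_propose, toy_verify, n_iterations=200, max_depth=3)
    print(best_path(tree), tree.visits)
```

The step-level granularity is the point of the design: each node is a partial reasoning path, so the backed-up values give a per-step signal from which preferences between sibling steps can be derived, rather than a single score for the whole solution.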

Abstract

Current state-of-the-art approaches enhance the reasoning capabilities of Large Language Models (LLMs) through iterative preference learning inspired by AlphaZero. We propose to further enhance step-wise reasoning through intrinsic self-correction. Our work leverages step-wise preference learning to enhance self-verification via reinforcement learning, using a two-stage training procedure. In the first stage, the self-correction ability of an LLM is enhanced through its own predictions, relying entirely on self-generated data, so that the self-correction is intrinsic to some extent. In the second stage, the baseline step-wise preference learning is applied together with the enhanced self-correcting policy obtained in the first stage. On arithmetic reasoning tasks, our approach outperforms OpenMath2-Llama3.1-8B and dart-math-mistral-7b-uniform on MATH, raising accuracy to 71.34% (+4.18%) and 48.06% (+4.94%) respectively, and outperforms Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.1 on GSM8K, raising accuracy to 86.76% (+2.00%) and 38.06% (+2.28%) respectively.
