Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning

Huchen Jiang, Yangyang Ma, Chaofan Ding, Kexin Luan, Xinhan Di

TL;DR

The paper addresses improving stepwise reasoning in LLMs by integrating intrinsic self-correction with Monte Carlo Tree Search in a two-stage reinforcement learning framework. Stage I trains a self-correcting LLM using self-generated data and an oracle reward to improve correctness. Stage II combines outer-loop and inner-loop reinforcement learning with step-level MCTS to enhance verification along reasoning paths, using four phases: Select, Expand, Enhanced-Self-Verify, and Backup. Empirically, the approach yields consistent accuracy gains on GSM8K and MATH across multiple baselines, demonstrating the practical potential for enhancing arithmetic reasoning in LLMs.
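The four phases named above can be illustrated with a minimal step-level MCTS loop. This is a sketch under stated assumptions, not the paper's implementation: the UCT selection rule, the constant `c`, and the `propose` and `enhanced_self_verify` callbacks are hypothetical stand-ins, and the toy task (reach a sum of 3 with steps of 1 or 2) only replaces a real reasoning problem for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # the reasoning steps taken so far


@dataclass
class Node:
    state: State
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        """Mean verification score backed up through this node."""
        return self.value_sum / self.visits if self.visits else 0.0


def select(root: Node, c: float = 1.4) -> Node:
    """Select: descend by UCT (an assumed rule) until a leaf or unvisited child."""
    node = root
    while node.children:
        unvisited = [ch for ch in node.children if ch.visits == 0]
        if unvisited:
            return unvisited[0]
        parent = node
        node = max(
            parent.children,
            key=lambda ch: ch.q + c * math.sqrt(math.log(parent.visits) / ch.visits),
        )
    return node


def expand(node: Node, propose: Callable[[State], List[int]]) -> Node:
    """Expand: add one child per proposed next step; terminal nodes stay as-is."""
    if not node.children:
        for step in propose(node.state):
            node.children.append(Node(state=node.state + (step,), parent=node))
    return node.children[0] if node.children else node


def backup(node: Optional[Node], score: float) -> None:
    """Backup: propagate the verification score from the node up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += score
        node = node.parent


def search(root_state: State, propose, enhanced_self_verify, n_iter: int = 50) -> Node:
    root = Node(state=root_state)
    for _ in range(n_iter):
        leaf = select(root)
        child = expand(leaf, propose)
        # Enhanced-Self-Verify: a hypothetical callback standing in for the
        # Stage I self-correcting policy scoring the partial reasoning path.
        score = enhanced_self_verify(child.state)
        backup(child, score)
    return root


# Toy stand-ins (assumptions for illustration only).
def toy_propose(state: State) -> List[int]:
    return [1, 2] if sum(state) < 3 else []


def toy_verify(state: State) -> float:
    total = sum(state)
    if total == 3:
        return 1.0  # correct final answer
    return 0.0 if total > 3 else 0.5  # overshoot vs. undecided partial path


root = search((), toy_propose, toy_verify, n_iter=50)
```

After the search, each child of `root` carries a visit count and a mean score `q`. A step-level preference learner could, in principle, build preferred and dispreferred step pairs from these statistics.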

Abstract

Building on current state-of-the-art approaches that enhance the reasoning capabilities of Large Language Models (LLMs) through iterative preference learning inspired by AlphaZero, we propose to further enhance step-wise reasoning capabilities, to some extent, through intrinsic self-correction. Our work leverages step-wise preference learning to enhance self-verification via reinforcement learning. We conduct our work through a two-stage training procedure. In the first stage, the self-correction reasoning ability of an LLM is enhanced through its own predictions, relying entirely on self-generated data. In the second stage, the baseline step-wise preference learning is applied using the enhanced self-correction policy obtained in the first stage. On arithmetic reasoning tasks, our approach outperforms OpenMath2-Llama3.1-8B and dart-math-mistral-7b-uniform on MATH, raising accuracy to 71.34% (+4.18%) and 48.06% (+4.94%) respectively, and outperforms Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.1 on GSM8K, raising accuracy to 86.76% (+2.00%) and 38.06% (+2.28%) respectively.

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of Towards Intrinsic Self-Correction Enhancement via Iterative Preference Learning. The framework trains two policies: the self-correctness-policy in the inner-loop reinforcement learning and the outer-loop-policy in the outer-loop reinforcement learning. The purple box denotes the policy learned in the first stage. The pink box denotes the policy learned in the second stage.
  • Figure 2: Towards an intrinsic Self-Correct LLM in the Inner Loop (Stage I). The green box denotes the input prompt for the LLM in the first stage. The orange box denotes the response of the first attempt, given the prompt as input. For the second attempt, the LLM receives the response of the first attempt together with the green prompt as input and produces the response of the second attempt (orange box).
  • Figure 3: Step-wise Iterative Preference Learning in the Outer Loop (Stage II). The red circle denotes a terminal node and the square denotes an intermediate node. The orange box denotes the policy learned in the first stage. The Monte Carlo tree and the step-level preference learning are the two parts of the iterative preference learning boosted by MCTS.
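
The two-attempt flow described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: `generate` and `oracle_reward` are hypothetical callables standing in for the LLM and the oracle reward, and the wording of `revise_instruction` and the way the second input is assembled are invented here, not taken from the paper.

```python
from typing import Callable, NamedTuple


class Rollout(NamedTuple):
    first: str
    second: str
    reward_first: float
    reward_second: float


def two_attempt_rollout(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical LLM call
    oracle_reward: Callable[[str], float],   # hypothetical correctness oracle
    revise_instruction: str = "Review the previous attempt and give a corrected final answer.",
) -> Rollout:
    # First attempt: the model sees only the prompt.
    first = generate(prompt)
    # Second attempt: the model sees the prompt together with its first response.
    second_input = f"{prompt}\n\nPrevious attempt:\n{first}\n\n{revise_instruction}"
    second = generate(second_input)
    return Rollout(first, second, oracle_reward(first), oracle_reward(second))


# Toy stand-ins (assumptions for illustration only).
def toy_generate(text: str) -> str:
    # Answers wrongly at first, then corrects itself once shown a previous attempt.
    return "5" if "Previous attempt" in text else "4"


def toy_oracle(response: str) -> float:
    return 1.0 if response.strip() == "5" else 0.0


rollout = two_attempt_rollout("What is 2 + 3?", toy_generate, toy_oracle)
```

In this toy run the first attempt is scored 0.0 and the second 1.0. A Stage I training loop would use such rewards on self-generated attempts as its learning signal.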