ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi

TL;DR

This work tackles the difficulty of enabling hard reasoning in RL-style post-training for large language models, where positive samples are scarce and GRPO-style training risks merely sharpening the model's existing output distribution. The authors propose ExPO, a modular framework that generates in-distribution positive samples by conditioning self-explanations on the ground-truth answer, providing strong learning signals and guiding exploration. ExPO can be instantiated with Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) as ExP-DPO and ExP-GRPO, respectively; it introduces the ExP-SFT term and online/extractive strategies to keep positives in-distribution as the policy evolves. Empirically, ExPO improves learning efficiency and final performance on challenging math reasoning benchmarks, notably MATH level-5, often surpassing expert-CoT-based approaches and enabling reasoning in settings where prior methods fail. The framework’s generality suggests applicability to verifiable-reward tasks beyond math, and the results suggest that models can bootstrap complex reasoning without relying on costly expert demonstrations.
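To make the "scarce positive samples" problem concrete, here is a minimal sketch of the group-normalized advantage commonly used to describe GRPO (reward minus group mean, divided by group standard deviation). The function name, the `eps` constant, and the 0/1 rewards are illustrative assumptions, not taken from the paper's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same question.

    Assumption: this is the commonly described GRPO-style normalization;
    the exact objective used by the authors may differ in detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A question the model sometimes solves: the one correct sample gets a
# positive advantage, the incorrect ones get negative advantages.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

# A hard question where no sample is correct: every advantage is zero,
# so this group contributes no learning signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every sampled answer is wrong, the group carries no gradient signal under this normalization, which is the setting where an externally supplied positive sample becomes necessary.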

Abstract

Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model's initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the model already knows (distribution-sharpening) rather than enabling the model to solve problems where it initially generates no correct solutions. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model's likelihood of predicting the correct answer. Based on these insights, we propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$, a simple and modular framework that generates such samples by conditioning on the ground-truth answer. It can be integrated with popular RL training methods like GRPO and DPO. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most. Code is available at https://github.com/HumainLab/ExPO_rl_reasoning_by_explanation.
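The two properties of an effective positive sample can be read as a simple filter. The sketch below is one hedged interpretation: the `Candidate` fields, the threshold `max_avg_nll`, and all numbers are made up for illustration and are not measurements or code from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    # Per-token negative log-likelihood of the reasoning under the current
    # policy; lower means the text is more in-distribution (property 1).
    avg_nll: float
    # log p(correct answer | question, reasoning) under the current policy;
    # higher means the reasoning supports the right answer (property 2).
    answer_logprob: float

def satisfies_both_properties(cand, baseline, max_avg_nll):
    """Hypothetical check: in-distribution AND more helpful than the baseline."""
    in_distribution = cand.avg_nll <= max_avg_nll
    helps_answer = cand.answer_logprob > baseline.answer_logprob
    return in_distribution and helps_answer

# Toy values only (assumed, not from the paper).
self_cot = Candidate("self-generated CoT (incorrect)", avg_nll=0.4, answer_logprob=-6.0)
expert_cot = Candidate("expert CoT", avg_nll=1.8, answer_logprob=-0.5)
self_explanation = Candidate("self-explanation given the answer", avg_nll=0.6, answer_logprob=-0.9)

expert_ok = satisfies_both_properties(expert_cot, self_cot, max_avg_nll=1.0)
explanation_ok = satisfies_both_properties(self_explanation, self_cot, max_avg_nll=1.0)
```

In this toy setup the expert CoT fails property (1) because it is unlikely under the policy, while the self-explanation passes both checks, mirroring the qualitative argument in the abstract.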

Paper Structure

This paper contains 29 sections, 2 theorems, 25 equations, 7 figures, 6 tables.

Key Result

Lemma 1

Let $\mathcal{T} = \{(q_j, c_j, a_j)\}_{j=1}^L$ be a finite set, and let the policy be a softmax policy over this set, i.e., $\pi_j := \pi_\theta(c_j, a_j|q_j) = {\exp(z_j)}/{\sum_{l=1}^L \exp(z_l)}$, where $z_j := f_\theta(q_j, c_j, a_j)$. Assume the following conditions hold: (1) For all $j \
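The softmax parameterization in the lemma's setup can be illustrated numerically. This is a minimal sketch with arbitrary logits standing in for $z_j = f_\theta(q_j, c_j, a_j)$; the function name and values are assumptions for illustration and say nothing about the lemma's conditions or conclusion.

```python
import math

def softmax_policy(logits):
    """Return pi_j = exp(z_j) / sum_l exp(z_l) for a finite list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary logits for three (question, CoT, answer) triples.
logits = [2.0, 0.5, -1.0]
probs = softmax_policy(logits)

# Adding a constant to every logit leaves the policy unchanged.
shifted = softmax_policy([z + 3.0 for z in logits])
```

Because the probabilities share one normalizer, raising one logit necessarily lowers the probability of every other element of the set.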

Figures (7)

  • Figure 1: Illustration of the problem and our proposed solution, ExPO. Left: bar plots of models (base model: Qwen2.5-3B-Instruct) evaluated on the MATH dataset highlight the issue with GRPO-style methods: they primarily strengthen the model’s existing capabilities rather than enabling new ones. Right: positive and negative samples of both GRPO and our proposed ExPO method for MATH level-4 (top) and level-5 (bottom). ExPO is more effective than GRPO in guiding the model to learn hard reasoning tasks.
  • Figure 2: Left: Negative log-likelihood of the self-explanation $\tilde{\bm{c}}$ and expert CoT. Right: Win rate, labelled by GPT-4o, in terms of the number of correct steps of the self-explanation $\tilde{\bm{c}}$ and self-generated CoT $\bm{c}$. Both are computed on the test split of each dataset using Qwen2.5-3B-Instruct.
  • Figure 3: Accuracy on the level-5 questions from the MATH test set (left 1, 2) and on the whole test set (right 3, 4) for Qwen2.5-3B-Instruct and LLaMA-3.2-3B-Instruct across global training steps. ExP-GRPO consistently outperforms both GRPO and GRPO SFT-GT-CoT, the latter of which uses supervised fine-tuning on expert CoT $\bm{c}_E$. The results show that ExP-GRPO provides more effective and generalizable learning signals, leading to improved sample efficiency and higher overall performance.
  • Figure 4: We compare ExP-DPO performance on the MATH dataset across training steps for two base models: Qwen2.5-3B-Instruct and Llama3.2-3B-Instruct. Online ExP-DPO consistently outperforms its offline counterpart, confirming that updating the explanation-based positive samples improves learning efficiency and final accuracy. Qwen2.5 shows higher sample efficiency and peak accuracy than Llama3.2 under both settings.
  • Figure 5: ExP-DPO performance on the GSM8K dataset for Qwen2.5-3B-Instruct and Llama3.2-3B-Instruct models. Online ExP-DPO achieves stronger performance and faster convergence compared to the offline setting. Qwen2.5 benefits more from the online explanation updates, attaining over 85% accuracy, while Llama3.2 saturates earlier.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2