ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
TL;DR
This work tackles the difficulty of enabling hard reasoning in RL-style post-training for large language models, where positive samples are scarce and traditional GRPO-style training risks distribution-sharpening. The authors propose ExPO, a modular framework that generates in-distribution positive samples by conditioning self-explanations on the ground-truth answer, providing strong learning signals and guiding exploration. ExPO can be instantiated with Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) as ExP-DPO and ExP-GRPO, respectively; it introduces the ExP-SFT term and online/extractive strategies to keep positives in-distribution as the policy evolves. Empirically, ExPO improves learning efficiency and final performance on challenging math reasoning benchmarks, notably MATH level-5, often surpassing expert-CoT-based approaches and enabling robust reasoning in settings where prior methods fail. The framework's generality suggests broad applicability to verifiable-reward tasks beyond math, and the results advance the practical ability of models to bootstrap complex reasoning without relying on costly expert demonstrations.
Abstract
Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model's initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the model already knows (distribution-sharpening) rather than enabling the model to solve problems where it initially generates no correct solutions. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model's likelihood of predicting the correct answer. Based on these insights, we propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$, a simple and modular framework that generates such samples by conditioning on the ground-truth answer. It can be integrated with popular RL training methods like GRPO and DPO. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most. Code is available at https://github.com/HumainLab/ExPO_rl_reasoning_by_explanation .
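The two properties of an effective positive sample can be illustrated with a small sketch. This is not the authors' implementation: the `Candidate` class, the `pick_positive` helper, the selection rule, and all log-probability values are hypothetical and chosen only for illustration. The sketch assumes each candidate rationale already comes with two scores from the current policy: its own mean token log-probability, and the log-probability of the ground-truth answer once the rationale is appended to the question.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A candidate reasoning trace with two (hypothetical) scores."""
    text: str
    policy_logp: float  # mean per-token log-prob of the trace under the current policy
    answer_logp: float  # log-prob of the ground-truth answer given question + trace


def pick_positive(
    candidates: List[Candidate], baseline_answer_logp: float
) -> Optional[Candidate]:
    """Pick a positive sample using the two properties from the abstract.

    Property (2): keep only traces that raise the log-prob of the correct
    answer above the baseline obtained without any trace.
    Property (1): among those, prefer the trace most likely under the policy.
    """
    helpful = [c for c in candidates if c.answer_logp > baseline_answer_logp]
    if not helpful:
        return None
    return max(helpful, key=lambda c: c.policy_logp)


# Toy numbers, made up for illustration only.
candidates = [
    Candidate("expert-written CoT", policy_logp=-2.9, answer_logp=-0.2),
    Candidate("self-explanation conditioned on the answer", policy_logp=-0.8, answer_logp=-0.4),
    Candidate("incorrect self-generated sample", policy_logp=-0.5, answer_logp=-3.1),
]

best = pick_positive(candidates, baseline_answer_logp=-2.0)
print(best.text if best else None)
```

With these toy scores, the expert-written trace helps the answer but is unlikely under the policy, and the incorrect sample is likely but does not help the answer, so the sketch selects the self-explanation. In practice both scores would be computed from the model being trained.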
