Reinforcement Learning is all You Need
Yongsheng Lian
TL;DR
The paper investigates whether reinforcement-learning-only training can improve reasoning in a large language model, using the Countdown Game as the sole training signal. Using rule-based rewards and Group Relative Policy Optimization (GRPO), the authors report improved generalization on four of five benchmarks and reproduce emergent "aha moments", while showing that longer responses do not necessarily mean better reasoning. The study analyzes early formatting violations, human-like thinking, and the relationship between reasoning and final accuracy, and discusses the limitations of its evaluation and reward signals. The findings suggest that RL-only training can meaningfully enhance numerical reasoning, and they point to future work on reward structure and robust evaluation of emergent reasoning capabilities.
Abstract
Inspired by the success of DeepSeek R1 in reasoning via reinforcement learning without human feedback, we train a 3B language model using the Countdown Game with pure reinforcement learning. Our model outperforms baselines on four of five benchmarks, demonstrating improved generalization beyond its training data. Notably, response length does not correlate with reasoning quality, and while "aha moments" emerge, they do not always yield correct answers. These findings highlight the potential of RL-only training for reasoning enhancement and suggest future work on refining reward structures to bridge emergent insights with accuracy.
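To make the training signal concrete, the following is a minimal sketch of what a rule-based reward for the Countdown Game and a group-relative advantage in the style of GRPO could look like. It is not the authors' implementation. The `<answer>` tag convention, the reward values `FORMAT_REWARD` and `CORRECT_REWARD`, the rule that each number is used exactly once, and the function names are all assumptions made for illustration. Only the advantage normalization step of GRPO is shown; the clipped policy objective and KL term are omitted.

```python
import ast
import operator
import re
from collections import Counter
from statistics import mean, pstdev

# Hypothetical reward values; this section of the paper does not state them.
FORMAT_REWARD = 0.1
CORRECT_REWARD = 1.0

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr):
    """Evaluate an expression built only from integers and + - * /."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def countdown_reward(response, numbers, target):
    """Score one response: 0.0 without an answer tag, a small format
    reward for a well-formed but wrong answer, full reward if correct."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    expr = match.group(1).strip()
    # Assumed rule: every given number is used exactly once.
    used = [int(tok) for tok in re.findall(r"\d+", expr)]
    if Counter(used) != Counter(numbers):
        return FORMAT_REWARD
    try:
        value = safe_eval(expr)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return FORMAT_REWARD
    return CORRECT_REWARD if abs(value - target) < 1e-6 else FORMAT_REWARD


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, several responses sampled for the same prompt are each scored by `countdown_reward`, and `group_relative_advantages` then ranks them against one another, so no learned reward model or value network is needed. Note that a reward of this kind checks only the final expression, which is consistent with the abstract's observation that an "aha moment" in the reasoning does not guarantee a correct answer.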
