JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu
TL;DR
This paper asks whether the growing complexity of reinforcement learning pipelines is necessary for small LLMs. It shows that a minimal, single-stage RL recipe reaches state-of-the-art performance on two 1.5B reasoning models, with $54.9\%$ and $64.3\%$ average accuracy across nine mathematical benchmarks. The JustRL approach uses GRPO with a lightweight verifier and fixed hyperparameters, and it needs roughly half the compute of multi-stage baselines while training remains stable and monotonic. On both DeepSeek- and Nemotron-based backbones, JustRL matches or surpasses more complex methods, and ablations show that some standard tricks, such as explicit length penalties and robust verifiers, can reduce performance by collapsing exploration. The work advocates reexamining RL baselines for small models and suggests that scaling a simple approach may suffice before adding complexity.
Abstract
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: \textbf{Is this complexity necessary?} We present \textbf{JustRL}, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9\% and 64.3\% average accuracy across nine mathematical benchmarks) while using 2$\times$ less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding ``standard tricks'' like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
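As an illustration of the ingredients named in the TL;DR (GRPO with a lightweight verifier), the following is a minimal sketch of group-relative advantage computation, not the authors' released implementation. It assumes a binary exact-match reward and population-standard-deviation normalization within each group of sampled responses; the function names `binary_reward` and `group_relative_advantages`, the `eps` constant, and the toy answers are all hypothetical.

```python
import math
from typing import List


def binary_reward(predicted: str, reference: str) -> float:
    """Hypothetical lightweight verifier: exact string match after trimming whitespace."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses sampled for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: four sampled final answers to one prompt whose reference answer is "42".
answers = ["42", "41", " 42 ", "7"]
rewards = [binary_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

In this sketch, correct responses receive positive advantages and incorrect ones negative advantages, and a group whose responses are all correct or all incorrect yields zero advantages and so contributes no learning signal under this normalization.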
