JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu

TL;DR

This paper asks whether the complexity of recent reinforcement learning pipelines is necessary for small LLMs. It shows that a minimal, single-stage RL recipe reaches state-of-the-art performance on two $1.5$B reasoning models, with $54.9\%$ and $64.3\%$ average accuracy across nine mathematical benchmarks. JustRL uses GRPO with a lightweight verifier and fixed hyperparameters, needs roughly half the compute of multi-stage baselines, and trains stably and monotonically. On both DeepSeek- and Nemotron-based backbones, JustRL matches or surpasses more complex methods, and ablations show that some standard tricks can reduce performance by collapsing exploration. The work advocates reexamining RL baselines for small models and suggests that scaling a simple approach may suffice before adding complexity.

Abstract

Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: \textbf{Is this complexity necessary?} We present \textbf{JustRL}, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9\% and 64.3\% average accuracy across nine mathematical benchmarks) while using 2$\times$ less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding ``standard tricks'' like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
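As a rough illustration of the GRPO step named in the TL;DR, the sketch below computes group-relative advantages in the standard GRPO form: each sampled response's reward is normalized by the mean and standard deviation of the rewards in its group. The function name, the `eps` constant, and the binary example rewards are assumptions made for illustration. This summary does not state the paper's exact reward values or normalization details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses.

    rewards: one scalar per sampled response, e.g. from a rule-based verifier.
    eps: small constant (an assumption here) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses: two judged correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all responses in a group receive the same reward, every advantage is zero, so that prompt contributes no gradient signal under this formulation.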

Paper Structure

This paper contains 12 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: JustRL achieves substantial performance gains through simple, single-stage training. (a) The AIME24 (avg@32) performance curve for scaling from DeepSeek-R1-Distill-Qwen-1.5B into JustRL-DeepSeek-1.5B, from 28% to 58% over 4,000 steps; (b) from OpenMath-Nemotron-1.5B into our 1.5B reasoning SOTA model JustRL-Nemotron-1.5B, showing its training journey to the final 70+% score over 3,000 steps.
  • Figure 2: Training Dynamics of JustRL-DeepSeek-1.5B. (a) Policy entropy remains stable throughout training, oscillating naturally around 1.2-1.4 without drift or collapse. (b) Mean reward shows smooth, monotonic improvement from negative to $\sim$0.4, indicating consistent learning without plateau-breaking interventions. (c) Response length naturally converges from initial verbosity ($\sim$7,000 tokens) to a stable range (4,000-5,000 tokens) with 16k max context length, without explicit length penalties.
  • Figure 3: Ablation Study Results. (a) AIME 2024 performance diverges after $\sim$2,000 steps. Our base recipe reaches 55%, while adding an overlong penalty plateaus at 50%, and adding both modifications plateaus at 45%. (b) Entropy: Both modifications show collapsed exploration (entropy $\sim$0.5-0.6) compared to healthy oscillation in the base recipe ($\sim$1.2-1.4).
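
The two quantities tracked in these figures can be sketched as follows. The code assumes avg@k means per-problem accuracy averaged over k sampled responses and then over problems, and that policy entropy is the Shannon entropy (in nats) of the next-token distribution averaged over positions. All function names and inputs are hypothetical and are not taken from the paper's code.

```python
import math


def avg_at_k(correct):
    """correct: one list per problem, holding a boolean per sampled response."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)


def token_entropy(probs):
    """Shannon entropy in nats of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_policy_entropy(distributions):
    """Average entropy over a sequence of next-token distributions."""
    return sum(token_entropy(p) for p in distributions) / len(distributions)


# Hypothetical example: two problems with two sampled responses each.
score = avg_at_k([[True, False], [True, True]])
```

Under this reading, a drop in mean entropy such as the one described for Figure 3(b) indicates that the policy is concentrating probability mass on fewer tokens, which is the sense in which exploration collapses.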