Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models
Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta
TL;DR
The paper tackles the high compute cost of reinforcement-learning-based fine-tuning for large reasoning models by proposing predictive scaling laws for GRPO training. It couples LoRA with GRPO on quantized Llama and Qwen models, deriving a reward progression law that reveals three universal training phases and strong potential for early stopping. The results show that much of the training yield is captured early, enabling efficient, resource-friendly fine-tuning without sacrificing final performance, and that scaling effects dominate over architectural differences. These insights provide practical guidance for model selection, scheduling, and resource planning in GRPO-based reasoning tasks, broadening access to high-performance reasoning capabilities.
Abstract
Fine-tuning large language models (LLMs) for reasoning tasks using reinforcement learning methods like Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B–8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond a certain number of epochs offers little gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.
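The three phases and the early-stopping idea can be illustrated with a small sketch. It does not reproduce the paper's scaling law, whose functional form and fitted constants are not given in the abstract. It assumes a generic logistic curve, which happens to show a slow start, a rapid rise, and a plateau, and every name and parameter value below (`logistic_reward`, `early_stop_point`, the start and end rewards, steepness, midpoint) is hypothetical.

```python
import math


def logistic_reward(t, r_start, r_end, steepness, midpoint):
    """Illustrative reward at normalized training progress t in [0, 1].

    Assumption: a logistic shape stands in for the paper's law, which is
    not specified in the abstract.
    """
    return r_start + (r_end - r_start) / (1.0 + math.exp(-steepness * (t - midpoint)))


def early_stop_point(curve, gain_fraction=0.95, grid=1000):
    """Smallest grid point t at which the curve has captured
    gain_fraction of its total gain over the interval [0, 1]."""
    first, last = curve(0.0), curve(1.0)
    target = first + gain_fraction * (last - first)
    for i in range(grid + 1):
        t = i / grid
        if curve(t) >= target:
            return t
    return 1.0


# Hypothetical parameters chosen only to make the three phases visible.
def curve(t):
    return logistic_reward(t, r_start=0.2, r_end=0.8, steepness=12.0, midpoint=0.4)


stop_at = early_stop_point(curve)
print(f"95% of the total reward gain is reached at progress t = {stop_at:.3f}")
```

With these made-up parameters most of the gain arrives well before the end of training, which is the kind of behavior that motivates stopping early. A curve fitted to real reward logs would replace the hard-coded values.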
