Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models

Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta

TL;DR

The paper tackles the high compute cost of reinforcement-based fine-tuning for large reasoning models by proposing predictive scaling laws for GRPO training. It couples LoRA with GRPO on quantized Llama and Qwen models, deriving a reward progression law that reveals three universal training phases and strong potential for early stopping. The results show that much of the training yield is captured early, enabling efficient resource-friendly fine-tuning without sacrificing final performance, and that scaling effects dominate over architectural differences. These insights provide practical guidance for model selection, scheduling, and resource planning in GRPO-based reasoning tasks, broadening access to high-performance reasoning capabilities.

Abstract

Fine-tuning large language models (LLMs) for reasoning tasks using reinforcement learning methods like Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B to 8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond a certain number of epochs offers little gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.

Paper Structure

This paper contains 17 sections, 23 equations, 1 figure.

Figures (1)

  • Figure 1: GRPO training reward convergence across all model configurations. All four models exhibit consistent sigmoid-shaped learning curves with similar phase transitions despite differing parameter counts and architectures.
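
The sigmoid shape and the three phases described above can be illustrated with a small sketch. This is not the paper's fitted law, whose functional form and coefficients are not given in this summary. It assumes a plain logistic curve over normalized training progress; the names `sigmoid_reward`, `early_stop_point`, and `phase`, the parameters `k` and `t_mid`, and the 10%/90% phase thresholds are all hypothetical choices made for illustration.

```python
import math

def gain_fraction(t, k, t_mid):
    """Fraction of the total reward gain reached at progress t (logistic, assumed form)."""
    return 1.0 / (1.0 + math.exp(-k * (t - t_mid)))

def sigmoid_reward(t, r0, r_max, k, t_mid):
    """Hypothetical reward curve: tends to r0 for small t and to r_max for large t."""
    return r0 + (r_max - r0) * gain_fraction(t, k, t_mid)

def early_stop_point(k, t_mid, fraction=0.95):
    """Progress t at which `fraction` of the total gain (r_max - r0) has been captured."""
    # Invert the logistic: 1 / (1 + exp(-k (t - t_mid))) = fraction.
    return t_mid + math.log(fraction / (1.0 - fraction)) / k

def phase(t, k, t_mid, low=0.1, high=0.9):
    """Label a point on the curve; the low/high thresholds are illustrative assumptions."""
    g = gain_fraction(t, k, t_mid)
    if g < low:
        return "slow start"
    if g > high:
        return "plateau"
    return "rapid improvement"

# Made-up parameters, not values from the paper.
t_stop = early_stop_point(k=10.0, t_mid=0.4, fraction=0.95)
reward_at_stop = sigmoid_reward(t_stop, r0=0.2, r_max=0.8, k=10.0, t_mid=0.4)
```

Under these assumed parameters, the stopping point falls well before the end of training while the reward is already within 5% of the total gain, which is the kind of early-stopping argument the abstract makes. In practice the curve parameters would have to be fitted to observed reward logs.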