Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

Ilia Mahrooghi; Aryo Lotfi; Emmanuel Abbe

TL;DR

Goldilocks is a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. On the OpenMathReasoning dataset, it improves the performance of models trained with standard GRPO under the same compute budget.

Abstract

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. While the student is trained with GRPO, the teacher model selects questions of appropriate difficulty for it, i.e., questions that are neither too easy nor too hard (the Goldilocks principle). By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
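
To make the Goldilocks principle concrete, the following minimal sketch shows why questions that are always solved or never solved contribute no learning signal under group-relative advantages. It assumes binary verifiable rewards and the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation). The function names and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's rollouts (assumed GRPO form)."""
    r = np.asarray(rewards, dtype=float)
    # eps (an assumption) avoids division by zero when all rewards are equal
    return (r - r.mean()) / (r.std() + eps)

def bernoulli_variance(p):
    """Reward variance of a binary reward with success probability p."""
    return p * (1.0 - p)

too_easy = group_advantages([1, 1, 1, 1])   # every rollout succeeds
too_hard = group_advantages([0, 0, 0, 0])   # every rollout fails
balanced = group_advantages([1, 0, 1, 0])   # half of the rollouts succeed
```

In this sketch the too-easy and too-hard groups yield all-zero advantages, so they produce no gradient, while the balanced group yields advantages close to plus or minus one. The variance p(1 - p) peaks at p = 0.5, which is one way to read "neither too easy nor too hard".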

Paper Structure

This paper contains 40 sections, 14 equations, 12 figures, 6 tables, and 2 algorithms.

Introduction
GRPO with Verifiable Reward
Teacher and Student Architecture
Student
Teacher
Joint Training Procedure
Data Selection Policy (Steps 1-2)
Student Rollouts and Optimization (Steps 3-4)
Teacher Refinement (Steps 5-6)
Experiments and Results
Experimental Setup
Main Results
Student Training Dynamics and Analysis
Teacher Training Dynamics and Analysis
Error Analysis on Unseen Samples
...and 25 more sections

Figures (12)

Figure 1: Overview of the Goldilocks Framework. The training cycle proceeds as follows: (1) A set of $K_{\text{candidate}}$ questions is sampled randomly from the dataset; (2) The Teacher selects the optimal prompt from this candidate pool; (3) The Student generates $G$ rollouts for the selected prompt; (4) The gradient is calculated based on GRPO advantages and accumulated for the Student update; (5) Based on the empirical variance of the rollouts, Teacher targets are computed and stored in the replay buffer; (6) The Teacher is asynchronously updated using data sampled from the replay buffer.
Figure 2: Evolution of validation accuracy over training steps.
Figure 3: Average Training Reward (Success Rate). The Goldilocks approach reaches higher training accuracy significantly earlier in training than the baseline.
Figure 4: Curriculum Mechanism. (a) The Teacher actively selects samples with higher reward variance. (b) This results in far fewer "wasted" inputs where the gradient is zero.
Figure 5: Optimization Dynamics. Goldilocks maintains larger gradient norms, preventing vanishing signals and providing a more robust optimization objective compared to the baseline.
...and 7 more figures
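
The six-step cycle described in the Figure 1 caption can be sketched as a toy simulation. This is a hedged illustration under strong simplifying assumptions, not the authors' implementation: the Teacher is a hypothetical lookup table of predicted reward variance rather than a model, the Student is simulated by a fixed per-question success probability, the Student's GRPO update is omitted, the Teacher target is taken to be the empirical reward variance itself, selection is a plain argmax, and the Teacher update runs synchronously rather than asynchronously. All names and default values (`goldilocks_cycle`, `teacher_pred`, `lr`, `batch`) are illustrative.

```python
import random
from collections import deque

def goldilocks_cycle(success_prob, teacher_pred, replay, rng,
                     k_candidate=8, g=8, lr=0.5, batch=4):
    # (1) Sample K_candidate questions at random from the dataset.
    candidates = rng.sample(range(len(success_prob)), k_candidate)
    # (2) Teacher selects the candidate with the highest predicted variance
    #     (argmax is an assumption made for this sketch).
    q = max(candidates, key=lambda i: teacher_pred[i])
    # (3) Student generates G rollouts; rewards are binary and verifiable.
    rewards = [1.0 if rng.random() < success_prob[q] else 0.0 for _ in range(g)]
    # (4) The Student's GRPO gradient step would happen here (omitted).
    # (5) Teacher target from the empirical variance of the rollouts.
    mean = sum(rewards) / g
    target = sum((r - mean) ** 2 for r in rewards) / g
    replay.append((q, target))
    # (6) Teacher update from replay-buffer samples (synchronous here).
    for i, t in rng.sample(list(replay), min(batch, len(replay))):
        teacher_pred[i] += lr * (t - teacher_pred[i])
    return q, rewards

rng = random.Random(0)
success_prob = [0.0, 0.5, 1.0] * 10          # hypothetical question difficulties
teacher_pred = [0.25] * len(success_prob)    # optimistic initial predictions
replay = deque(maxlen=64)
for _ in range(200):
    q, rewards = goldilocks_cycle(success_prob, teacher_pred, replay, rng)
```

Initializing the table optimistically at 0.25, the maximum variance of a binary reward, is a design choice of this sketch: it encourages each question to be tried before the Teacher's estimate for it is lowered.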
