Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

Ilia Mahrooghi, Aryo Lotfi, Emmanuel Abbe

TL;DR

Goldilocks is a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. On the OpenMathReasoning dataset, it improves the performance of models trained with standard GRPO under the same compute budget.

Abstract

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. While the student is trained with GRPO, the teacher model selects questions of appropriate difficulty for it, i.e., questions that are neither too easy nor too hard (the Goldilocks principle). By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.

Paper Structure

This paper contains 40 sections, 14 equations, 12 figures, 6 tables, and 2 algorithms.

Figures (12)

  • Figure 1: Overview of the Goldilocks Framework. The training cycle proceeds as follows: (1) A set of $K_{\text{candidate}}$ questions is sampled randomly from the dataset; (2) The Teacher selects the optimal prompt from this candidate pool; (3) The Student generates $G$ rollouts for the selected prompt; (4) The gradient is calculated based on GRPO advantages and accumulated for the Student update; (5) Based on the empirical variance of the rollouts, Teacher targets are computed and stored in the replay buffer; (6) The Teacher is asynchronously updated using data sampled from the replay buffer.
  • Figure 2: Evolution of validation accuracy over training steps.
  • Figure 3: Average Training Reward (Success Rate). The Goldilocks approach reaches higher training accuracy significantly earlier in training than the baseline.
  • Figure 4: Curriculum Mechanism. (a) The Teacher actively selects samples with higher reward variance. (b) This results in far fewer "wasted" inputs where the gradient is zero.
  • Figure 5: Optimization Dynamics. Goldilocks maintains larger gradient norms, preventing vanishing signals and providing a more robust optimization objective compared to the baseline.
  • ...and 7 more figures
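
The training cycle in the Figure 1 caption and the "wasted input" effect in the Figure 4 caption can be illustrated with a small toy sketch. It is not the authors' implementation. Several pieces are assumptions made for illustration: the Teacher is replaced by a plain dictionary of predicted reward variances (`predicted_variance`), the Student by a table of per-question success probabilities (`true_success`), the selection rule is "pick the candidate with the highest predicted variance", the Teacher target is the raw empirical variance of the binary rewards, and the advantage uses the commonly cited group-normalized form (reward minus group mean, divided by group standard deviation). The asynchronous Teacher update (step 6) is omitted.

```python
import random
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    If every rollout gets the same reward, all advantages are zero,
    so the prompt contributes no gradient signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def select_prompt(candidates, predicted_variance):
    """Hypothetical Teacher rule: highest predicted reward variance wins."""
    return max(candidates, key=lambda q: predicted_variance[q])


def training_step(dataset, true_success, predicted_variance,
                  replay_buffer, rng, k_candidate=4, g=8):
    """One toy pass through steps (1)-(5) of the Figure 1 caption."""
    # (1) Sample K_candidate questions at random from the dataset.
    candidates = rng.sample(dataset, k_candidate)
    # (2) The Teacher stand-in picks one prompt from the candidate pool.
    prompt = select_prompt(candidates, predicted_variance)
    # (3) The Student stand-in produces G rollouts with binary rewards.
    p = true_success[prompt]
    rewards = [1.0 if rng.random() < p else 0.0 for _ in range(g)]
    # (4) Advantages that would weight the Student's gradient.
    advantages = grpo_advantages(rewards)
    # (5) Teacher target from the empirical reward variance (assumed form).
    target = statistics.pvariance(rewards)
    replay_buffer.append((prompt, target))
    return prompt, rewards, advantages


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = ["easy", "mid", "hard"]
    true_success = {"easy": 1.0, "mid": 0.5, "hard": 0.0}
    predicted_variance = {"easy": 0.0, "mid": 0.25, "hard": 0.0}
    buffer = []
    prompt, rewards, advantages = training_step(
        dataset, true_success, predicted_variance, buffer, rng,
        k_candidate=3, g=8,
    )
    print(prompt, rewards, advantages, buffer)
```

For binary rewards with success probability p, the reward variance is p(1 - p), which is zero at p = 0 or p = 1 and largest at p = 0.5. In this sketch, a question the Student always solves or always fails yields all-zero advantages, which corresponds to the zero-gradient inputs described in the Figure 4 caption.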