AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting

Renda Li, Hailang Huang, Fei Wei, Feng Xiong, Yong Wang, Xiangxiang Chu

TL;DR

AdaCuRL tackles gradient starvation and policy degradation in Group Relative Policy Optimization (GRPO) training by introducing an adaptive curriculum RL framework with coarse-to-fine difficulty estimation, bucket-based training, and a self-pacing extension. It integrates a sparse KL strategy and an adaptive reference to prevent degradation, and it revisits historical data to mitigate forgetting. The approach yields consistent performance gains over GRPO and SFT on both multimodal and unimodal reasoning benchmarks, including notable improvements in mathematical reasoning for Qwen models and additional gains from Re-AdaCuRL. The work demonstrates that dynamically aligning data difficulty with model capability, combined with data revisitation, can substantially improve reasoning in both LLMs and MLLMs without relying on labor-intensive CoT annotations.

Abstract

Reinforcement learning (RL) has demonstrated considerable potential for enhancing reasoning in large language models (LLMs). However, existing methods suffer from Gradient Starvation and Policy Degradation when training directly on samples with mixed difficulty. To mitigate this, prior approaches leverage Chain-of-Thought (CoT) data, but the construction of high-quality CoT annotations remains labor-intensive. Alternatively, curriculum learning strategies have been explored but frequently encounter challenges, such as difficulty mismatch, reliance on manual curriculum design, and catastrophic forgetting. To address these issues, we propose AdaCuRL, an Adaptive Curriculum Reinforcement Learning framework that integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling. This approach dynamically aligns data difficulty with model capability and incorporates a data revisitation mechanism to mitigate catastrophic forgetting. Furthermore, AdaCuRL employs adaptive reference and sparse KL strategies to prevent Policy Degradation. Extensive experiments across diverse reasoning benchmarks demonstrate that AdaCuRL consistently achieves significant performance improvements on both LLMs and MLLMs.
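
To make the Gradient Starvation issue concrete, the following is a minimal sketch of group-normalized advantages as used in GRPO-style training. It assumes that an "invalid" sample is a prompt whose rollouts all receive the same reward (all correct or all wrong); that reading, the helper names `group_advantages` and `is_invalid_group`, and the `eps` constant are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage per rollout: (r - mean) / (std + eps).

    `eps` is a hypothetical stabilizer to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_invalid_group(rewards):
    """Assumption: a group is 'invalid' when every rollout has the same reward."""
    return max(rewards) == min(rewards)


# A prompt that is too easy (or too hard) yields identical rewards,
# so every advantage is zero and the prompt contributes no policy gradient.
too_easy = [1, 1, 1, 1]
mixed = [1, 0, 1, 0]
```

Under this reading, a batch dominated by prompts that are far too easy or far too hard for the current policy produces mostly zero advantages, which is one way to motivate matching data difficulty to model capability.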

Paper Structure

This paper contains 40 sections, 9 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: Cumulative invalid samples during GRPO training: shuffled data (Baseline) vs curriculum learning (Ours) on standard open-source datasets.
  • Figure 2: The overall framework of AdaCuRL. Difficulty Estimation (left) samples a training subset from a large-scale dataset to match a target difficulty distribution and sorts the data from easy to hard. Curriculum Reinforcement Learning (right) monitors the average accuracy reward during training to assess the model’s mastery of the current difficulty level and progressively introduces more challenging samples. In addition, AdaCuRL incorporates sparse KL and adaptive reference mechanisms to prevent degradation of the model’s reasoning capability.
  • Figure 3: (Left) The proportion of samples from each of the three coarse-grained groups ($\mathcal{G}_1/\mathcal{G}_2/\mathcal{G}_3$) that fall into each of the three fine-grained groups (F-$\mathcal{G}_1/\mathcal{G}_2/\mathcal{G}_3$) after fine-grained estimation. (Right) The difficulty distribution of coarse-grained sampling compared to that after fine-grained difficulty estimation.
  • Figure 4: Training dynamics under AdaCuRL curriculum scheduling and randomly shuffled data. (Left) Accuracy reward. (Right) KL loss.
  • Figure 5: Reward and completion length during training with different difficulty distributions using Qwen2.5-VL-3B.
  • ...and 3 more figures
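
The scheduling loop described in the Figure 2 caption (monitor the average accuracy reward, then introduce harder samples once the current level is mastered) can be sketched roughly as below. This is a simplified illustration, not the authors' algorithm: the class name `CurriculumScheduler`, the mastery `threshold`, and the smoothing `window` are all hypothetical, and the sparse KL, adaptive reference, and historical revisiting components are omitted.

```python
from collections import deque


class CurriculumScheduler:
    """Advance through easy-to-hard buckets when the recent reward is high enough."""

    def __init__(self, buckets, threshold=0.7, window=3):
        self.buckets = buckets            # list of buckets, sorted easy -> hard
        self.threshold = threshold        # hypothetical mastery threshold
        self.level = 0                    # index of the current bucket
        self.recent = deque(maxlen=window)  # last `window` average rewards

    def current_bucket(self):
        return self.buckets[self.level]

    def update(self, avg_accuracy_reward):
        """Record one step's average accuracy reward; return True if we advanced."""
        self.recent.append(avg_accuracy_reward)
        window_full = len(self.recent) == self.recent.maxlen
        mastered = (
            window_full
            and sum(self.recent) / len(self.recent) >= self.threshold
        )
        if mastered and self.level < len(self.buckets) - 1:
            self.level += 1
            self.recent.clear()  # start measuring mastery afresh on the new bucket
            return True
        return False


# Usage: three toy buckets, advancing after two consecutive strong steps.
sched = CurriculumScheduler([["e1"], ["m1"], ["h1"]], threshold=0.7, window=2)
first = sched.update(0.9)    # window not yet full
second = sched.update(0.8)   # mean 0.85 >= 0.7, so move to the next bucket
```

Clearing the window after each advance keeps rewards earned on easier data from counting toward mastery of the harder bucket; a real implementation would also need a policy for what to do when the final bucket is reached.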