DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training

Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, Wentian Zhao

TL;DR

Post-training large language models on mixed data distributions lacks adaptive scheduling. DUMP introduces an automated distribution-level curriculum using the expected absolute advantage as a learnability proxy and a UCB-based bandit scheduler to allocate training across distributions. The framework formalizes learnability, presents a practical algorithm with sliding-window statistics and GRPO, and demonstrates faster convergence and stronger performance on logic and math datasets. The work provides theoretical justification and practical benefits for improving sample efficiency in RL-based LLM post-training, with potential for scaling to larger and multimodal models.

Abstract

Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions, varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distributions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.
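
The scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not taken from the DUMP repository: the class name `UCBScheduler`, the exploration constant `c`, the window size, the softmax `temperature`, and the exact form of the bonus term are all assumptions chosen for illustration. Each distribution keeps a sliding window of recent absolute advantages (the exploitation term) and a sample count (which drives the exploration bonus), and the resulting UCB scores are turned into sampling probabilities.

```python
import math
import random
from collections import deque


class UCBScheduler:
    """Toy distribution-level scheduler (illustrative, not the authors' code)."""

    def __init__(self, names, window=100, c=1.0, temperature=0.1):
        self.names = list(names)
        self.c = c                      # assumed exploration constant
        self.temperature = temperature  # assumed softmax temperature
        # Sliding window of recent |advantage| values per distribution.
        self.recent = {n: deque(maxlen=window) for n in self.names}
        # Total number of samples observed per distribution.
        self.counts = {n: 0 for n in self.names}

    def update(self, name, abs_advantages):
        """Record absolute advantages observed on one distribution."""
        self.recent[name].extend(abs_advantages)
        self.counts[name] += len(abs_advantages)

    def ucb_scores(self):
        """Windowed mean |advantage| plus a bonus that shrinks with sample count."""
        total = sum(self.counts.values())
        scores = {}
        for n in self.names:
            if self.counts[n] == 0:
                scores[n] = float("inf")  # unseen distributions are tried first
                continue
            mean_adv = sum(self.recent[n]) / len(self.recent[n])
            bonus = self.c * math.sqrt(2.0 * math.log(total + 1) / self.counts[n])
            scores[n] = mean_adv + bonus
        return scores

    def sampling_probs(self):
        """Softmax over UCB scores; unseen distributions share all the mass."""
        scores = self.ucb_scores()
        unseen = [n for n in self.names if math.isinf(scores[n])]
        if unseen:
            return {n: (1.0 / len(unseen) if n in unseen else 0.0) for n in self.names}
        top = max(scores.values())
        weights = {n: math.exp((scores[n] - top) / self.temperature) for n in self.names}
        z = sum(weights.values())
        return {n: w / z for n, w in weights.items()}

    def sample_batch(self, batch_size):
        """Draw distribution names for the next training batch."""
        probs = self.sampling_probs()
        return random.choices(self.names, weights=[probs[n] for n in self.names], k=batch_size)
```

In a training loop, one would call `sample_batch` to decide which distributions the next prompts come from, run the RL step, and then pass the observed absolute advantages back through `update`. A lower `temperature` makes the schedule greedier; a larger `c` favors under-sampled distributions for longer.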

Paper Structure

This paper contains 16 sections, 3 theorems, 13 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Given a policy $\pi_{\theta}$ and a data distribution $d$, the expected absolute advantage $\mathbb{E}_{x \sim d} \left[ \mathbb{E}_{o_i \sim \pi_\theta(\cdot|x)} \left[ |\hat{A}_i| \right] \right]$ serves as a proxy for how much that distribution $d$ can help the model improve, where the distributi
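
As a rough illustration of the quantity named in Theorem 3.1, the sketch below estimates the expected absolute advantage from sampled rewards. It assumes GRPO-style group normalization, where each response's advantage is its reward minus the group mean, divided by the group standard deviation; the function names, the `eps` stabilizer, and the use of the population standard deviation are illustrative choices, and real implementations may differ.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the responses sampled for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def expected_abs_advantage(reward_groups):
    """Monte Carlo estimate of E_x[ E_o[ |A| ] ] over prompts from one distribution.

    reward_groups: one list of rewards per prompt, one reward per sampled response.
    """
    per_prompt = []
    for rewards in reward_groups:
        adv = group_advantages(rewards)
        per_prompt.append(sum(abs(a) for a in adv) / len(adv))
    return sum(per_prompt) / len(per_prompt)
```

Under this normalization, a prompt whose responses all receive the same reward (all solved or all failed) contributes zero, while a prompt with mixed outcomes contributes a value near one. A distribution that is either fully mastered or entirely out of reach therefore scores low, which matches the intuition that it currently offers little learning signal.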

Figures (3)

  • Figure 1: Effectiveness of DUMP on the K&K puzzle dataset mixed with 12 distributions defined by the number of characters in each puzzle (Setting 1). DUMP consistently achieves a higher answer reward on the test dataset than the baseline. The model used here is Qwen2.5-7B-Instruct-1M.
  • Figure 2: Curriculum (sample counts) induced by DUMP across 12 K&K puzzle distributions with increasing difficulty defined by the number of characters in each puzzle (Setting 1). Simpler distributions are automatically prioritized in early training, while more complex ones are progressively emphasized later, with no manual scheduling.
  • Figure 3: Example of prompt used.

Theorems & Definitions (5)

  • Theorem 3.1: Expected Advantage Magnitude Reflects Learnability
  • Theorem A.1: Expected Advantage Magnitude Reflects Learnability
  • proof
  • Theorem B.1
  • proof