Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue

TL;DR

This work proposes ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs), which learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement.

Abstract

Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
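
To make the bandit formulation concrete, here is a minimal sketch of one online stochastic mirror descent step under partial feedback. It assumes the textbook entropic mirror map, which gives an exponentiated-gradient (EXP3-style) update over a tabular distribution. The paper's neural curator and its derived loss are not reproduced here; the function name `osmd_step`, the step size `eta`, and the toy numbers are illustrative assumptions.

```python
import math

def osmd_step(probs, chosen, gain, eta=0.1):
    """One entropic mirror-descent step on the probability simplex.

    probs  : current sampling distribution over problems (assumed tabular)
    chosen : index of the problem that was actually selected
    gain   : observed reward for that problem only (partial feedback)
    eta    : step size (hypothetical value)
    """
    # Importance-weighted gain estimate: nonzero only for the selected problem.
    est = [0.0] * len(probs)
    est[chosen] = gain / probs[chosen]
    # Multiplicative update, then renormalize (the KL projection onto the simplex).
    weights = [p * math.exp(eta * g) for p, g in zip(probs, est)]
    total = sum(weights)
    return [w / total for w in weights]

probs = [0.25, 0.25, 0.25, 0.25]
new_probs = osmd_step(probs, chosen=2, gain=0.5, eta=0.1)
```

In this toy run the selected problem had positive gain, so its probability rises while the unobserved problems shrink proportionally. A non-stationary setting such as the one described above would typically also need some form of forgetting or exploration, which this sketch omits.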

Paper Structure

This paper contains 90 sections, 11 theorems, 100 equations, 9 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

$\mathbb{E}[\hat{U}_{\boldsymbol{x}}^t]=u_{\boldsymbol{x}}^t$.
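
This page does not define the estimator $\hat{U}_{\boldsymbol{x}}^t$. As an illustration of what an unbiasedness statement of this kind means, the sketch below assumes the standard inverse-propensity form (observed gain divided by selection probability if the problem was selected, zero otherwise) and checks the expectation exactly with rational arithmetic. The names `ipw_estimate` and `expected_estimate` and all numbers are hypothetical; the paper's estimator, which the theorem list suggests also covers two-stage sampling, may differ.

```python
from fractions import Fraction

def ipw_estimate(selected, x, observed_gain, probs):
    """Inverse-propensity estimate for problem x: gain / p_x if x was selected, else 0."""
    return observed_gain / probs[x] if selected == x else Fraction(0)

def expected_estimate(x, true_gain, probs):
    """Exact expectation of the estimate over which problem gets selected."""
    return sum(
        probs[s] * ipw_estimate(s, x, true_gain[x], probs)
        for s in range(len(probs))
    )

# Toy selection probabilities and per-problem gains (made-up values).
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
true_gain = [Fraction(1, 10), Fraction(3, 10), Fraction(-1, 5)]
expectations = [expected_estimate(x, true_gain, probs) for x in range(3)]
```

Each entry of `expectations` equals the corresponding entry of `true_gain`: the factor $p_x$ from the selection probability cancels the $1/p_x$ weight. The same cancellation holds when the observed gain is a noisy but unbiased measurement, by linearity of expectation.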

Figures (9)

  • Figure 1: Co-adaptive online training loop of Actor-Curator. At each RL step, a learned curator adaptively selects problems from a large problem bank instead of uniform sampling. The actor is updated on these problems, after which a bandit-style reward based on post-update policy improvement trains the curator. As the actor improves, the curator adapts to prioritize problems that yield the greatest expected performance gains.
  • Figure 2: Single training iteration of Actor-Curator. At each iteration, a candidate set of problems is sampled from a fixed proposal distribution. The curator reweights this candidate set to select training problems for the actor. After the actor update, per-problem policy improvement is estimated using pre- and post-update policies. The curator observes bandit feedback only on selected problems and is updated using a PPO-style approximation of online stochastic mirror descent.
  • Figure 3: Training dynamics. Actor-Curator test performance over training on three datasets, showing faster convergence and higher final accuracy.
  • Figure 4: Training speed-up. Actor-Curator attains high test accuracy with significantly fewer steps.
  • Figure 7: Ablation results. (a) Actor-Curator is compatible with alternative actor update methods such as GRPO and yields consistent performance gains. (b) The combination of the OSMD curator objective and the policy-improvement target achieves superior performance compared to alternative targets and loss functions. (c) Varying curator model size leads to similar long-term performance, indicating robustness to curator capabilities.
  • ...and 4 more figures
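
The selection step described in the Figure 2 caption (sample a candidate set from a fixed proposal, then let the curator reweight it) can be sketched as follows. This is a toy illustration under stated assumptions: the proposal is uniform without replacement, the curator is a plain list of scores turned into a softmax, and the batch is drawn with replacement. None of these choices, nor the name `select_problems`, is confirmed by the captions.

```python
import math
import random

def select_problems(bank_size, scores, num_candidates, batch_size, rng):
    """Two-stage selection: fixed proposal -> candidate set -> curator reweighting."""
    # Stage 1: fixed proposal (assumed uniform, without replacement).
    candidates = rng.sample(range(bank_size), num_candidates)
    # Stage 2: softmax over curator scores, restricted to the candidates.
    top = max(scores[c] for c in candidates)  # subtract max for numerical stability
    weights = [math.exp(scores[c] - top) for c in candidates]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw the training batch from the reweighted candidates (with replacement).
    batch = rng.choices(candidates, weights=probs, k=batch_size)
    return candidates, probs, batch

rng = random.Random(0)
scores = [0.0] * 100   # hypothetical curator scores for a 100-problem bank
scores[7] = 5.0        # one problem the curator currently favors
candidates, probs, batch = select_problems(
    100, scores, num_candidates=16, batch_size=4, rng=rng
)
```

Restricting the softmax to a sampled candidate set keeps the per-step cost independent of the bank size, which is one plausible reason for a two-stage design. The returned `probs` are the selection probabilities that a bandit-style estimator would need for importance weighting.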

Theorems & Definitions (11)

  • Theorem 1: Unbiasedness
  • Theorem 2
  • Theorem 3: Unbiasedness (single-stage)
  • Theorem 4: Unbiasedness (two-stage)
  • Theorem 5: Restatement of the regret theorem (source label thm:regret)
  • Lemma 1
  • Lemma 2: Unbiasedness
  • Lemma 3: Bounded Second Moment
  • Lemma 4
  • Lemma 5
  • ...and 1 more