Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

Zhengyao Gu; Jonathan Light; Raul Astudillo; Ziyu Ye; Langzhou He; Henry Peng Zou; Wei Cheng; Santiago Paternain; Philip S. Yu; Yisong Yue

TL;DR

This work proposes ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs), which learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement.

Abstract

Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
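The bandit view in the abstract can be illustrated with a toy, tabular sketch. It assumes a negative-entropy mirror map (so the OSMD step becomes an exponentiated-weights update in logit space), importance-weighted bandit feedback, and a small uniform exploration mix for numerical stability. The function names, the `explore` parameter, and the synthetic utilities are all hypothetical. The paper's actual curator is a neural model trained with a PPO-style approximation, which this sketch does not reproduce.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def osmd_update(logits, chosen, observed_utility, sample_probs, lr):
    """One entropic mirror-descent step under bandit feedback.

    Only the chosen problem receives a nonzero utility estimate; dividing by
    its sampling probability makes the estimate unbiased for every problem.
    """
    est = np.zeros_like(logits)
    est[chosen] = observed_utility / sample_probs[chosen]
    # With a negative-entropy regularizer, mirror ascent is additive in logits.
    return logits + lr * est


def run_toy_curator(true_utils, steps=3000, lr=0.02, explore=0.1, seed=0):
    """Simulate a curator choosing among problems with fixed synthetic utilities."""
    rng = np.random.default_rng(seed)
    true_utils = np.asarray(true_utils, dtype=float)
    k = true_utils.size
    logits = np.zeros(k)
    for _ in range(steps):
        # Assumption: mix in uniform exploration so importance weights stay bounded.
        sample_probs = (1.0 - explore) * softmax(logits) + explore / k
        chosen = rng.choice(k, p=sample_probs)
        # Synthetic noisy stand-in for the measured per-problem policy improvement.
        observed = true_utils[chosen] + rng.normal(scale=0.05)
        logits = osmd_update(logits, chosen, observed, sample_probs, lr)
    return softmax(logits)


if __name__ == "__main__":
    print(run_toy_curator([0.1, 0.2, 0.9]))
```

In this toy setting the utilities are stationary, so the distribution simply concentrates on the most useful problem. In the setting the abstract describes, utilities drift as the actor improves, which is why the problem is framed as non-stationary.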

Paper Structure

This paper contains 90 sections, 11 theorems, 100 equations, 9 figures, 4 tables, 2 algorithms.

Introduction
Problem formulation
RL post-training setting
Training dynamics
Curriculum learning as problem selection
Challenges of adaptive problem selection
Method
Overview of the training loop
Policy improvement as the curator learning signal
Performance improvement identity
Per-problem utility
Tabular OSMD formulation
Utility estimation and OSMD bandit feedback
OSMD update
Function approximation for curator training
...and 75 more sections

Key Result

Theorem 1 (Unbiasedness)

$\mathbb{E}[\hat{U}_{\boldsymbol{x}}^t]=u_{\boldsymbol{x}}^t$.
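This summary does not reproduce the definition of $\hat{U}_{\boldsymbol{x}}^t$. As an illustration only, unbiasedness of this kind typically follows from importance weighting. The estimator form below, the selected set $S^t$, the inclusion probability $q_{\boldsymbol{x}}^t$, and the noisy observation $\tilde{u}_{\boldsymbol{x}}^t$ are assumptions made for this sketch and may differ from the paper's construction.

```latex
% Assumed (hypothetical) estimator: reweight the observed utility by the
% inverse probability that problem x is selected at step t.
\hat{U}_{\boldsymbol{x}}^t
  = \frac{\mathbb{1}\{\boldsymbol{x} \in S^t\}}{q_{\boldsymbol{x}}^t}\,
    \tilde{u}_{\boldsymbol{x}}^t,
\qquad
q_{\boldsymbol{x}}^t = \Pr(\boldsymbol{x} \in S^t) > 0,
\qquad
\mathbb{E}\!\left[\tilde{u}_{\boldsymbol{x}}^t \mid \boldsymbol{x} \in S^t\right]
  = u_{\boldsymbol{x}}^t .

% Taking expectations, the selection probability cancels:
\mathbb{E}\!\left[\hat{U}_{\boldsymbol{x}}^t\right]
  = \frac{\Pr(\boldsymbol{x} \in S^t)}{q_{\boldsymbol{x}}^t}\,
    \mathbb{E}\!\left[\tilde{u}_{\boldsymbol{x}}^t \mid \boldsymbol{x} \in S^t\right]
  = u_{\boldsymbol{x}}^t .
```

Under these assumptions, problems that are never selected contribute a zero estimate, yet the estimate remains correct on average because selected problems are upweighted by $1/q_{\boldsymbol{x}}^t$.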

Figures (9)

Figure 1: Co-adaptive online training loop of Actor-Curator. At each RL step, a learned curator adaptively selects problems from a large problem bank instead of uniform sampling. The actor is updated on these problems, after which a bandit-style reward based on post-update policy improvement trains the curator. As the actor improves, the curator adapts to prioritize problems that yield the greatest expected performance gains.
Figure 2: Single training iteration of Actor-Curator. At each iteration, a candidate set of problems is sampled from a fixed proposal distribution. The curator reweights this candidate set to select training problems for the actor. After the actor update, per-problem policy improvement is estimated using pre- and post-update policies. The curator observes bandit feedback only on selected problems and is updated using a PPO-style approximation of online stochastic mirror descent.
Figure 3: Training dynamics. Actor-Curator test performance over training on three datasets, showing faster convergence and higher final accuracy.
Figure 4: Training speed-up. Actor-Curator attains high test accuracy with significantly fewer steps.
Figure 7: Ablation results. (a) Actor-Curator is compatible with alternative actor update methods such as GRPO and yields consistent performance gains. (b) The combination of the OSMD curator objective and the policy-improvement target achieves superior performance compared to alternative targets and loss functions. (c) Varying curator model size leads to similar long-term performance, indicating robustness to curator capabilities.
...and 4 more figures

Theorems & Definitions (11)

Theorem 1: Unbiasedness
Theorem 2
Theorem 3: Unbiasedness (single-stage)
Theorem 4: Unbiasedness (two-stage)
Theorem 5: Restatement of the regret theorem (thm:regret)
Lemma 1
Lemma 2: Unbiasedness
Lemma 3: Bounded Second Moment
Lemma 4
Lemma 5
...and 1 more
