Proximal Curriculum with Task Correlations for Deep Reinforcement Learning

Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla

TL;DR

ProCuRL-Target addresses efficient curriculum design for contextual multi-task RL by balancing task difficulty and transfer toward a target distribution through task correlations. It derives a gradient-alignment framework in the single-target discrete setting and extends to general target distributions via sampling and softmax selection, formalizing the key score as $Z_t(c) Z_t(c_{\text{targ}}) \langle \psi(c), \psi(c_{\text{targ}}) \rangle$. Empirical results across PM-s, SGR, MiniGrid, and BW show faster convergence and strong performance against baselines, including in bimodal and sparse-target scenarios, while maintaining computational efficiency. The approach is readily integrable with common deep RL algorithms (e.g., PPO) and applicable to arbitrary target distributions, offering practical benefits for scalable, goal-directed curriculum design in real-world settings.
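The score in the TL;DR can be illustrated with a minimal sketch. It assumes the per-context quantities $Z_t(c)$ and the feature vectors $\psi(c)$ are already available as plain numbers; the summary does not say how they are computed, so they are treated as given inputs here. The function names and the inverse-temperature parameter `beta` are hypothetical, and this is not the authors' implementation.

```python
import math
import random


def alignment_scores(z, psi, z_targ, psi_targ):
    """Score each candidate context c as Z(c) * Z(c_targ) * <psi(c), psi(c_targ)>.

    z and psi are lists over candidate contexts (assumed precomputed inputs).
    """
    return [
        z_c * z_targ * sum(a * b for a, b in zip(psi_c, psi_targ))
        for z_c, psi_c in zip(z, psi)
    ]


def softmax(scores, beta=1.0):
    """Numerically stable softmax; beta is a hypothetical inverse temperature."""
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_task(z, psi, z_targ, psi_targ, beta=1.0, rng=random):
    """Sample a context index with probability given by the softmax of the scores."""
    probs = softmax(alignment_scores(z, psi, z_targ, psi_targ), beta)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


# Toy inputs: three candidate contexts with 2-d features and one target context.
z = [0.2, 0.5, 0.1]
psi = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
z_targ, psi_targ = 0.3, [0.0, 1.0]

index, probs = select_task(z, psi, z_targ, psi_targ, beta=10.0)
```

In this toy example the second context receives the highest probability: its features are only partly aligned with the target, but its larger $Z$ value outweighs the perfectly aligned third context. For a general target distribution, one would first sample a target context and then apply the same selection step.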

Abstract

Curriculum design for reinforcement learning (RL) can speed up an agent's learning process and help it learn to perform well on complex tasks. However, existing techniques typically require domain-specific hyperparameter tuning, involve expensive optimization procedures for task selection, or are suitable only for specific learning objectives. In this work, we consider curriculum design in contextual multi-task settings where the agent's final performance is measured w.r.t. a target distribution over complex tasks. We base our curriculum design on the Zone of Proximal Development concept, which has proven to be effective in accelerating the learning process of RL agents for a uniform distribution over all tasks. We propose a novel curriculum, ProCuRL-Target, that effectively balances the need for selecting tasks that are not too difficult for the agent while progressing the agent's learning toward the target distribution by leveraging task correlations. We theoretically justify the task selection strategy of ProCuRL-Target by analyzing a simple learning setting with a REINFORCE learner model. Our experimental results across various domains with challenging target task distributions affirm the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents.

Paper Structure

This paper contains 19 sections, 2 theorems, 18 equations, and 3 figures.

Key Result

Theorem 1

Consider Algorithm alg:interaction with the REINFORCE learner model and the curriculum strategy defined in Eq. eq:curr-gradient-form. Then, after $t = \mathcal{O}\left(\log \frac{1}{\epsilon}\right)$ steps, we have: where $V^*(c) := \max_\pi V^\pi(c)$.

Figures (3)

  • Figure 1: (a) provides a comprehensive overview of the complexity of the environments based on the reward signals, context space, state space, action space, and target distribution. (b) showcases the environments by providing an illustrative visualization of each environment (from left to right): PM-s, SGR, MiniG, and BW.
  • Figure 2: Performance comparison of RL agents trained using different curriculum strategies. The performance is measured as the mean return ($\pm 1$ standard error) on the test pool of tasks. The results are averaged over $25$ random seeds for PM-s:1T, $25$ random seeds for PM-s:2G, $10$ random seeds for SGR, $20$ random seeds for MiniG, and $10$ random seeds for BW. The plots are smoothed across $2$ evaluation snapshots that occur over $25000$ training steps.
  • Figure 3: (a-b) present the average distance between the selected contexts C-GatePosition and C-GateWidth and the target distribution for PM-s:2G. (c) presents the two-dimensional context space of PM-s:2G. The target distribution is depicted as a black x and encodes the two gates with C-GateWidth$=0.5$ at C-GatePosition$=\{-3.9, 3.9\}$. Each colored dot represents the context/task selected by ProCuRL-Target during training, where brighter colors indicate later training stages. (d) presents the average C-Tolerance of the selected tasks during different curriculum strategies for SGR. (e) presents the two-dimensional context space of BW. The target distribution is uniform. Each colored dot represents the context/task selected by ProCuRL-Target during training, where brighter colors indicate later training stages.
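The aggregation described in the Figure 2 caption (mean return $\pm 1$ standard error across seeds, smoothed over $2$ evaluation snapshots) can be sketched as follows. The standard error uses the usual sample standard deviation divided by $\sqrt{n}$. The caption does not specify the smoothing scheme, so the trailing moving average below is an assumption, and the function names are hypothetical.

```python
import math
import statistics


def mean_and_standard_error(returns_per_seed):
    """Mean and standard error of the mean over seeds (needs at least 2 seeds)."""
    n = len(returns_per_seed)
    mean = statistics.fmean(returns_per_seed)
    standard_error = statistics.stdev(returns_per_seed) / math.sqrt(n)
    return mean, standard_error


def smooth(values, window=2):
    """Trailing moving average over `window` snapshots (assumed smoothing scheme)."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy data: returns of three seeds at one evaluation snapshot.
mean, standard_error = mean_and_standard_error([1.0, 2.0, 3.0])

# Toy data: mean returns at three consecutive evaluation snapshots.
curve = smooth([0.0, 2.0, 4.0], window=2)
```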

Theorems & Definitions (4)

  • Theorem 1
  • Proposition 1
  • proof
  • proof