On the Benefit of Optimal Transport for Curriculum Reinforcement Learning

Pascal Klink, Carlo D'Eramo, Jan Peters, Joni Pajarinen

TL;DR

This paper reframes curriculum reinforcement learning as constrained optimal transport between task distributions to ensure gradual, geometry-aware progression of task difficulty. By replacing KL-based similarity and/or pure performance constraints with a Wasserstein-OT formulation, the authors introduce currot, a curriculum method that concentrates probability mass on contexts meeting a performance threshold, and compare it to gradient, which relies on Wasserstein barycenters between initial and target distributions. Through theoretical discussion and extensive experiments across discrete and continuous context spaces, the work demonstrates that OT-based curricula yield faster and more reliable learning, especially in settings with infeasible target tasks or non-Gaussian task distributions. The results highlight the importance of explicit task similarity measures and adaptive constraint handling, and point to future directions in learned distance metrics and hybrid adaptive curricula.

Abstract

Curriculum reinforcement learning (CRL) allows solving complex tasks by generating a tailored sequence of learning tasks, starting from easy ones and subsequently increasing their difficulty. Although the potential of curricula in RL has been clearly shown in various works, it is less clear how to generate them for a given learning environment, resulting in various methods aiming to automate this task. In this work, we focus on framing curricula as interpolations between task distributions, which has previously been shown to be a viable approach to CRL. Identifying key issues of existing methods, we frame the generation of a curriculum as a constrained optimal transport problem between task distributions. Benchmarks show that this way of curriculum generation can improve upon existing CRL methods, yielding high performance in various tasks with different characteristics.

Paper Structure (30 sections, 32 equations, 23 figures, 3 tables, 2 algorithms)

Figures (23)

  • Figure 1: Our approach (currot) addresses problems of existing curriculum RL methods, such as sprl, which create curricula between a distribution of initial tasks (blue) and a distribution of target tasks (green). In this example, the curriculum can change the task via two parameters $c_1$ and $c_2$, leading to more or less challenging learning environments for an agent. Looking at the different stages of the curricula (colored points), we see that existing methods can lead to distributions that encode hard and easy tasks but ignore tasks of intermediate difficulty. Our method avoids such splitting behavior, resulting in interpolations that gradually increase the task difficulty throughout the curriculum. Please see Sections \ref{sec:currot:crlasot} and \ref{sec:currot:approx_currot} for a detailed description.
  • Figure 2: Interpolations generated by optimizing Objective (\ref{eq:currot:kl-interp}) for different values of $\epsilon$ (and with that $\alpha$). In the top row, $p_1(c)$ and $p_2(c)$ are Gaussian, while in the bottom row, they assign uniform density over different parts of $\mathcal{C}$.
  • Figure 3: Wasserstein barycenters $\mathcal{B}([\alpha, 1{-}\alpha], [p_1, p_2])$ between the distributions shown in Figure \ref{fig:currot:kl_interpolations}. In the top row, $p_1(c)$ and $p_2(c)$ are Gaussian, while in the bottom row, they assign uniform density over different parts of $\mathcal{C}$.
  • Figure 4: Interpolations using KL divergence (top) and Wasserstein distance (bottom) subject to an expected performance constraint with different threshold values $\delta$. The performance $J(\pi, c)$ is visualized in green.
  • Figure 5: Interpolations generated by gradient (Eq. \ref{eq:currot:gradient}, top) and currot (Eq. \ref{eq:currot:currot}, bottom) for different threshold values $\delta$. The performance $J(\pi, c)$ is visualized in green.
  • ...and 18 more figures
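
For intuition about the Wasserstein barycenters $\mathcal{B}([\alpha, 1{-}\alpha], [p_1, p_2])$ mentioned in the Figure 3 caption, the following is a minimal sketch and not the paper's implementation. It assumes a one-dimensional context space, a squared Euclidean ground cost, and two sample sets of equal size; under these assumptions the optimal coupling pairs sorted samples, so the barycenter can be obtained by averaging matched quantiles. The function name `barycenter_1d` and the example distributions are hypothetical and chosen only for illustration.

```python
import numpy as np


def barycenter_1d(samples_p1, samples_p2, alpha):
    """Empirical 2-Wasserstein barycenter of two 1D sample sets.

    Assumption: both sets have the same number of samples. The weight
    `alpha` is placed on p1 and `1 - alpha` on p2, mirroring the
    notation B([alpha, 1 - alpha], [p1, p2]).
    """
    x1 = np.sort(np.asarray(samples_p1, dtype=float))
    x2 = np.sort(np.asarray(samples_p2, dtype=float))
    if x1.shape != x2.shape:
        raise ValueError("this sketch assumes equal-size sample sets")
    # In 1D, optimal transport matches the i-th smallest sample of p1 to
    # the i-th smallest sample of p2; the barycenter moves each matched
    # pair to its weighted average.
    return alpha * x1 + (1.0 - alpha) * x2


# Hypothetical example: uniform mass on [0, 1] (p1) and on [4, 5] (p2).
p1 = np.linspace(0.0, 1.0, 101)
p2 = np.linspace(4.0, 5.0, 101)

for alpha in (1.0, 0.5, 0.0):
    b = barycenter_1d(p1, p2, alpha)
    print(f"alpha={alpha:.1f}: support approx. [{b.min():.2f}, {b.max():.2f}]")
```

In this toy setting the intermediate distribution for $\alpha = 0.5$ lies between the two supports, on contexts that neither $p_1$ nor $p_2$ covers. A plain mixture of $p_1$ and $p_2$ would instead keep all of its mass on the two original supports.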