Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts

Onur Celik, Aleksandar Taranovic, Gerhard Neumann

TL;DR

This paper tackles the problem of learning diverse skills in reinforcement learning by replacing the usual Gaussian policy with a contextual mixture-of-experts (MoE) policy. Each expert encodes a skill as a contextual motion primitive and has its own per-expert context distribution π(c|o). This distribution is modeled as an energy-based model, so it can represent multi-modal context preferences and hard discontinuities in the context space, and it gives each expert an automatic curriculum. Training uses a maximum-entropy objective in the contextual episodic policy search (CEPS) setting, with trust-region updates to stabilize the bi-level optimization of the expert policies and their context distributions. On challenging robot simulation tasks, Di-SkilL discovers diverse, high-performing skills, often outperforms the baselines, and benefits from the automatic curricula in terms of sample efficiency. The method requires no prior knowledge of the environment's context distribution.

Abstract

Reinforcement learning (RL) is a powerful approach for acquiring a well-performing policy. However, learning diverse skills is challenging in RL due to the commonly used Gaussian policy parameterization. We propose Diverse Skill Learning (Di-SkilL; videos and code are available on the project webpage: https://alrhub.github.io/di-skill-website/), an RL method for learning diverse skills using a Mixture of Experts, where each expert formalizes a skill as a contextual motion primitive. Di-SkilL optimizes each expert and its associated context distribution with respect to a maximum entropy objective that incentivizes learning diverse skills in similar contexts. The per-expert context distribution enables automatic curriculum learning, allowing each expert to focus on its best-performing sub-region of the context space. To overcome hard discontinuities and multi-modalities without any prior knowledge of the environment's unknown context probability space, we leverage energy-based models to represent the per-expert context distributions and demonstrate how we can efficiently train them using the standard policy gradient objective. We show on challenging robot simulation tasks that Di-SkilL can learn diverse and performant skills.
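
As a reading aid, the mixture-of-experts structure the abstract refers to can be written out with the symbols used in the Figure 1 caption. This is a minimal sketch rather than the paper's exact formulation: the expert prior $\pi(o)$ is an assumed symbol that does not appear in the text above, and the second line is simply Bayes' rule relating the gating probabilities to the per-expert context distributions.

```latex
\begin{align}
\pi(\boldsymbol{\theta} \mid \boldsymbol{c})
  &= \sum_{o=1}^{K} \pi(o \mid \boldsymbol{c})\, \pi(\boldsymbol{\theta} \mid \boldsymbol{c}, o), \\
\pi(o \mid \boldsymbol{c})
  &= \frac{\pi(\boldsymbol{c} \mid o)\, \pi(o)}
          {\sum_{o'=1}^{K} \pi(\boldsymbol{c} \mid o')\, \pi(o')}.
\end{align}
```

Under this reading, an expert whose context distribution $\pi(\boldsymbol{c} \mid o)$ concentrates on a sub-region of the context space also receives a high gating probability there, which is what lets each expert specialize.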

Paper Structure

This paper contains 36 sections, 33 equations, 27 figures, 7 tables, and 2 algorithms.

Figures (27)

  • Figure 1: The sampling procedure for Di-SkilL. During inference, the agent observes contexts $\boldsymbol{\mathrm{c}}$ from the environment's unknown context distribution ${p}\left(\boldsymbol{\mathrm{c}}\right)$. The agent calculates the gating probabilities ${\pi}(o|\boldsymbol{\mathrm{c}})$ for each context and samples an expert $o$, resulting in the $(o, \boldsymbol{\mathrm{c}})$ samples marked in blue. During training, we first sample a batch of contexts $\boldsymbol{\mathrm{c}}$ from ${p}\left(\boldsymbol{\mathrm{c}}\right)$, which is used to calculate the per-expert context distribution ${\pi}(\boldsymbol{\mathrm{c}}|o)$ for each expert $o = 1, \dots, K$. Each ${\pi}(\boldsymbol{\mathrm{c}}|o)$ assigns a higher probability to contexts preferred by the expert ${\pi}(\boldsymbol{\mathrm{\theta}} | \boldsymbol{\mathrm{c}}, o)$. To enable curriculum learning, we provide each expert with contexts sampled from its corresponding ${\pi}(\boldsymbol{\mathrm{c}}|o)$, resulting in the samples $(o, \boldsymbol{\mathrm{c}}_T)$ marked in orange. In both cases, the chosen expert ${\pi}(\boldsymbol{\mathrm{\theta}} | \boldsymbol{\mathrm{c}}, o)$ samples motion primitive parameters $\boldsymbol{\mathrm{\theta}}$ for each context, resulting in a trajectory $\tau$ that is subsequently executed in the environment. Before execution, the corresponding context, e.g., the goal position of a box, needs to be set in the environment. This is illustrated by the dashed arrows, with the context shown in blue for inference and in orange for training.
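
The sampling procedure described in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the energy scores are random stand-ins for a learned energy-based model, the per-expert context distribution is normalized over the sampled batch with a softmax, the gating is obtained from it via Bayes' rule with an assumed expert prior, and the experts are assumed to be linear-Gaussian. All function and variable names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def context_log_probs(energies):
    """log pi(c|o), normalized over the batch; energies has shape (K, N)."""
    return log_softmax(energies, axis=1)


def gating_probs(energies, log_prior):
    """pi(o|c) via Bayes' rule (assumed prior pi(o)); returns shape (N, K)."""
    log_joint = context_log_probs(energies) + log_prior[:, None]  # (K, N)
    return np.exp(log_softmax(log_joint, axis=0)).T


def training_pairs(rng, energies, n_per_expert):
    """Curriculum sampling: each expert o draws contexts from its pi(c|o)."""
    probs = np.exp(context_log_probs(energies))
    n_experts, n_contexts = probs.shape
    pairs = []
    for o in range(n_experts):
        idx = rng.choice(n_contexts, size=n_per_expert, p=probs[o])
        pairs.extend((o, int(i)) for i in idx)
    return pairs


def inference_pairs(rng, energies, log_prior):
    """Inference sampling: each observed context draws an expert from pi(o|c)."""
    gate = gating_probs(energies, log_prior)
    return [(int(rng.choice(gate.shape[1], p=g)), i) for i, g in enumerate(gate)]


def sample_theta(rng, weights, biases, std, contexts, pairs):
    """Assumed linear-Gaussian expert: theta ~ N(W_o c + b_o, std^2 I)."""
    return np.stack([
        weights[o] @ contexts[i] + biases[o]
        + std * rng.standard_normal(biases[o].shape)
        for o, i in pairs
    ])


# Usage with random stand-ins for the learned quantities.
rng = np.random.default_rng(0)
K, N, context_dim, theta_dim = 3, 32, 2, 4
contexts = rng.uniform(-1.0, 1.0, size=(N, context_dim))  # batch from p(c)
energies = rng.standard_normal((K, N))           # stand-in for the EBM scores
log_prior = np.log(np.full(K, 1.0 / K))          # assumed uniform pi(o)
weights = rng.standard_normal((K, theta_dim, context_dim))
biases = rng.standard_normal((K, theta_dim))

train = training_pairs(rng, energies, n_per_expert=5)   # orange samples
infer = inference_pairs(rng, energies, log_prior)       # blue samples
thetas = sample_theta(rng, weights, biases, 0.1, contexts, train)
```

Normalizing over the sampled batch means contexts that never occur under $p(\boldsymbol{\mathrm{c}})$ are never proposed to an expert, which is one simple way to avoid assuming anything about the bounds of the context space.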