Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts
Onur Celik, Aleksandar Taranovic, Gerhard Neumann
TL;DR
This paper tackles learning diverse skills in reinforcement learning by replacing the commonly used Gaussian policy with a contextual mixture-of-experts (MoE) policy. Each expert encodes a skill as a contextual motion primitive and has its own context distribution pi(c|o), modeled as an energy-based model. This distribution can represent multi-modal context preferences and hard discontinuities in the context space, and it gives each expert an automatic curriculum. Training uses a maximum-entropy objective in the contextual episode-based policy search (CEPS) setting, with trust-region updates to stabilize the bi-level optimization of expert policies and context distributions. Empirical results on challenging robot tasks show that Di-SkilL can discover and combine diverse, high-performing skills across unseen contexts, often outperforming baselines and requiring fewer samples thanks to the automatic curricula. The work advances multi-modal skill acquisition in RL and demonstrates practical gains for adaptive, context-driven control without prior knowledge of the environment's context space.
Abstract
Reinforcement learning (RL) is a powerful approach for acquiring a well-performing policy. However, learning diverse skills is challenging in RL due to the commonly used Gaussian policy parameterization. We propose Diverse Skill Learning (Di-SkilL; videos and code are available on the project webpage: https://alrhub.github.io/di-skill-website/), an RL method for learning diverse skills using a Mixture of Experts, where each expert formalizes a skill as a contextual motion primitive. Di-SkilL optimizes each expert and its associated context distribution with respect to a maximum entropy objective that incentivizes learning diverse skills in similar contexts. The per-expert context distribution enables automatic curriculum learning, allowing each expert to focus on its best-performing sub-region of the context space. To overcome hard discontinuities and multi-modalities without any prior knowledge of the environment's unknown context probability space, we leverage energy-based models to represent the per-expert context distributions and demonstrate how we can efficiently train them using the standard policy gradient objective. We show on challenging robot simulation tasks that Di-SkilL can learn diverse and performant skills.
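
To make the structure described above more concrete, the following is a minimal sketch of how a per-expert, energy-based context distribution could feed a mixture of experts. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy linear energy (`energy_w`), and the linear-Gaussian experts (`expert_K`, `expert_k`, `expert_std`) are all hypothetical. The sketch assumes that pi(c|o) is evaluated as a softmax over a batch of contexts sampled from the environment, which is one plausible way to keep every selected context valid without knowing the bounds of the context space.

```python
import numpy as np

def softmax_rows(x):
    # Numerically stable softmax along the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_skills(contexts, energy_w, expert_K, expert_k, expert_std, rng):
    """For each expert o, pick a context c ~ pi(c|o) from the sampled batch,
    then draw motion-primitive parameters theta ~ pi(theta|c, o).

    contexts:  (n_ctx, d_c)           contexts sampled from the environment
    energy_w:  (n_exp, d_c)           toy linear energy per expert (assumption)
    expert_K:  (n_exp, d_theta, d_c)  linear map of each expert's mean (assumption)
    expert_k:  (n_exp, d_theta)       offset of each expert's mean (assumption)
    expert_std: scalar standard deviation shared by all experts (assumption)
    """
    energies = energy_w @ contexts.T      # (n_exp, n_ctx)
    probs = softmax_rows(energies)        # pi(c|o) restricted to the batch
    n_exp, n_ctx = probs.shape
    # One context index per expert, drawn from that expert's distribution.
    idx = np.array([rng.choice(n_ctx, p=probs[o]) for o in range(n_exp)])
    chosen = contexts[idx]                # (n_exp, d_c)
    # Linear-Gaussian expert: mean depends on the chosen context.
    means = np.einsum("otc,oc->ot", expert_K, chosen) + expert_k
    thetas = means + expert_std * rng.standard_normal(means.shape)
    return idx, probs, thetas

rng = np.random.default_rng(0)
contexts = rng.uniform(-1.0, 1.0, size=(16, 2))
idx, probs, thetas = sample_skills(
    contexts,
    energy_w=rng.standard_normal((3, 2)),
    expert_K=rng.standard_normal((3, 5, 2)),
    expert_k=np.zeros((3, 5)),
    expert_std=0.1,
    rng=rng,
)
```

In this sketch each row of `probs` is one expert's preference over the sampled contexts, so an expert whose energy is high only in part of the batch concentrates its samples there, which mirrors the curriculum idea of focusing on a favored sub-region of the context space.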
