Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning

Jie Ren, Yewen Li, Zihan Ding, Wei Pan, Hao Dong

TL;DR

The paper addresses the limitation of unimodal policies in DRL by introducing a differentiable probabilistic mixture-of-experts (PMOE) policy based on a Gaussian mixture model. It proposes a novel Frequency Approximate Gradient to handle the non-differentiability of routing among multiple primitives, which enables end-to-end learning with both off-policy and on-policy algorithms such as SAC and PPO. Empirical results on six MuJoCo tasks show that PMOE substantially improves AUC over unimodal policies and other MOE baselines, and further analyses show distinguishable primitives and enhanced exploration. The approach is reported to be robust to noise, while its performance depends on the number of primitives; it offers practical benefits for efficient exploration and skill discovery in continuous control tasks.

Abstract

Deep reinforcement learning (DRL) has successfully solved various problems recently, typically with a unimodal policy representation. However, grasping distinguishable skills for some tasks with non-unique optima can be essential for further improving its learning efficiency and performance, which may lead to a multimodal policy represented as a mixture-of-experts (MOE). To the best of our knowledge, present general-purpose DRL algorithms do not deploy this method as a policy function approximator due to the potential challenge in its differentiability for policy learning. In this work, we propose a probabilistic mixture-of-experts (PMOE) implemented with a Gaussian mixture model (GMM) for multimodal policy, together with a novel gradient estimator for the indifferentiability problem, which can be applied in generic off-policy and on-policy DRL algorithms using stochastic policies, e.g., Soft Actor-Critic (SAC) and Proximal Policy Optimisation (PPO). Experimental results demonstrate the advantage of our method over unimodal policies and two different MOE methods, as well as an option-framework method, based on the above two types of DRL algorithms, on six MuJoCo tasks. Different gradient estimators for GMM, such as the reparameterisation trick (Gumbel-Softmax) and the score-ratio trick, are also compared with our method. We further empirically demonstrate the distinguishable primitives learned with PMOE and show the benefits of our method in terms of exploration.
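
To make the idea of a GMM policy concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-dimensional action and fixed mixture parameters; in a real agent these would be network outputs conditioned on the state. The names `sample_gmm_action` and `gmm_log_prob` are hypothetical. The sketch shows the two sampling stages: a categorical routing draw over primitives, which is the non-differentiable step, followed by a Gaussian draw from the chosen primitive.

```python
import math
import random


def sample_gmm_action(weights, means, stds, rng):
    """Draw one scalar action from a K-component Gaussian mixture.

    Hypothetical helper for illustration only.
    weights: routing probabilities over primitives (positive, sum to 1)
    means, stds: per-primitive Gaussian parameters
    Returns (k, action), where k is the index of the chosen primitive.
    """
    # Routing step: a discrete categorical draw. Gradients cannot flow
    # through this choice by ordinary back-propagation.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Primitive step: Gaussian draw in reparameterised form (mean + std * noise).
    eps = rng.gauss(0.0, 1.0)
    return k, means[k] + stds[k] * eps


def gmm_log_prob(action, weights, means, stds):
    """Log-density of the mixture at `action`, via a stable log-sum-exp."""
    logs = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((action - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


if __name__ == "__main__":
    rng = random.Random(0)
    k, a = sample_gmm_action([0.5, 0.5], [-2.0, 2.0], [0.1, 0.1], rng)
    print("primitive", k, "action", a)
    print("log-prob", gmm_log_prob(a, [0.5, 0.5], [-2.0, 2.0], [0.1, 0.1]))
```

With two well-separated primitives, as in the usage lines above, the sampled actions cluster around two distinct modes, which a single Gaussian policy cannot represent.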

Paper Structure

This paper contains 26 sections, 1 theorem, 21 equations, 11 figures, 6 tables, 1 algorithm.

Key Result

Theorem 3.1

The accumulated frequency approximate gradient is an asymptotically unbiased estimation of the true gradient for the sampling process from a categorical distribution in the routing function, with a batch of $N \to \infty$ samples.
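
The definition of the frequency approximate gradient is not reproduced in this summary, so the sketch below does not implement it. It only illustrates, under that caveat, the intuition behind the $N \to \infty$ condition: the empirical frequencies of samples from a categorical distribution approach the true probabilities as the batch grows. The function names and the example probabilities are assumptions made for illustration.

```python
import random


def empirical_frequencies(probs, n, rng):
    """Fraction of n categorical draws that land on each index."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[k] += 1
    return [c / n for c in counts]


def max_abs_error(probs, n, rng):
    """Largest gap between empirical frequency and true probability."""
    freqs = empirical_frequencies(probs, n, rng)
    return max(abs(f - p) for f, p in zip(freqs, probs))


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.2, 0.3, 0.5]  # hypothetical routing probabilities
    for n in (10, 1_000, 100_000):
        print(n, max_abs_error(probs, n, rng))
```

The printed error typically shrinks as the batch size increases, which is the sense in which a frequency-based estimate becomes accurate only asymptotically.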

Figures (11)

  • Figure 1: Training curves on MuJoCo benchmarks with SAC-based algorithms. PMOE uses $K=4$ in all experiments except HalfCheetah-v2 ($K=2$) and HumanoidStandup-v2 and Humanoid-v2 ($K=10$).
  • Figure 2: Training curves on MuJoCo benchmarks with PPO-based algorithms. We set a larger number of primitives: $K=16$ for Ant-v2, $K=12$ for Hopper-v2, $K=4$ for Walker2D-v2 and HumanoidStandup-v2, and $K=8$ for Humanoid-v2 and HalfCheetah-v2.
  • Figure 3: Trajectories of the agents with our method and the baselines in the target-reaching environment. We fix the reset locations of the target, obstacles and agent. ($a$), ($b$), ($c$) and ($d$) visualise the 10 trajectories collected with the following methods: original SAC, gating operation with SAC, back-propagation-all PMOE (discussed in Sec. \ref{sec:backprog_all}) and back-propagation-max PMOE, respectively. ($e$) shows the trajectories collected with two individual primitives with our approach.
  • Figure 4: Visualisation of distinguishable primitives learned with PMOE using t-SNE plots on the Hopper-v2 environment. The states are first clustered as in ($b$). Then actions within the same state cluster are plotted with t-SNE as in ($a$) and ($c$) for the gating method and our approach, respectively. Our method clearly demonstrates more distinguishable primitives for the policy.
  • Figure 5: Visualisation of exploration trajectories in the initial training stage for the target-reaching environment. The initial $10K$ steps (the grey region on the learning curves in ($b$)) of exploration trajectories are plotted in ($a$) and ($c$) for our PMOE method (red) and SAC (blue), respectively. The green rectangle is the target region.
  • ...and 6 more figures

Theorems & Definitions (3)

  • Definition 3.1: Frequency Approximate Gradient
  • Theorem 3.1
  • proof