MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning

Suning Huang, Zheyu Zhang, Tianhai Liang, Yihan Xu, Zhehao Kou, Chenhao Lu, Guowei Xu, Zhengrong Xue, Huazhe Xu

TL;DR

MENTOR introduces a mixture-of-experts (MoE) backbone into visual reinforcement learning to reduce gradient conflicts, paired with a task-oriented perturbation strategy that samples from top-performing agents to guide exploration. The approach improves sample efficiency and outperforms state-of-the-art methods across three simulation benchmarks. On three challenging real-world robotic manipulation tasks it achieves an average success rate of 83%, versus 32% for the strongest existing model-free visual RL algorithm. The paper also demonstrates the advantages of MoE in multi-task and multi-stage settings and shows robustness to disturbances in real-world manipulation. Together, these contributions push toward more practical, data-efficient visual RL for real-world robotics.

Abstract

Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multi-layer perceptron (MLP) with a mixture-of-experts (MoE) backbone and introduces a task-oriented perturbation mechanism. MENTOR outperforms state-of-the-art methods across three simulation benchmarks and achieves an average of 83% success rate on three challenging real-world robotic manipulation tasks, significantly surpassing the 32% success rate of the strongest existing model-free visual RL algorithm. These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are available at https://suninghuang19.github.io/mentor_page/.
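
As an illustration of what replacing an MLP with an MoE backbone can look like, the sketch below routes a feature vector through the top-k experts chosen by a router and mixes their outputs with softmax weights. It is a minimal sketch under stated assumptions, not the authors' implementation: each expert is a single linear layer, the router uses top-k softmax gating, and all names (`moe_forward`, `init_linear`, `top_k`, the dimensions) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, in_dim, out_dim):
    """Random small weights and zero bias (hypothetical initialisation)."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
         for _ in range(out_dim)]
    return w, [0.0] * out_dim


def moe_forward(x, router_w, router_b, experts, top_k=2):
    """Route x to the top_k experts and return (output, gate weights).

    Assumption: gating is a softmax over the top_k router logits only,
    so unselected experts receive a weight of exactly zero.
    """
    logits = linear(router_w, router_b, x)
    top = sorted(range(len(logits)), key=lambda i: logits[i],
                 reverse=True)[:top_k]
    top_probs = softmax([logits[i] for i in top])

    gate = [0.0] * len(experts)
    for i, p in zip(top, top_probs):
        gate[i] = p

    out = [0.0] * len(experts[0][1])
    for i in top:
        w, b = experts[i]
        y = linear(w, b, x)
        out = [o + gate[i] * yi for o, yi in zip(out, y)]
    return out, gate


rng = random.Random(0)
feature_dim, action_dim, num_experts = 8, 3, 4  # illustrative sizes
router_w, router_b = init_linear(rng, feature_dim, num_experts)
experts = [init_linear(rng, feature_dim, action_dim)
           for _ in range(num_experts)]
features = [rng.uniform(-1.0, 1.0) for _ in range(feature_dim)]
action, gate = moe_forward(features, router_w, router_b, experts, top_k=2)
```

In this sketch the `gate` vector plays the role of an expert usage intensity: logging it per step or per task would show which experts are active for which inputs.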

Paper Structure

This paper contains 28 sections, 12 equations, 15 figures, 6 tables, 2 algorithms.

Figures (15)

  • Figure 2: Overview. MENTOR uses an MoE backbone with a CNN encoder to process visual inputs. A router selects and weights the relevant experts based on the inputs to generate the final actions. In addition to regular reinforcement learning updates, periodic task-oriented perturbations are applied during training by sampling from top-performing agents to adjust the current agent’s weights.
  • Figure 3: MoE in multi-task scenarios. Left: Expert usage intensity distribution of the MoE agent in opposing tasks. Right: Gradient conflict among opposing tasks for both MLP and MoE agents. The MLP agent frequently encounters gradient conflicts (indicated by negative cosine similarity) when learning multiple skills, while the MoE agent avoids these conflicts (indicated by positive values). We also provide a comparison of gradient conflicts for MLP and MoE agents in single-task settings, as detailed in Appendix \ref{app:multi-stage-gradient}.
  • Figure 4: MoE in multi-stage scenarios. We present the expert usage intensity during the Assembly task in Meta-World. While Expert 15 remains highly active throughout the entire process, other experts are activated with varying intensity over time, automatically dividing the task into four distinct stages.
  • Figure 5: Validation of task-oriented perturbation on Hopper Hop, comparing MENTOR, DrM, and DrQ-v2 agents trained on the task during the first 1M frames. Our method consistently achieves higher episode rewards and a lower dormant ratio. (c) shows that the episode reward obtained by the perturbation candidate $\phi$ sampled from $\Phi_{\text{oriented}}$ steadily increases and occasionally surpasses that of the corresponding RL agent (replotted as the light red line), whereas in DrM the reward remains at zero because the perturbation parameters are randomly generated.
  • Figure 6: Performance comparisons in simulations. This figure compares the performance of our method to DrM, DrQ-v2, ALIX, and TACO across 12 tasks with four random seeds in three different benchmarks (DMC, MW, and Adroit). The shaded region indicates standard deviation in DMC and the range of success rates in MW and Adroit.
  • ...and 10 more figures
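
Two mechanisms named in the captions above can be sketched in a few lines: the gradient-conflict measure (cosine similarity between two task gradients, negative when they conflict) and a task-oriented perturbation that pulls the current weights toward a candidate drawn from top-performing agents. This is a rough sketch under assumptions the captions do not confirm: $\Phi_{\text{oriented}}$ is modelled here as a per-parameter Gaussian fitted to the stored top agents, the update is a linear interpolation with a hypothetical factor `alpha`, and weights are flat lists of floats.

```python
import math
import random
import statistics


def cosine_similarity(g1, g2):
    """Cosine similarity of two gradient vectors; negative means conflict."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)


def update_top_agents(top_agents, weights, episode_reward, capacity):
    """Keep the `capacity` highest-reward weight snapshots seen so far."""
    top_agents.append((episode_reward, list(weights)))
    top_agents.sort(key=lambda item: item[0], reverse=True)
    del top_agents[capacity:]


def sample_oriented(top_agents, rng):
    """Draw a candidate from a per-parameter Gaussian over the top agents.

    Assumption: this Gaussian stands in for the distribution that the
    caption calls Phi_oriented; the paper may define it differently.
    """
    columns = zip(*(w for _, w in top_agents))
    return [rng.gauss(statistics.fmean(col), statistics.pstdev(col))
            for col in columns]


def perturb(weights, candidate, alpha):
    """Interpolate: alpha * current + (1 - alpha) * candidate (assumed form)."""
    return [alpha * w + (1.0 - alpha) * c
            for w, c in zip(weights, candidate)]


rng = random.Random(0)
top_agents = []
history = [(10.0, [1.0, 2.0]), (50.0, [3.0, 4.0]), (40.0, [3.2, 3.8])]
for reward, snapshot in history:
    update_top_agents(top_agents, snapshot, reward, capacity=2)

candidate = sample_oriented(top_agents, rng)
new_weights = perturb([0.0, 0.0], candidate, alpha=0.9)
conflict = cosine_similarity([1.0, 0.0], [-1.0, 0.5])  # opposing gradients
```

By contrast, a perturbation whose candidate is drawn at random with no regard to past performance carries no task information, which is the difference the Figure 5 caption attributes to DrM.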

Theorems & Definitions (2)

  • Definition 2.1
  • Definition 2.2