Diverse Skill Discovery for Quadruped Robots via Unsupervised Learning

Ruopeng Cui, Yifei Bi, Haojie Luo, Wei Li

TL;DR

The paper tackles unsupervised skill discovery for quadruped locomotion, addressing the reward hacking and learning inefficiency that arise in mutual-information (MI) based approaches. It proposes MOD-Skill, which combines a multi-discriminator module that assigns intrinsic rewards across distinct observation subspaces with an Orthogonal Mixture-of-Experts (OMoE) policy that enforces orthogonality among motion features to yield diverse skills. On a 12-DOF Unitree A1, MOD-Skill achieves an 18.3% increase in state-space coverage over the baseline and demonstrates successful sim-to-real transfer, supporting both the diversity and the robustness of the learned skills. Overall, the approach reduces reliance on task-specific reward design and improves learning efficiency for rich, controllable locomotion repertoires in legged robots.

Abstract

Reinforcement learning necessitates meticulous reward shaping by specialists to elicit target behaviors, while imitation learning relies on costly task-specific data. In contrast, unsupervised skill discovery can potentially reduce these burdens by learning a diverse repertoire of useful skills driven by intrinsic motivation. However, existing methods exhibit two key limitations: they typically rely on a single policy to master a versatile repertoire of behaviors without modeling the shared structure or distinctions among them, which results in low learning efficiency; moreover, they are susceptible to reward hacking, where the reward signal increases and converges rapidly while the learned skills display insufficient actual diversity. In this work, we introduce an Orthogonal Mixture-of-Experts (OMoE) architecture that prevents diverse behaviors from collapsing into overlapping representations, enabling a single policy to master a wide spectrum of locomotion skills. In addition, we design a multi-discriminator framework in which different discriminators operate on distinct observation spaces, effectively mitigating reward hacking. We evaluate our method on the 12-DOF Unitree A1 quadruped robot, demonstrating a diverse set of locomotion skills. Our experiments show that the proposed framework boosts training efficiency and yields an 18.3% expansion in state-space coverage compared to the baseline.
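
To make the two components named in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes (a) expert features are orthogonalized with a Gram-Schmidt pass and (b) the skill reward averages a DIAYN-style term, log q_i(z | o_i) - log p(z), over the discriminators under a uniform skill prior. The function names, the averaging rule, and the choice of Gram-Schmidt are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def orthogonalize_experts(features, eps=1e-8):
    """Gram-Schmidt over expert feature vectors (hypothetical helper).

    features: array of shape (num_experts, dim) with num_experts <= dim.
    Returns an array of the same shape whose rows are orthonormal.
    """
    basis = []
    for v in np.asarray(features, dtype=float):
        for b in basis:
            # Strip the part already spanned by earlier experts.
            v = v - np.dot(v, b) * b
        norm = np.linalg.norm(v)
        if norm < eps:
            raise ValueError("expert features are linearly dependent")
        basis.append(v / norm)
    return np.stack(basis)


def multi_discriminator_reward(log_q, num_skills):
    """Average a DIAYN-style reward over discriminators (assumed aggregation).

    log_q: shape (num_discriminators,), where log_q[i] = log q_i(z | o_i) is
    the log-probability discriminator i assigns to the active skill z, given
    only its own observation subspace o_i.
    """
    log_prior = -np.log(num_skills)  # uniform prior p(z) = 1 / num_skills
    return float(np.mean(np.asarray(log_q, dtype=float) - log_prior))
```

With four skills, a discriminator at chance contributes 0 and a fully confident one contributes log 4. Under this averaging assumption, a policy whose skills are separable in only one of two observation subspaces earns half of the maximum reward, which is the intuition behind using several discriminators to discourage reward hacking.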

Paper Structure (13 sections, 10 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of the proposed MOD-Skill framework. The policy is optimized through reinforcement learning, and the discriminators are updated through supervised learning, with both cooperating to facilitate skill discovery.
  • Figure 2: Reward curves for algorithms SD1, SD2, SD3, and MD. The top subplot shows the full episode reward, while the bottom subplot presents only the skill reward component.
  • Figure 3: State Space Coverage. We roll out the skills discovered by each algorithm for 20 seconds, collect their linear velocity, angular velocity, and projected gravity observations, and normalize them by dimension.
  • Figure 4: Real-world experiment. The learned skills are deployed on a Unitree A1 robot, demonstrating reliable and robust execution in real-world environments.
  • Figure 5: Reward curves for algorithms OMoE, MoE, and MLP.
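
The Figure 3 caption describes collecting linear velocity, angular velocity, and projected gravity observations and normalizing them by dimension. One way such a coverage comparison could be computed is sketched below. The grid-occupancy metric, the bin count, and the shared min/max bounds are assumptions made for illustration; the caption does not state how coverage is measured.

```python
import numpy as np


def normalize_by_dimension(states, low, high):
    """Min-max scale each column of `states` to [0, 1] using shared bounds."""
    states = np.asarray(states, dtype=float)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    span = np.where(high > low, high - low, 1.0)  # guard constant dimensions
    return np.clip((states - low) / span, 0.0, 1.0)


def occupied_cells(states, low, high, bins=10):
    """Count distinct grid cells visited (an assumed proxy for coverage)."""
    unit = normalize_by_dimension(states, low, high)
    # A normalized value of exactly 1.0 is placed in the last bin.
    idx = np.minimum((unit * bins).astype(int), bins - 1)
    return len({tuple(row) for row in idx})
```

For a fair comparison, `low` and `high` would be taken from the pooled rollouts of all algorithms, for example `pooled.min(axis=0)` and `pooled.max(axis=0)`, so that every method is binned on the same grid.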