MS-PPO: Morphological-Symmetry-Equivariant Policy for Legged Robot Locomotion

Sizhe Wei, Xulin Chen, Fengze Xie, Garrett Ethan Katz, Zhenyu Gan, Lu Gan

TL;DR

MS-PPO addresses the limitations of morphology- and symmetry-agnostic policies in legged RL by embedding the robot's kinematic graph and morphological symmetry into a graph neural policy (actor) and an invariant value function (critic). The approach uses a morphological-symmetry-equivariant GNN for the policy and a symmetry-invariant GNN for the value, and reports better symmetry generalization, training stability, and sample efficiency than the baselines on two quadruped platforms, with successful sim-to-real deployment. It removes the need for reward shaping or data augmentation to enforce symmetry and is evaluated across several gaits and terrains. The work provides a principled inductive bias for legged locomotion control, and the authors state that the code will be released publicly.

Abstract

Reinforcement learning has recently enabled impressive locomotion capabilities on legged robots; however, most policy architectures remain morphology- and symmetry-agnostic, leading to inefficient training and limited generalization. This work introduces MS-PPO, a morphological-symmetry-equivariant policy learning framework that encodes robot kinematic structure and morphological symmetries directly into the policy network. We construct a morphology-informed graph neural architecture that is provably equivariant with respect to the robot's morphological symmetry group actions, ensuring consistent policy responses under symmetric states while maintaining invariance in value estimation. This design eliminates the need for tedious reward shaping or costly data augmentation, which are typically required to enforce symmetry. We evaluate MS-PPO in simulation on Unitree Go2 and Xiaomi CyberDog2 robots across diverse locomotion tasks, including trotting, pronking, slope walking, and bipedal turning, and further deploy the learned policies on hardware. Extensive experiments show that MS-PPO achieves superior training stability, symmetry generalization ability, and sample efficiency in challenging locomotion tasks, compared to state-of-the-art baselines. These findings demonstrate that embedding both kinematic structure and morphological symmetry into policy learning provides a powerful inductive bias for legged robot locomotion control. Our code will be made publicly available at https://lunarlab-gatech.github.io/MS-PPO/.
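To make the two properties named in the abstract concrete, the sketch below checks equivariance of an actor and invariance of a critic on a toy 12-joint quadruped. It is not the paper's GNN architecture: it uses plain group averaging over a two-element reflection group as a stand-in technique, and the leg ordering, the sign convention in `JOINT_SIGN`, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

# Toy quadruped: 4 legs (FL, FR, RL, RR) x 3 joints = 12-D observation and action.
# Hypothetical reflection g across the sagittal plane: swap left/right legs and
# flip the sign of the first joint of every leg (an assumed convention).
LEG_SWAP = [1, 0, 3, 2]                  # FL<->FR, RL<->RR
JOINT_SIGN = np.array([-1.0, 1.0, 1.0])

def reflect(x):
    """Apply the reflection g to a flat 12-D vector laid out as legs x joints."""
    legs = x.reshape(4, 3)
    return (legs[LEG_SWAP] * JOINT_SIGN).reshape(-1)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 12))
W2 = rng.normal(size=(12, 32))
w_v = rng.normal(size=32)

def raw_actor(x):
    """Unconstrained MLP-style actor: no symmetry built in."""
    return W2 @ np.tanh(W1 @ x)

def raw_critic(x):
    """Unconstrained scalar critic: no symmetry built in."""
    return float(w_v @ np.tanh(W1 @ x))

def equivariant_actor(x):
    # Group averaging over G = {e, g}; g is an involution, so g^{-1} = g.
    return 0.5 * (raw_actor(x) + reflect(raw_actor(reflect(x))))

def invariant_critic(x):
    # Averaging the value over the orbit {x, g x} makes it invariant.
    return 0.5 * (raw_critic(x) + raw_critic(reflect(x)))

x = rng.normal(size=12)
# Equivariance: acting on the input equals acting on the output.
equiv_err = np.max(np.abs(equivariant_actor(reflect(x)) - reflect(equivariant_actor(x))))
# Invariance: the value does not change under the reflection.
inv_err = abs(invariant_critic(reflect(x)) - invariant_critic(x))
# The unconstrained actor generally violates equivariance.
raw_err = np.max(np.abs(raw_actor(reflect(x)) - reflect(raw_actor(x))))
print(equiv_err, inv_err, raw_err)
```

Group averaging costs one extra forward pass per group element at every step, whereas an architecture that is equivariant by construction, as the abstract describes, builds the constraint into the network weights instead.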

Paper Structure

This paper contains 17 sections, 1 theorem, 12 equations, 6 figures, 3 tables.

Key Result

Theorem 1

Let $h$ be the encoder defined in Eq. (eq:h_def) [xie2025_mshgnn], and let $\ell_I$ be a feature mapping invariant with respect to the geometric symmetry. Define $f_{\mathcal{G}_I}(x)=\ell_I(z_{\mathcal{G}}(h(x)))$, where $z_{\mathcal{G}}$ is a geometric-symmetry-equivariant GNN. Then $f_{\mathcal{G}_I}$ ...
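The generic reason an equivariant map followed by an invariant readout yields an invariant function can be sketched as below. This is a standard argument under stated assumptions, not the paper's proof: the group $G$, the representations $\rho_X$, $\rho_H$, $\rho_Z$, and the assumption that $h$ is itself equivariant are introduced here for illustration only.

```latex
% Assumptions (illustrative), for every g in G:
%   h(\rho_X(g)\,x) = \rho_H(g)\,h(x)                          (equivariant encoder)
%   z_{\mathcal{G}}(\rho_H(g)\,u) = \rho_Z(g)\,z_{\mathcal{G}}(u)   (equivariant GNN)
%   \ell_I(\rho_Z(g)\,v) = \ell_I(v)                           (invariant readout)
\begin{align*}
f_{\mathcal{G}_I}(\rho_X(g)\,x)
  &= \ell_I\bigl(z_{\mathcal{G}}(h(\rho_X(g)\,x))\bigr) \\
  &= \ell_I\bigl(z_{\mathcal{G}}(\rho_H(g)\,h(x))\bigr) \\
  &= \ell_I\bigl(\rho_Z(g)\,z_{\mathcal{G}}(h(x))\bigr) \\
  &= \ell_I\bigl(z_{\mathcal{G}}(h(x))\bigr)
   = f_{\mathcal{G}_I}(x).
\end{align*}
```

Each line uses exactly one of the three assumptions, so the chain fails if any single component lacks its stated symmetry property.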

Figures (6)

  • Figure 1: Overview of MS-PPO: a graph-based policy that encodes both kinematic structure and morphological symmetry through a morphological-symmetry-equivariant GNN actor and a morphological-symmetry-invariant GNN critic, yielding improved symmetry generalization, training stability, and sample efficiency for various locomotion tasks on Unitree Go2 and Xiaomi CyberDog2.
  • Figure 2: Overall framework of MS-PPO. Common and privileged observations are first converted into a graph data structure based on the robot's kinematic structure. The MS-GNN-Equ actor network receives common observations and outputs the mean target joint positions, while the MS-GNN-Inv critic network takes both common and privileged observations as input to predict the state value. By design, MS-PPO encodes both kinematic structure and morphological symmetry in policy learning.
  • Figure 3: Simulation environments for four locomotion tasks on two quadrupedal robots, and hardware deployment on Go2.
  • Figure 4: Velocity tracking performance of four policies on the Walk-to-One-Side task in simulation. Results are averaged over 100 test episodes. Dashed lines indicate commanded velocities, solid lines show the mean tracked velocities, and shaded regions denote one standard deviation. Due to incompatibility with gait-related rewards, PPO-EMLP$^\ast$ is trained with $c_x = 0\,\mathrm{m/s}$, causing its $v_x$ to remain near zero. In both trotting and pronking simulations, MS-PPO converges to the target velocities faster and achieves more accurate tracking than the baseline methods.
  • Figure 5: Training rewards across four tasks. MI-PPO fails on the Stand-and-Turn task, while PPO-EMLP$^\ast$ is trained with $c_x = 0\,\mathrm{m/s}$ on the Walk-to-One-Side trotting task. In contrast, MS-PPO successfully learns all four tasks, converging faster than MI-PPO and at a rate comparable to or better than PPO-MLP, which indicates improved training stability and sample efficiency.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1: MS-GNN-Inv