Lifelike Agility and Play in Quadrupedal Robots using Reinforcement Learning and Generative Pre-trained Models

Lei Han; Qingxu Zhu; Jiapeng Sheng; Chong Zhang; Tingguang Li; Yizheng Zhang; He Zhang; Yuzhen Liu; Cheng Zhou; Rui Zhao; Jie Li; Yufeng Zhang; Rui Wang; Wanchao Chi; Xiong Li; Yonghui Zhu; Lingzhu Xiang; Xiao Teng; Zhengyou Zhang

TL;DR

This work introduces a hierarchical, pre-trained control framework for quadrupedal robots that separates knowledge into primitive, environmental, and strategic levels. By training a Vector Quantized Primitive Motor Controller (VQ-PMC) on animal motion data and reusing its decoder to build Environmental–Primitive (EPMC) and Strategic–EPMC (SEPMC) controllers, the authors achieve lifelike agility and robust task performance on the MAX robot, including a challenging multi-agent chase-tag game. The approach combines imitation learning, distillation, and self-play (PFSP) within a three-stage RL framework, enabling zero-shot transfer to real hardware with domain randomization and onboard-sensing distillation. The results include real-world demonstrations of animal-like movement, complex obstacle traversal, and emergent strategic play, along with extensive ablations and comparisons to concurrent methods, underscoring the framework’s generality and practical impact for robotics.
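The "vector quantized" part of VQ-PMC refers to discrete latent embeddings: a continuous encoder output is replaced by its nearest entry in a learned codebook before being decoded. The following is a minimal sketch of that lookup step only, not the authors' implementation. The function name `quantize`, the squared-L2 distance, and the toy codebook values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def quantize(z: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and value of the codebook entry nearest to z.

    Nearness is measured by squared L2 distance (an assumption here;
    it is the usual choice in vector-quantized models).
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, list(codebook[index])


# Hypothetical 2-D codebook with three entries (illustrative values only).
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]

# A continuous encoder output is snapped to its nearest discrete embedding.
index, z_q = quantize([0.9, 0.8], CODEBOOK)
print(index, z_q)
```

In a full model the decoder would consume `z_q` rather than the raw encoder output, which is what allows the decoder to be reused by higher-level controllers that only need to choose among discrete embeddings.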

Abstract

Knowledge from animals and humans inspires robotic innovations. Numerous efforts have been made to achieve agile locomotion in quadrupedal robots through classical controllers or reinforcement learning approaches. These methods usually rely on physical models or handcrafted rewards to accurately describe the specific system, rather than on a generalized understanding of the kind animals have. Here we propose a hierarchical framework to construct primitive-, environmental- and strategic-level knowledge, all of which is pre-trainable, reusable and enrichable for legged robots. The primitive module summarizes knowledge from animal motion data, where, inspired by large pre-trained models in language and image understanding, we introduce deep generative models to produce motor control signals stimulating legged robots to act like real animals. Then, we shape various traversing capabilities at a higher level to align with the environment by reusing the primitive module. Finally, a strategic module is trained focusing on complex downstream tasks by reusing the knowledge from previous levels. We apply the trained hierarchical controllers to the MAX robot, a quadrupedal robot developed in-house, to mimic animals, traverse complex obstacles and play a challenging, purpose-designed multi-agent chase tag game, where lifelike agility and strategy emerge in the robots.

Paper Structure (24 sections, 26 equations, 8 figures, 4 tables)

MAIN
FRAMEWORK OVERVIEW
RESULTS
Primitive Behaviors
Traversing Complex Obstacles
Chase Tag Game
DISCUSSION
METHODS
Primitive-Level Training
Environmental-Level Training
Strategic-Level Training
Transferring to Reality
Supplementary Material
Rewards and Training Details
Rewards and Training Details in Primitive-Level Training
...and 9 more sections

Figures (8)

Figure 1: A framework overview of the proposed method. We initially train a PMC to imitate animal movements using discrete latent embeddings (Stage 1). The decoder of PMC is reused to train environmental-level controllers for general walking, fall recovery, creeping over narrow space, and traversing over hurdles, blocks and stairs separately, which are compressed into a uniform environmental-level controller by multi-expert distillation (Stage 2). At the final stage, we reuse the pre-trained environmental- and primitive-level networks to train a strategic-level network for solving a designed multi-agent chase tag game (Stage 3).
Figure 2: Evaluation of the primitive motor controllers. (A) Snapshots of the MAX robot imitating motion data on different terrains. (B) Comparison of the learning curves of VQ-PMC and $\beta$-VAE-based methods in imitation learning. The experiments are repeated three times to plot the mean curve and the shaded region (standard deviation). (C) Visualization of the generated trajectories from the VQ-PMC network using t-SNE. (D) Gait analysis for the generated movements from the primitive motor controllers. The plots show statistics over an entire walking trajectory with $\sim$1000 frames/samples. The bands indicate the maximum and minimum values. (E) Comparison of the tracking rewards in simulation and reality. For each motion type, the experiments are repeated three times to compute the reward statistics in the real world. In simulation, the environment dynamics are deterministic in this case, so the reward is a single deterministic value given the trained policy.
Figure 3: Performance evaluation of the environmental-primitive motor controllers. (A-D) Snapshots for creeping (A), ascending stairs (B), jumping over hurdles (C) and freerunning over blocks (D). (E) Success rate and output torque distribution for three elementary tasks in real-world experiments. Each elementary task is configured with a single corresponding obstacle. Each elementary experiment is repeated 10 times for success rate statistics. The torque distribution is computed from all samples of the 10 repetitions. (F) Comparison of the effectiveness of reusing different pre-trained primitive-level networks learned by VQ-PMC and $\beta$-VAE-based methods. The curves indicate the training of the environmental-level network on flat terrain. The experiments are repeated three times to plot the mean curve and the shaded region (standard deviation). (G) Comparison of the learning curves for different EPMC controllers, obtained by reusing the proposed primitive-level networks or by training from scratch. All the experiments are repeated three times to plot the mean curve and the shaded region (standard deviation).
Figure 4: Snapshots in the Chase Tag Game. (A) A case in which the chaser, MAX2, gives up chasing MAX1 when MAX2 estimates that it is not possible to catch MAX1 before MAX1 reaches the flag. (B) A case in which the chaser, MAX1, hesitates and wanders around. (C) A case in which the chaser, MAX2, pounces on MAX1. (D) A case in which the evader, MAX1, pounces on the flag. (E) A detailed analysis of the torques, angular velocities, linear velocities and root heights of the two robots in a complete game episode.
...and 3 more figures
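The Figure 1 caption describes compressing several task-specific environmental-level controllers into one controller by multi-expert distillation. The sketch below illustrates the general idea under stated assumptions: each sample is labeled with its task, the matching expert supplies the target action, and the student is scored by mean squared error. The MSE objective, the names `distillation_loss`, `EXPERTS`, and `BATCH`, and the toy linear policies are all hypothetical; the loss actually used in the paper is not given in this summary.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Policy = Callable[[Sequence[float]], List[float]]


def distillation_loss(student: Policy,
                      experts: Dict[str, Policy],
                      batch: List[Tuple[str, Sequence[float]]]) -> float:
    """Mean squared error between the student's action and the action of
    the expert responsible for each sample's task (assumed objective)."""
    total, count = 0.0, 0
    for task, obs in batch:
        target = experts[task](obs)   # the task's expert provides the label
        pred = student(obs)           # a single student covers every task
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        count += len(target)
    return total / count


# Hypothetical toy experts standing in for task-specific controllers.
EXPERTS: Dict[str, Policy] = {
    "stairs": lambda obs: [2.0 * x for x in obs],
    "hurdles": lambda obs: [x + 1.0 for x in obs],
}

# A toy student that happens to match the "stairs" expert exactly.
student: Policy = lambda obs: [2.0 * x for x in obs]

BATCH = [("stairs", [1.0, 2.0]), ("hurdles", [1.0, 3.0])]
loss = distillation_loss(student, EXPERTS, BATCH)
print(loss)
```

Here the student reproduces the "stairs" expert but not the "hurdles" expert, so the entire loss comes from the hurdle sample. In training, minimizing this quantity over all tasks pushes one network toward the behavior of every expert.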
