Distillation-PPO: A Novel Two-Stage Reinforcement Learning Framework for Humanoid Robot Perceptive Locomotion

Qiang Zhang, Gang Han, Jingkai Sun, Wen Zhao, Chenghao Sun, Jiahang Cao, Jiaxu Wang, Yijie Guo, Renjing Xu

TL;DR

This work tackles perceptive locomotion for humanoid robots by proposing Distillation-PPO (D-PPO), a two-stage RL framework that couples teacher-guided regularization with online reinforcement learning. A teacher policy is first trained in a fully observable MDP to provide guidance. A student policy then learns in a partially observable setting under domain randomization: it is optimized with PPO while a distillation loss keeps it close to the teacher. The approach integrates an elevation-map-based perception pipeline and a history-state encoder to support sim-to-real transfer and stability on complex terrain. The authors report higher training efficiency and stability in simulation, and better robustness and generalization in real-world experiments with a humanoid robot on stairs, slopes, and irregular terrain. The method is extensible to other legged robots and combines distillation with reinforcement learning for perceptive control under partial observability.

Abstract

In recent years, humanoid robots have garnered significant attention from both academia and industry due to their high adaptability to environments and human-like characteristics. With the rapid advancement of reinforcement learning, substantial progress has been made in the walking control of humanoid robots. However, existing methods still face challenges when dealing with complex environments and irregular terrains. In the field of perceptive locomotion, existing approaches are generally divided into two-stage methods and end-to-end methods. Two-stage methods first train a teacher policy in a simulated environment and then use distillation techniques, such as DAgger, to transfer the privileged information it has learned to the student policy, either as latent features or as actions. End-to-end methods, on the other hand, forgo the learning of privileged information and learn policies directly in a partially observable Markov decision process (POMDP) through reinforcement learning. However, because they lack supervision from a teacher policy, end-to-end methods are often difficult to train and exhibit unstable performance in real-world applications. This paper proposes a two-stage perceptive locomotion framework that uses a teacher policy learned in a fully observable Markov decision process (MDP) to regularize and supervise the student policy. At the same time, it uses reinforcement learning so that the student policy continues to learn in the POMDP, thereby raising the model's performance upper bound. Our experimental results demonstrate that our two-stage training framework achieves higher training efficiency and stability in simulated environments, while also exhibiting better robustness and generalization capabilities in real-world applications.

Paper Structure

This paper contains 13 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Walking capabilities of the humanoid robot Tien Kung on different terrains after training with D-PPO. The top part of the figure shows Tien Kung navigating a platform nearly as high as its shins. The middle part illustrates Tien Kung's ability to walk on a sloped surface and cross a small ditch. The bottom part displays the terrain reconstruction by our perception system and the robot's posture while overcoming obstacles. Tien Kung's posture clearly varies across terrains. On flat ground, Tien Kung maintains a relatively straight-knee posture to achieve a broader field of view. When crossing obstacles, Tien Kung bends its knees to reduce impact and swings its arms to maintain balance.
  • Figure 2: The training framework of Distillation-PPO adopts a symmetric structure for the teacher and student networks. We did not introduce excessive privileged information into the teacher network because we found that much of the privileged information in simulations is inaccurate and can limit the performance of the student network on actual robots. We extensively use historical information to estimate the state, and the student network fully inherits the structure and parameters of the teacher network for initialization, similar to a self-distillation structure. During training, we simulate real-world conditions by increasing the noise in the input information of the student network (domain randomization). In our framework, the distillation loss acts as a regularization term, ensuring that the student network does not deviate from the teacher during training, while the reinforcement learning loss acts as a reward, ensuring the exploration efficiency and performance upper bound of the student network during training.
  • Figure 3: We provide a detailed demonstration of the visual perception system's performance in estimating posture on elevated terrains and during motion. The figure shows that our system can accurately estimate and reconstruct the terrain. By selecting scan points on this high-precision reconstructed terrain and leveraging the noise robustness trained by D-PPO, our robot can adapt well to various complex terrains.
  • Figure 4: High precision demonstration in terrain reconstruction.
  • Figure 5: The humanoid robot Tien Kung accurately walks on stair terrain and can descend slopes with ease.
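
The Figure 2 caption describes a student objective in which a distillation loss regularizes a reinforcement learning loss, with extra noise injected into the student's inputs. The following is a minimal NumPy sketch of that idea, not the paper's implementation. It assumes the standard PPO clipped surrogate for the RL term and a mean-squared-error penalty between student and teacher actions for the distillation term. The function names, the `distill_weight` coefficient, and the Gaussian noise model are hypothetical and chosen only for illustration.

```python
import numpy as np


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, negated so it can be minimized."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


def distillation_loss(student_actions, teacher_actions):
    """MSE between student and teacher actions (assumed form of the regularizer)."""
    return np.mean((student_actions - teacher_actions) ** 2)


def combined_student_loss(logp_new, logp_old, advantages,
                          student_actions, teacher_actions,
                          distill_weight=1.0, clip_eps=0.2):
    """RL term plus weighted distillation term (distill_weight is hypothetical)."""
    rl_term = ppo_clip_loss(logp_new, logp_old, advantages, clip_eps)
    reg_term = distillation_loss(student_actions, teacher_actions)
    return rl_term + distill_weight * reg_term


def add_observation_noise(obs, noise_std, rng):
    """Perturb student observations with Gaussian noise (assumed noise model)."""
    return obs + rng.normal(loc=0.0, scale=noise_std, size=obs.shape)


# Toy usage with random data: 4 samples, 3-dimensional actions.
rng = np.random.default_rng(0)
teacher_actions = rng.normal(size=(4, 3))
student_actions = teacher_actions + 0.1 * rng.normal(size=(4, 3))
logp_old = rng.normal(size=4)
logp_new = logp_old + 0.05 * rng.normal(size=4)
advantages = rng.normal(size=4)
loss = combined_student_loss(logp_new, logp_old, advantages,
                             student_actions, teacher_actions,
                             distill_weight=0.5)
noisy_obs = add_observation_noise(np.zeros((4, 8)), noise_std=0.02, rng=rng)
```

In this sketch a larger `distill_weight` keeps the student closer to the teacher, while a smaller one gives the PPO term more freedom to improve the policy under partial observability. How the paper actually balances or schedules the two terms is not stated in the text above.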