Toward Understanding Key Estimation in Learning Robust Humanoid Locomotion

Zhicheng Wang; Wandi Wei; Ruiqi Yu; Jun Wu; Qiuguo Zhu

TL;DR

This paper addresses how the choice of state estimations influences learning-based humanoid locomotion by applying saliency analysis to quantify the importance of explicit versus implicit estimation variables. It introduces an asymmetric actor–critic framework with an encoder–decoder module that generates both explicit estimations and latent encodings, trained with PPO and a composite reward promoting base command tracking, gait, and energy efficiency. Key findings show velocity estimation is the most critical factor, heightmap information substantially aids terrain handling, and implicit encoding can improve sim-to-real transfer, with the best robustness achieved when selecting a compact set of estimations (Key2). The approach demonstrates real-world walking capabilities on stairs, slopes, and uneven terrain, offering practical guidance for estimation design in humanoid control and highlighting the complementary roles of explicit and implicit information for robust robotic locomotion.

Abstract

Accurate state estimation plays a critical role in the robust control of humanoid robots, particularly for learning-based control policies for legged robots. However, there is a notable gap in analytical research on these estimations. We therefore seek to understand how different types of estimations influence the decision-making of policies. In this paper, we provide quantitative insight into the effectiveness of learned state estimations, employing saliency analysis to identify key estimation variables and optimize their combination for humanoid locomotion tasks. Tracking precision and robustness are evaluated on comparative groups of policies with varying estimation combinations in both simulated and real-world environments. The results show that the proposed policy crosses the sim-to-real gap and outperforms the alternative policy configurations.
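The saliency analysis mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify the saliency method, so the example assumes a simple sensitivity measure (the absolute central-difference derivative of the action with respect to each input), normalized per sample and averaged over samples to give a relative importance. The function `toy_policy`, its weights, and the input names `velocity`, `height`, and `contact` are hypothetical placeholders, so the resulting ranking reflects only those arbitrary weights and is not a result from the paper.

```python
import math

def toy_policy(inputs):
    """Hypothetical stand-in for a trained policy: named inputs -> one action.
    The weights are arbitrary and chosen only for illustration."""
    u = 1.5 * inputs["velocity"] + 0.5 * inputs["height"] + 0.1 * inputs["contact"]
    return math.tanh(u)

def saliency(policy, inputs, eps=1e-5):
    """Absolute central-difference sensitivity of the action to each input."""
    scores = {}
    for name in inputs:
        up = dict(inputs)
        up[name] += eps
        down = dict(inputs)
        down[name] -= eps
        scores[name] = abs(policy(up) - policy(down)) / (2 * eps)
    return scores

def relative_importance(samples, policy):
    """Normalize saliency within each sample, then average over samples."""
    totals = {name: 0.0 for name in samples[0]}
    for sample in samples:
        scores = saliency(policy, sample)
        norm = sum(scores.values()) or 1.0  # guard against an all-zero gradient
        for name, value in scores.items():
            totals[name] += value / norm
    return {name: total / len(samples) for name, total in totals.items()}

# Hypothetical input samples; a real analysis would use logged rollouts.
samples = [
    {"velocity": 0.2, "height": 0.1, "contact": 1.0},
    {"velocity": -0.4, "height": 0.3, "contact": 0.0},
]
importance = relative_importance(samples, toy_policy)
```

The per-sample normalization makes the scores comparable across samples, which is one plausible way to obtain the kind of average relative importance that a pie chart or box plot would display. With a neural-network policy, automatic differentiation would normally replace the finite differences.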

Paper Structure (17 sections, 6 equations, 7 figures, 3 tables)

INTRODUCTION
METHODOLOGY
Asymmetric Policy Architecture
State and Action
Reward
Base command tracking
Gait
Smoothness and energy saving
Saliency Analysis
Training Environment Design
EXPERIMENT AND RESULT
Platform description
Saliency analysis
Comparison group setup
Performance metrics
...and 2 more sections

Figures (7)

Figure 1: Overview of the key estimation policy. By quantifying the importance of the explicit estimation states and designing the key estimation architecture, the policy achieves blind locomotion on a real Wukong-IV humanoid.
Figure 2: Architecture of the proposed policy. The actor consists of an auto-encoder $(\mu, \eta)$ and a backbone policy $\psi$. The encoder $\mu$ takes in 0.5 s of historical proprioception and generates two estimations: the implicit encoding ${z_t}$ and the explicit encoding ${\hat{e}_t}$. The decoder $\eta$ reconstructs the current proprioception from ${z_t}$ and ${\hat{e}_t}$. The explicit encoding ${\hat{e}_t}$ is fitted to the true values ${e_t}$ of the corresponding physical variables. Both encodings, along with the current proprioception ${o_t}$, serve as input to the backbone policy, which outputs actions. The critic's input includes ${o_t}$ and privileged information ${P_t}$, which contains ${e_t}$ and other useful data.
Figure 3: Saliency analysis of the estimation states. (A) Pie chart of the estimations' average relative importance. (B) Box plot of the ranges of the relative importance for all samples. The colored box spans the 25th to 75th percentiles of the samples, the horizontal line marks the median, and the error bar shows the boundaries where $p<0.05$.
Figure 4: Wukong-IV humanoid model. (A) Real-world Wukong-IV humanoid. (B) Simulated model in IsaacGym.
Figure 5: Reward plots of different trials. The solid lines are the filtered mean episode reward, and the shaded area marks the ranges of the episode reward values.
...and 2 more figures
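The actor data flow described in the Figure 2 caption can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the authors' network: every layer is a single random linear map, the class name `KeyEstimationActor`, the helpers `linear`, `affine`, and `mse`, and all dimensions are hypothetical, and the auxiliary loss (mean squared error for both the explicit estimation and the reconstruction) is an assumed form. The critic and the PPO update are omitted.

```python
import random

def linear(n_in, n_out, rng):
    """Random weights and zero bias; a placeholder for trained parameters."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def affine(layer, x):
    """Apply y = W x + b for one linear layer."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

class KeyEstimationActor:
    """Encoder mu, decoder eta, and backbone psi, with hypothetical sizes."""

    def __init__(self, obs_dim=6, history_len=5, z_dim=4, e_dim=3, act_dim=2, seed=0):
        rng = random.Random(seed)
        self.z_dim = z_dim
        self.encoder = linear(obs_dim * history_len, z_dim + e_dim, rng)  # mu
        self.decoder = linear(z_dim + e_dim, obs_dim, rng)                # eta
        self.backbone = linear(obs_dim + z_dim + e_dim, act_dim, rng)     # psi

    def forward(self, history):
        flat = [x for obs in history for x in obs]       # stacked proprioception history
        code = affine(self.encoder, flat)
        z, e_hat = code[:self.z_dim], code[self.z_dim:]  # implicit / explicit encodings
        o_recon = affine(self.decoder, z + e_hat)        # reconstruct current proprioception
        o_t = history[-1]
        action = affine(self.backbone, o_t + z + e_hat)  # policy sees o_t, z_t, e_hat_t
        return action, z, e_hat, o_recon

actor = KeyEstimationActor()
history = [[0.1 * (t + i) for i in range(6)] for t in range(5)]  # made-up observations
e_true = [0.5, 0.0, -0.2]  # made-up ground truth, e.g. from simulator privileged state
action, z, e_hat, o_recon = actor.forward(history)
aux_loss = mse(e_hat, e_true) + mse(o_recon, history[-1])
```

Splitting the encoder output into `z` and `e_hat` mirrors the caption's distinction: only `e_hat` is supervised against physical quantities, while `z` is shaped by the reconstruction objective alone.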
