HiWET: Hierarchical World-Frame End-Effector Tracking for Long-Horizon Humanoid Loco-Manipulation
Zhanxiang Cao, Liyun Yan, Yang Zhang, Sirui Chen, Jianming Ma, Tianyue Zhan, Shengcheng Fu, Yufei Jia, Cewu Lu, Yue Gao
TL;DR
This paper tackles long-horizon humanoid loco-manipulation by formulating end-effector tracking in the world frame and solving it with a hierarchical RL framework. A high-level world-frame policy plans base motion and end-effector targets, while a low-level tracker executes commands with dynamic stability, guided by a pretrained Kinematic Manifold Prior (KMP) to stay within valid manipulation configurations. The approach demonstrates precise world-frame tracking in simulation (about 12–13 mm error) and robust zero-shot sim-to-real transfer on hardware, outperforming baselines and ablations in both tracking and navigation tasks. By explicitly separating global reasoning from local execution and grounding upper-body motion with the KMP, HiWET provides a scalable solution for long-horizon humanoid loco-manipulation, with practical implications for robust assistive robots and autonomous manipulation in dynamic environments.
Abstract
Humanoid loco-manipulation requires executing precise manipulation tasks while maintaining dynamic stability amid base motion and impacts. Existing approaches typically formulate commands in body-centric frames and therefore fail to inherently correct the cumulative world-frame drift induced by legged locomotion. We reformulate the problem as world-frame end-effector tracking and propose HiWET, a hierarchical reinforcement learning framework that decouples global reasoning from dynamic execution. The high-level policy generates subgoals that jointly optimize end-effector accuracy and base positioning in the world frame, while the low-level policy executes these commands under stability constraints. We introduce a Kinematic Manifold Prior (KMP) that embeds the manipulation manifold into the action space via residual learning, reducing exploration dimensionality and mitigating kinematically invalid behaviors. Extensive simulation and ablation studies demonstrate that HiWET achieves precise and stable end-effector tracking in long-horizon world-frame tasks. We validate zero-shot sim-to-real transfer of the low-level policy on a physical humanoid, demonstrating stable locomotion under diverse manipulation commands. These results indicate that explicit world-frame reasoning combined with hierarchical control provides an effective and scalable solution for long-horizon humanoid loco-manipulation.
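As a rough illustration of two ideas in the abstract, the sketch below shows (1) why a world-frame target corrects base drift when it is re-expressed in the base frame at every step, and (2) what composing an action as a prior plus a learned residual could look like. It is not the authors' implementation: the planar (x, y, yaw) base pose, the function names `world_to_base` and `residual_action`, and the `scale` parameter are all assumptions made for illustration.

```python
import math


def world_to_base(target_w, base_pose):
    """Express a planar world-frame point in the robot base frame.

    base_pose = (x, y, yaw) of the base in the world frame.
    This 2D simplification is an assumption for illustration only.
    """
    bx, by, yaw = base_pose
    dx, dy = target_w[0] - bx, target_w[1] - by
    c, s = math.cos(yaw), math.sin(yaw)
    # Apply the inverse base rotation (rotate the offset by -yaw).
    return (c * dx + s * dy, -s * dx + c * dy)


def residual_action(prior_joints, residual, scale=0.1):
    """Compose a joint target as prior output plus a scaled residual.

    `prior_joints` stands in for the output of a pretrained prior and
    `residual` for a policy output; both are hypothetical placeholders.
    """
    return [q + scale * r for q, r in zip(prior_joints, residual)]


# A fixed world-frame end-effector target.
target_w = (1.0, 0.0)

# Nominal base pose versus a base that has drifted while walking.
nominal_cmd = world_to_base(target_w, (0.0, 0.0, 0.0))
drifted_cmd = world_to_base(target_w, (0.05, -0.02, 0.0))

# A body-frame command would stay at nominal_cmd and miss the target by
# the drift; the re-expressed command shifts to compensate for it.
print(nominal_cmd, drifted_cmd)
print(residual_action([0.1, 0.2], [1.0, -1.0]))
```

In this toy setting the body-frame command is unchanged by drift, so the end-effector error in the world frame equals the base drift, whereas the world-frame formulation absorbs the drift into the commanded offset.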
