HiWET: Hierarchical World-Frame End-Effector Tracking for Long-Horizon Humanoid Loco-Manipulation

Zhanxiang Cao, Liyun Yan, Yang Zhang, Sirui Chen, Jianming Ma, Tianyue Zhan, Shengcheng Fu, Yufei Jia, Cewu Lu, Yue Gao

TL;DR

This paper tackles long-horizon humanoid loco-manipulation by formulating end-effector tracking in the world frame and solving it with a hierarchical RL framework. A high-level world-frame policy plans base motion and end-effector targets, while a low-level tracker executes commands with dynamic stability, guided by a pretrained Kinematic Manifold Prior to stay within valid manipulation configurations. The approach demonstrates precise world-frame tracking in simulation (about 12–13 mm) and robust zero-shot sim-to-real transfer on hardware, outperforming baselines and ablations in both tracking and navigation tasks. By explicitly separating global reasoning from local execution and grounding upper-body motion with KMP, HiWET provides a scalable solution for long-horizon humanoid loco-manipulation with practical implications for robust assistive robots and autonomous manipulation in dynamic environments.

Abstract

Humanoid loco-manipulation requires executing precise manipulation tasks while maintaining dynamic stability amid base motion and impacts. Existing approaches typically formulate commands in body-centric frames and so fail to inherently correct the cumulative world-frame drift induced by legged locomotion. We reformulate the problem as world-frame end-effector tracking and propose HiWET, a hierarchical reinforcement learning framework that decouples global reasoning from dynamic execution. The high-level policy generates subgoals that jointly optimize end-effector accuracy and base positioning in the world frame, while the low-level policy executes these commands under stability constraints. We introduce a Kinematic Manifold Prior (KMP) that embeds the manipulation manifold into the action space via residual learning, reducing exploration dimensionality and mitigating kinematically invalid behaviors. Extensive simulation and ablation studies demonstrate that HiWET achieves precise and stable end-effector tracking in long-horizon world-frame tasks. We validate zero-shot sim-to-real transfer of the low-level policy on a physical humanoid, demonstrating stable locomotion under diverse manipulation commands. These results indicate that explicit world-frame reasoning combined with hierarchical control provides an effective and scalable solution for long-horizon humanoid loco-manipulation.
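
To make the drift argument concrete, the sketch below re-expresses a fixed world-frame end-effector target in the robot's base frame. It is only an illustration, not the paper's implementation: the function name `world_to_base`, the yaw-only base orientation, and the numeric values are all assumptions. It shows that when the base drifts, the base-frame command must change for the end-effector to stay on the same world-frame point.

```python
import math

def world_to_base(target_w, base_pos_w, base_yaw):
    """Express a world-frame point in the base frame.

    Assumption (not from the paper): the base orientation is reduced to a
    yaw angle about the world z-axis; roll and pitch are ignored.
    """
    dx = target_w[0] - base_pos_w[0]
    dy = target_w[1] - base_pos_w[1]
    dz = target_w[2] - base_pos_w[2]
    c, s = math.cos(base_yaw), math.sin(base_yaw)
    # Apply R_z(yaw)^T to the world-frame offset.
    return (c * dx + s * dy, -s * dx + c * dy, dz)

# Hypothetical numbers: a target 1 m ahead of the nominal base position.
target = (1.0, 0.0, 1.0)
nominal = world_to_base(target, (0.0, 0.0, 0.0), 0.0)
# After the base has drifted 0.1 m forward, the same world-frame target
# corresponds to a different base-frame command.
drifted = world_to_base(target, (0.1, 0.0, 0.0), 0.0)
```

A controller that keeps replaying the `nominal` base-frame command after the drift would miss the world-frame target by the drift distance, which is the failure mode the world-frame formulation is meant to remove.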

Paper Structure (41 sections, 17 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: HiWET capabilities in simulation and real-world deployment. (a)-(c) Whole-body redundancy exploitation for diverse reaching tasks (left: simulation, right: real robot): (a) lowest and farthest, (b) highest, (c) outermost in semi-squat posture. (d) Sim-to-sim transfer to MuJoCo with world-frame trajectory tracking (red circle: target, green: actual). (e)-(f) Real-world long-exposure experiments: (e) square trajectory, (f) circular trajectory, where red curves are traced by an LED attached to the end-effector.
  • Figure 2: HiWET architecture and two-stage training procedure. Stage 1: Tracker (blue): The tracking policy learns to follow base-relative end-effector commands. Commands are sampled from a mixture of uniform random sampling and an importance-sampled dataset filtered by IK error and manipulability. A pretrained KMP provides upper-body kinematic references, which are refined by residual actions from the Actor. The History Encoder extracts temporal context, while the State Estimator reconstructs privileged information via an auxiliary estimation loss. Stage 2: Commander (orange): The command policy translates world-frame end-effector (EEF) targets (EEF Cmd. W.) and base pose into base-frame subgoals (Base Cmd. B. and EEF Cmd. B.).
  • Figure 3: Tracking success rate and mean position error across diverse geometric trajectories. HiWET demonstrates improved performance and consistency compared to its ablated versions (w/ Fixed $\alpha$, w/o State Est., w/o KMP), particularly in complex tasks like tracing stars and hearts.
  • Figure 4: Qualitative comparison of 3D world-frame trajectory tracking for heart and star shapes. The color gradient indicates the instantaneous Cartesian position error (blue: $<2$ mm, red: $>20$ mm). HiWET (first column) maintains high spatial consistency, while fixing $\alpha=1.0$ (w/ Fixed $\alpha$), removing privileged estimation (w/o State Est.), or removing the kinematic prior (w/o KMP) leads to varying degrees of deviation and oscillatory patterns. Additional geometric trajectories (circle, spiral, rectangle) are provided in the Appendix.
  • Figure 5: Analysis of base mobility and positioning precision. (Left) Top-view trajectories for 8-directional base repositioning tasks. (Right) Evaluation of final base positioning error in the $xy$-plane at the target location. HiWET provides the most stable trajectories and the highest positioning accuracy among all variants (w/ Fixed $\alpha$, w/o State Est., w/o KMP).
  • ...and 2 more figures
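
The Figure 2 caption describes upper-body references from the pretrained KMP being refined by residual actions from the Actor. A minimal sketch of that kind of residual composition follows. The function name `compose_action`, the `residual_scale` factor, and the joint-limit clipping are illustrative assumptions, not details given in the source.

```python
def compose_action(q_ref, residual, q_min, q_max, residual_scale=0.25):
    """Add a scaled residual to a reference joint target, per joint.

    q_ref:     reference joint positions (here standing in for a KMP output)
    residual:  raw residual action from the policy
    q_min/max: joint limits used for clipping (an assumption of this sketch)
    """
    command = []
    for ref, res, lo, hi in zip(q_ref, residual, q_min, q_max):
        q = ref + residual_scale * res
        # Clip to the joint range so the command stays feasible.
        command.append(min(max(q, lo), hi))
    return command

# Hypothetical two-joint example.
q_cmd = compose_action(
    q_ref=[0.0, 1.0],
    residual=[1.0, 4.0],
    q_min=[-1.0, -1.0],
    q_max=[1.0, 1.5],
)
```

Because the policy only outputs a correction around the reference, a zero residual reproduces the reference exactly, which is one way to read the abstract's claim that the prior reduces exploration dimensionality.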