HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, Ziluo Ding, Zhiyuan Xu, Lei Sun, Shanghang Zhang, Zhengping Che, Jian Tang, Badong Chen

Abstract

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
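The flow-matching action head mentioned above is not specified further on this page. As a point of reference only, here is a minimal PyTorch sketch of a generic flow-matching head for action-chunk generation; the MLP velocity network, the linear noise-to-data interpolation path, and all names (FlowMatchingActionHead, n_steps, ...) are illustrative assumptions, not HEX's actual implementation.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Generic flow-matching head: denoise a chunk of future actions by
    integrating a learned velocity field (a sketch, not HEX's code)."""

    def __init__(self, action_dim: int, horizon: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        # Hypothetical velocity network: takes (flat actions, condition, time).
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + cond_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def velocity(self, a_t: torch.Tensor, t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Predict the velocity field v(a_t, t | cond).
        return self.net(torch.cat([a_t, cond, t], dim=-1))

    def loss(self, a_1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Linear path a_t = (1 - t) * noise + t * data; the regression
        # target is the constant path velocity (data - noise).
        flat = a_1.flatten(1)
        noise = torch.randn_like(flat)
        t = torch.rand(flat.shape[0], 1, device=flat.device)
        a_t = (1 - t) * noise + t * flat
        return ((self.velocity(a_t, t, cond) - (flat - noise)) ** 2).mean()

    @torch.no_grad()
    def sample(self, cond: torch.Tensor, n_steps: int = 10) -> torch.Tensor:
        # Euler integration from noise (t = 0) to actions (t = 1).
        a = torch.randn(cond.shape[0], self.horizon * self.action_dim, device=cond.device)
        dt = 1.0 / n_steps
        for i in range(n_steps):
            t = torch.full((cond.shape[0], 1), i * dt, device=cond.device)
            a = a + dt * self.velocity(a, t, cond)
        return a.view(-1, self.horizon, self.action_dim)
```

In this reading, the condition vector would carry the fused visual-language and proprioceptive features, and the integrated sample is the high-level action chunk over the prediction horizon.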

Figures (10)

  • Figure 1: Overview of HEX. (a) HEX is, to the best of our knowledge, the first whole-body VLA framework for full-sized bipedal humanoid robots, pretrained on diverse cross-embodiment humanoid trajectory data. (b) HEX combines a high-level VLA module with a low-level whole-body controller for coordinated action generation and balance-preserving execution. (c) We evaluate HEX on Tienkung 2.0 and Tienkung 3.0 across whole-body, long-horizon, and fast-reaction tasks, demonstrating strong performance across diverse manipulation scenarios.
  • Figure 2: Overview of the proposed high-level VLA policy in HEX. Given a language instruction $L$, the current visual observation $V_t$, and a history query token $Q_t$, the VLM encodes visual-language context together with lightweight temporal review cues summarized in a history cache. In parallel, humanoid-aligned proprioceptive states are organized into structured part-aware tokens and processed by a MoE-based Unified Proprioceptive Predictor, which captures whole-body interactions and forecasts future state dynamics. The resulting visual-language and predictive proprioceptive features are then integrated by the HEX Action Expert through adaptive fusion for action generation, producing task-relevant high-level actions over the prediction horizon. (A minimal sketch of such a history cache is given after the figure list.)
  • Figure 3: Left and middle: Unified Proprioceptive Predictor (UPP). Morphology-based proprioceptive states are first mapped into canonical body-part tokens and augmented with learnable future query tokens. These spatio-temporal tokens are processed by a shared transformer backbone sandwiched by morphology-aware MoE adaptation modules, yielding future proprioceptive latents $\mathbf{H}^{p}$. The middle panel details the morphology-aware MoE, where flattened part-time tokens are routed by a learned top-$k$ gate to a set of routed experts, while a shared expert provides a common transformation across all tokens. This design enables token-wise specialization for embodiment- and part-dependent variations while preserving reusable dynamics across embodiments. Right: Action Expert. Noisy action tokens are encoded and conditioned on both visual-language features $\mathbf{H}^{VL}$ and predicted proprioceptive features $\mathbf{H}^{p}$ through dual cross-attention. A learned gate adaptively injects the state branch on top of the visual-language branch, followed by self-attention and feed-forward refinement. The resulting denoised features are decoded into high-level actions for arm and hand control and for downstream whole-body execution. (Minimal sketches of this routing and of the gated fusion are given after the figure list.)
  • Figure 4: Real-robot teleoperation data collection setup.
  • Figure 5: Generalization tasks. Two distribution-shift variants for each of four seen tasks: Pose Mimic, Pouring, Box Carry, and Kneel Pick.
  • ...and 5 more figures
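Figure 2's history cache is described only at caption level. The sketch below shows one plausible way to summarize each incoming frame into a single token with a learnable history query and keep a bounded rolling cache, so past images are never re-encoded at inference time; HistoryTokenCache, max_len, and the one-token-per-step summary are hypothetical choices, not the paper's API.

```python
import torch
import torch.nn as nn

class HistoryTokenCache(nn.Module):
    """Keep compact summary tokens of past steps so old frames are never
    re-encoded (a sketch of the idea, not HEX's implementation)."""

    def __init__(self, dim: int, max_len: int = 16, n_heads: int = 8):
        super().__init__()
        self.max_len = max_len
        self.query = nn.Parameter(torch.randn(1, 1, dim))   # learnable history query token
        self.summarize = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cache: list[torch.Tensor] = []                 # one summary token per past step

    def step(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (1, N, D) visual tokens of the *current* frame only.
        summary, _ = self.summarize(self.query, frame_tokens, frame_tokens)  # (1, 1, D)
        self.cache.append(summary.detach())                 # cache; never re-encode this frame
        self.cache = self.cache[-self.max_len:]             # bounded temporal window
        return torch.cat(self.cache, dim=1)                 # (1, <=max_len, D) history context
```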
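For the morphology-aware MoE in Figure 3 (middle), the routing pattern the caption describes, a learned top-$k$ gate over routed experts plus an always-on shared expert, can be written compactly as below; the expert width, the renormalization of selected gate weights, and all names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _mlp(dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class MorphologyAwareMoE(nn.Module):
    """Shared expert applied to every token plus top-k routed experts per
    token (a sketch of the captioned design, not HEX's code)."""

    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts)               # learned router
        self.experts = nn.ModuleList(_mlp(dim) for _ in range(n_experts))
        self.shared = _mlp(dim)                             # common transformation, always on

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.shape[-1])                   # flatten part-time tokens
        weights = F.softmax(self.gate(flat), dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)        # renormalize selected weights
        routed = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                sel = topi[:, k] == e                       # tokens sent to expert e in slot k
                if sel.any():
                    routed[sel] += topw[sel, k:k + 1] * expert(flat[sel])
        return (self.shared(flat) + routed).view_as(x)
```

The split matches the caption's stated intent: routed experts give token-wise specialization for embodiment- and part-dependent variation, while the shared expert carries dynamics reusable across embodiments.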
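Finally, the Action Expert's residual-gated fusion (Figure 3, right) conditions noisy action tokens on two branches via dual cross-attention and injects the state branch through a learned gate. A minimal sketch follows, assuming a sigmoid token-wise gate and post-norm residuals; neither detail is confirmed by the caption, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusionBlock(nn.Module):
    """Dual cross-attention with a learned residual gate on the state branch,
    then self-attention and feed-forward refinement (a hedged sketch)."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.xattn_vl = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.xattn_state = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, a: torch.Tensor, h_vl: torch.Tensor, h_p: torch.Tensor) -> torch.Tensor:
        # a: noisy action tokens (B, T, D); h_vl / h_p: visual-language and
        # predicted proprioceptive features from the two conditioning branches.
        vl, _ = self.xattn_vl(a, h_vl, h_vl)                # visual-language branch
        st, _ = self.xattn_state(a, h_p, h_p)               # predictive state branch
        g = self.gate(torch.cat([vl, st], dim=-1))          # token-wise gate in (0, 1)
        x = self.norm1(a + vl + g * st)                     # state injected on top of VL branch
        sa, _ = self.self_attn(x, x, x)
        x = self.norm2(x + sa)
        return self.norm3(x + self.ffn(x))                  # refined features for decoding
```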