Universal Humanoid Motion Representations for Physics-Based Control
Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu
TL;DR
This paper introduces PULSE, a universal, physics-grounded motion representation for humanoid control. It achieves broad motion coverage by distilling a large-scale motion imitator into a probabilistic latent space, using a variational information bottleneck and a prior conditioned on proprioception; sampling from this prior yields long-horizon, realistic motion. The learned prior, together with a residual action formulation, makes hierarchical RL more efficient, so the same latent space can drive diverse generative and motion-tracking tasks with faster training and human-like behavior. Experiments on AMASS-derived data and VR-tracking scenarios show improved imitation quality, generation diversity, and sample efficiency over state-of-the-art latent-space methods, while maintaining physical plausibility. The authors position PULSE as a foundation model for control that allows rich motor skills to be reused across a wide range of humanoid tasks.
Abstract
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control. Due to the high dimensionality of humanoids and the inherent difficulties in reinforcement learning, prior methods have focused on learning skill embeddings for a narrow range of movement styles (e.g. locomotion, game characters) from specialized motion datasets. This limited scope hampers their applicability in complex tasks. We close this gap by significantly increasing the coverage of our motion representation space. To achieve this, we first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset. We then create our motion representation by distilling skills directly from the imitator. This is achieved by using an encoder-decoder structure with a variational information bottleneck. Additionally, we jointly learn a prior conditioned on proprioception (humanoid's own pose and velocities) to improve model expressiveness and sampling efficiency for downstream tasks. By sampling from the prior, we can generate long, stable, and diverse human motions. Using this latent space for hierarchical RL, we show that our policies solve tasks using human-like behavior. We demonstrate the effectiveness of our motion representation by solving generative tasks (e.g. strike, terrain traversal) and motion tracking using VR controllers.
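The abstract describes a variational information bottleneck paired with a prior conditioned on proprioception. A minimal sketch of the two quantities this implies is given below, assuming diagonal Gaussian distributions for both the encoder posterior and the prior: the KL term that would pull the posterior toward the prior, and a reparameterized draw of a latent code from the prior. The function names, the linear "heads", and all dimensions are hypothetical and chosen for illustration; this is not the paper's implementation.

```python
import numpy as np


def gaussian_head(x, w_mu, w_log_std):
    """Hypothetical linear head mapping an input to a diagonal Gaussian.

    Returns the mean and log standard deviation, each of shape
    (batch, latent_dim). A real model would use a deeper network.
    """
    return x @ w_mu, x @ w_log_std


def kl_diag_gaussians(mu_q, log_std_q, mu_p, log_std_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dims."""
    var_q = np.exp(2.0 * log_std_q)
    var_p = np.exp(2.0 * log_std_p)
    kl = (
        log_std_p
        - log_std_q
        + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
        - 0.5
    )
    return kl.sum(axis=-1)


def sample_latent(mu, log_std, rng):
    """Reparameterized sample z = mu + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_std) * eps


# Toy usage with made-up sizes (assumptions, not values from the paper).
rng = np.random.default_rng(0)
batch, proprio_dim, goal_dim, latent_dim = 4, 16, 8, 32

proprio = rng.standard_normal((batch, proprio_dim))  # pose and velocities
goal = rng.standard_normal((batch, goal_dim))        # imitation target

# Encoder sees proprioception and the target; the prior sees proprioception only.
enc_in = np.concatenate([proprio, goal], axis=-1)
enc_w_mu = 0.1 * rng.standard_normal((proprio_dim + goal_dim, latent_dim))
enc_w_ls = 0.1 * rng.standard_normal((proprio_dim + goal_dim, latent_dim))
pri_w_mu = 0.1 * rng.standard_normal((proprio_dim, latent_dim))
pri_w_ls = 0.1 * rng.standard_normal((proprio_dim, latent_dim))

mu_q, ls_q = gaussian_head(enc_in, enc_w_mu, enc_w_ls)
mu_p, ls_p = gaussian_head(proprio, pri_w_mu, pri_w_ls)

kl_per_sample = kl_diag_gaussians(mu_q, ls_q, mu_p, ls_p)  # bottleneck penalty
z_from_prior = sample_latent(mu_p, ls_p, rng)              # code for generation
```

In this sketch the prior never sees the imitation target, so a latent drawn from it depends only on the humanoid's current state. That is the property the abstract relies on when it describes generating motion by sampling from the prior.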
